Re: [PATCH 3/6] writeback: remove pages_skipped accounting in __block_write_full_page()
On Mon, Aug 13, 2007 at 06:30:00PM +0800, Fengguang Wu wrote:
> > On Sun, Aug 12, 2007 at 05:11:23PM +0800, Fengguang Wu wrote:
> > > Miklos Szeredi <[EMAIL PROTECTED]> and me identified a writeback bug:
> > > Basically they are
> > > - during the dd: ~16M
> > > - after 30s:     ~4M
> > > - after 5s:      ~4M
> > > - after 5s:      ~176M
> > >
> > > The box has 2G memory.
> > >
> > > Question 1:
> > > How come the 5s delays? I ran 4 tests in total, 2 of which have such
> > > 5s delays.
> >
> > pdflush runs every five seconds, so that is indicative of the inode being
> > written once for 1024 pages, and then delayed to the next pdflush run 5s
> > later. Perhaps the inodes aren't moving between the lists exactly the way
> > you think they are...
>
> Now I have figured out the exact situation. When the scan of s_io finishes
> with some small inodes, nr_to_write will be positive, fooling kupdate
> into quitting prematurely. But in fact the big dirty file is on s_more_io
> waiting for more io... The attached patch fixes it.
>
> Fengguang
> ===
>
> Subject: writeback: introduce writeback_control.more_io to indicate more io
>
> After making dirty a 100M file, the normal behavior is to
> start the writeback for all data after a 30s delay. But
> sometimes the following happens instead:
>
> - after 30s: ~4M
> - after 5s:  ~4M
> - after 5s:  all remaining 92M
>
> Some analysis shows that the internal io dispatch queues go like this:
>
>         s_io        s_more_io
>         ---------------------
> 1)      100M,1K     0
> 2)      1K          96M
> 3)      0           96M
>
> 1) initial state with a 100M file and a 1K file
> 2) 4M written, nr_to_write <= 0, so write more
> 3) 1K written, nr_to_write > 0, no more writes (BUG)
>
> nr_to_write > 0 in 3) fools the upper layer into thinking that data have
> all been written out. But the big dirty file is still sitting in s_more_io.
> We cannot simply splice s_more_io back to s_io as soon as s_io becomes
> empty, and let the loop in generic_sync_sb_inodes() continue: this may
> starve newly expired inodes in s_dirty.
It is also not an option to draw inodes from both s_more_io and
> s_dirty, and let the loop go on: this might lead to livelocks, and might
> also starve other superblocks in sync time (well, kupdate may still starve
> some superblocks, that's another bug).
>
> So we have to return when a full scan of s_io completes. So nr_to_write > 0
> does not necessarily mean that "all data are written". This patch
> introduces a flag writeback_control.more_io to indicate this situation.
> With it the big dirty file no longer has to wait for the next kupdate
> invocation 5s later.

Sorry, this patch is found to be dangerous. It locks up my desktop on heavy
I/O: kupdate *immediately* returns to push the file in s_more_io for
writeback, but it *could* still be unable to make progress (locks etc.).
Now kupdate ends up *busy looping*. That could be fixed by waiting for, say,
100ms and retrying the io. Should we do it? (Or: is the 5s interval
considered too long a wait?)

> Signed-off-by: Fengguang Wu <[EMAIL PROTECTED]>
> ---
>  fs/fs-writeback.c         |    2 ++
>  include/linux/writeback.h |    1 +
>  mm/page-writeback.c       |    9 ++---
>  3 files changed, 9 insertions(+), 3 deletions(-)
>
> --- linux-2.6.23-rc2-mm2.orig/fs/fs-writeback.c
> +++ linux-2.6.23-rc2-mm2/fs/fs-writeback.c
> @@ -560,6 +560,8 @@ int generic_sync_sb_inodes(struct super_
>  		if (wbc->nr_to_write <= 0)
>  			break;
>  	}
> +	if (!list_empty(&sb->s_more_io))
> +		wbc->more_io = 1;
>  	spin_unlock(&inode_lock);
>  	return ret;	/* Leave any unwritten inodes on s_io */
>  }
> ---
linux-2.6.23-rc2-mm2.orig/mm/page-writeback.c
> +++ linux-2.6.23-rc2-mm2/mm/page-writeback.c
> @@ -382,6 +382,7 @@ static void background_writeout(unsigned
>  		    global_page_state(NR_UNSTABLE_NFS) < background_thresh
>  		    && min_pages <= 0)
>  			break;
> +		wbc.more_io = 0;
>  		wbc.encountered_congestion = 0;
>  		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
>  		wbc.pages_skipped = 0;
> @@ -389,8 +390,9 @@ static void background_writeout(unsigned
>  		min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
>  		if (wbc.nr_to_write > 0 ||
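The bug described above can be illustrated outside the kernel. Below is a
userspace C sketch of the s_io/s_more_io bookkeeping (toy types and names,
not the real kernel structures): it reproduces state 3), where nr_to_write
stays positive yet dirty data remains parked, and shows how a more_io flag
exposes that to the caller.

```c
#include <assert.h>

#define MAX_WRITEBACK_PAGES 1024

/* Toy model of a superblock's two writeback queues: each "inode" is
 * just a count of dirty pages (hypothetical simplified layout). */
struct sb_model {
	long s_io[8];		/* inodes queued for this pass */
	int  n_io;
	long s_more_io[8];	/* inodes with more io pending */
	int  n_more_io;
};

/* One sync pass, roughly as in generic_sync_sb_inodes().
 * Returns the proposed wbc.more_io flag. */
static int sync_pass(struct sb_model *sb, long *nr_to_write)
{
	int i;

	for (i = 0; i < sb->n_io && *nr_to_write > 0; i++) {
		long chunk = sb->s_io[i] < *nr_to_write ?
			     sb->s_io[i] : *nr_to_write;

		sb->s_io[i] -= chunk;
		*nr_to_write -= chunk;
		if (sb->s_io[i] > 0)	/* not finished: park on s_more_io */
			sb->s_more_io[sb->n_more_io++] = sb->s_io[i];
	}
	/* any unvisited inodes also stay pending */
	for (; i < sb->n_io; i++)
		sb->s_more_io[sb->n_more_io++] = sb->s_io[i];
	sb->n_io = 0;

	return sb->n_more_io > 0;	/* dirty data still waiting? */
}
```

With a 1-page inode on s_io and a big inode already on s_more_io, the pass
leaves nr_to_write positive while more_io is set, which is exactly the
state that used to fool kupdate into quitting.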
Re: [PATCH 002 of 6] Introduce rq_for_each_segment replacing rq_for_each_bio
On Thu, 16 Aug 2007, NeilBrown wrote:
> Every usage of rq_for_each_bio wraps a usage of
> bio_for_each_segment, so these can be combined into
> rq_for_each_segment.
>
> We define "struct req_iterator" to hold the 'bio' and 'index' that
> are needed for the double iteration.

> --- .prev/drivers/block/ps3disk.c	2007-08-16 20:37:26.0 +1000
> +++ ./drivers/block/ps3disk.c	2007-08-16 20:50:07.0 +1000
> @@ -91,30 +91,30 @@ static void ps3disk_scatter_gather(struc
>  			struct request *req, int gather)
>  {
>  	unsigned int offset = 0;
> -	struct bio *bio;
> -	sector_t sector;
> +	struct req_iterator iter;
>  	struct bio_vec *bvec;
> -	unsigned int i = 0, j;
> +	unsigned int i = 0;
>  	size_t size;
>  	void *buf;
>
> -	rq_for_each_bio(bio, req) {
> -		sector = bio->bi_sector;
> +	rq_for_each_segment(bvec, req, iter) {
> +		unsigned long flags;
>  		dev_dbg(&dev->sbd.core,
>  			"%s:%u: bio %u: %u segs %u sectors from %lu\n",
> -			__func__, __LINE__, i, bio_segments(bio),
> -			bio_sectors(bio), sector);
> -		bio_for_each_segment(bvec, bio, j) {
> +			__func__, __LINE__, i, bio_segments(iter.bio),
> +			bio_sectors(iter.bio),
> +			(unsigned long)iter.bio->bi_sector);
			^^^ Superfluous cast: PS3 is 64-bit only, and casts
			are evil.
> +
>  		size = bvec->bv_len;
> -		buf = __bio_kmap_atomic(bio, j, KM_IRQ0);
> +		buf = bvec_kmap_irq(bvec, flags);

As you already mentioned in the preamble of the patch series, this changes
functionality.
>  		if (gather)
>  			memcpy(dev->bounce_buf+offset, buf, size);
>  		else
>  			memcpy(buf, dev->bounce_buf+offset, size);
>  		offset += size;
>  		flush_kernel_dcache_page(bio_iovec_idx(bio,
>  						       j)->bv_page);
> -		__bio_kunmap_atomic(bio, KM_IRQ0);
> -	}
> +		bio_kunmap_bvec(bvec, flags);
> +
>  		i++;
>  	}
>  }

With kind regards,

Geert Uytterhoeven
Software Architect
Sony Network and Software Technology Center Europe
The Corporate Village · Da Vincilaan 7-D1 · B-1935 Zaventem · Belgium

Phone: +32 (0)2 700 8453
Fax:   +32 (0)2 700 8622
E-mail: [EMAIL PROTECTED]
Internet: http://www.sony-europe.com/

Sony Network and Software Technology Center Europe
A division of Sony Service Centre (Europe) N.V.
Registered office: Technologielaan 7 · B-1840 Londerzeel · Belgium
VAT BE 0413.825.160 · RPR Brussels
Fortis Bank Zaventem · Swift GEBABEBB08A · IBAN BE39001382358619
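For readers unfamiliar with the macro being introduced: the collapse of
rq_for_each_bio + bio_for_each_segment into one iterator can be modeled in
plain C. The types below are toy stand-ins (not the real struct bio /
struct request), but the nested-for shape of the macro matches the idea of
the patch.

```c
#include <assert.h>

/* Toy stand-ins for the block-layer types (hypothetical, simplified). */
struct bio_vec { int bv_len; };
struct bio {
	struct bio_vec vecs[4];
	int nvecs;
	struct bio *bi_next;	/* bios of a request form a chain */
};
struct request { struct bio *bio; };

/* Holds the 'bio' and segment index needed for the double iteration,
 * in the spirit of the patch's struct req_iterator. */
struct req_iterator {
	struct bio *bio;
	int i;
};

/* Two nested for-loops hidden behind one macro: walk every segment
 * of every bio of the request. */
#define rq_for_each_segment(bvec, rq, iter)				  \
	for ((iter).bio = (rq)->bio; (iter).bio;			  \
	     (iter).bio = (iter).bio->bi_next)				  \
		for ((iter).i = 0;					  \
		     (iter).i < (iter).bio->nvecs &&			  \
			((bvec) = &(iter).bio->vecs[(iter).i], 1);	  \
		     (iter).i++)
```

A caller's loop body then sees one bvec at a time, with iter.bio available
when per-bio information (like bi_sector in the ps3disk hunk) is needed.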
[Announce] RHEL5 LSPP/EAL4 Certification Testsuite has been released
Dear All, The Red Hat Enterprise Linux 5 LSPP/EAL4 Certification Testsuite has been released and can be found at http://ltp.sourceforge.net/ or http://sourceforge.net/project/showfiles.php?group_id=3382. You can find more details about this testsuite from 'lspp/README' contained in the package. Regards & Thanks-- Subrata Modak - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/23] per device dirty throttling -v9
On Thu, 2007-08-16 at 14:29 -0700, Christoph Lameter wrote:
> Is there any way to make the global limits on which the dirty rate
> calculations are based cpuset specific?
>
> A process is part of a cpuset and that cpuset has only a fraction of
> memory of the whole system.
>
> And only a fraction of that fraction can be dirtied. We do not currently
> enforce such limits which can cause the amount of dirty pages in
> cpusets to become excessively high. I have posted several patchsets that
> deal with that issue. See http://lkml.org/lkml/2007/1/16/5
>
> It seems that limiting dirty pages in cpusets may be much easier to
> realize in the context of this patchset. The tracking of the dirty pages
> per node is not necessary if one would calculate the maximum amount of
> dirtyable pages in a cpuset and use that as a base, right?

Currently we do:

  dirty = total_dirty * bdi_completions_p * task_dirty_p

As dgc pointed out before, there is the issue of bdi/task correlation, that
is, we do not track task dirty rates per bdi, so now a task that heavily
dirties on one bdi will also get penalised on the others (and similar
issues).

If we were to change it so:

  dirty = cpuset_dirty * bdi_completions_p * task_dirty_p

We get additional correlation issues: cpuset/bdi, cpuset/task. Which could
yield surprising results if some bdis are strictly per cpuset.

The cpuset/task correlation has a strict mapping and could be solved by
keeping the vm_dirties counter per cpuset. However, this would seriously
complicate the code and I'm not sure if it would gain us much.

Anyway, things to ponder. But overall it should be quite doable.
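For illustration, the two formulas differ only in the base term. A
fixed-point sketch (all numbers hypothetical; the kernel's proportion code
is considerably more involved):

```c
#include <assert.h>

/* Proportions scaled by 1024 (parts-per-1024) to avoid floating point,
 * loosely in the style of kernel fixed-point proportion code. */
#define PROP_SHIFT 10

/* dirty = base * bdi_completions_p * task_dirty_p
 * 'base' is total_dirty in the current scheme, or cpuset_dirty in the
 * cpuset-aware variant discussed above. */
static long task_limit(long base_dirty, long bdi_completions_p,
		       long task_dirty_p)
{
	return (base_dirty * bdi_completions_p >> PROP_SHIFT)
		* task_dirty_p >> PROP_SHIFT;
}
```

Swapping a 200000-page global base for a 50000-page cpuset base, with the
same bdi (50%) and task (25%) proportions, scales the limit down by the
same factor as the cpuset's share of memory.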
Re: Fork Bombing Patch
Petr wrote:
> Please do not add comments inside functions.

I find this advice a bit odd. I am not aware of any prohibition of
comments inside functions.

As with comments outside functions, they should serve a worthwhile
purpose, of course. One might debate whether this particular comment added
by Anand was sufficiently valuable to be worth having.

But I don't agree to a blanket prohibition on comments inside functions.

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401
Re: [PATCH 000 of 6] A few block-layer tidy-up patches.
On Friday August 17, [EMAIL PROTECTED] wrote:
> > > Please inspect the #block-2.6.24 branch to see the result.
> >
> > I don't know where to look for this. I checked
> > http://git.kernel.org/?p=linux/kernel/git/axboe/linux-2.6-block.git
> > but they don't seem to be there.
>
> ??
>
> That's where it is, but the kernel.org mirroring is just horribly slow.
> Just checked now and it's there.

I discovered I was looking in the wrong place - not being very familiar
with git terminology. I found them and it looks right.

I had a bit of a look at removing bio_data and ->buffer ... the usages
outside of drivers/ide are fairly easy to deal with - I might send a patch
for that. The drivers/ide stuff looks like a fairly substantial rewrite is
in order. e.g. idefloppy_packet_command_s seems to duplicate a lot of
fields from 'struct request', and it should probably use the request struct
directly. But a number of times ->buffer points to ->cmd, and there is no
bio. I guess we should use bio_map_kern to make a bio?

I'll see if I can come up with something, though testing it might be
awkward. I have an ide cdrom I can test on. Maybe an ide disk, but not an
ide floppy :-)

NeilBrown
Re: nanosleep() accuracy
GolovaSteek wrote:
> Hello!
> I need to use sleep with accurate timing.
> I use 2.6.21 with the rt-preempt patch,
> with rt_preempt, dyn_ticks, and local_apic enabled.
> But
>
>   req.tv_nsec = 300000;
>   req.tv_sec = 0;
>   nanosleep(&req,NULL)
>
> makes a pause of around 310-330 microseconds.

How do you measure this?
If you want to have something done every 300 microseconds, you must not
sleep for 300 microseconds in each iteration, because you'd accumulate
errors. Use a periodic timer or use the current time to compute how long
to sleep in each iteration. Take a look at how cyclictest does it.

> I tried to understand how nanosleep() works, but it doesn't depend on
> jiffies or on smp_apic_timer_interrupt.
>
> When can accuracy be lost?
> And how are processes woken up?
>
> GolovaSteek

Don't forget the process will always have non-zero wakeup latency. It
takes some time to process an interrupt, wake up the process and schedule
it to run on the CPU. 10-30 microseconds is not unreasonable.

Michal
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Satyam Sharma wrote:

  #define atomic_read_volatile(v)	\
  ({					\
  	forget((v)->counter);		\
  	((v)->counter);			\
  })

where:

  #define forget(a)	__asm__ __volatile__ ("" :"=m" (a) :"m" (a))

*vomit* :)

Not only do I hate the keyword volatile, but the barrier is only a
one-sided affair so it's probable this is going to have slightly different
allowed reorderings than a real volatile access.

Also, why would you want to make these insane accessors for atomic_t
types? Just make sure everybody knows the basics of barriers, and they can
apply that knowledge to atomic_t and all other lockless memory accesses as
well.

I like order(x) better, but it's not the most perfect name either.

-- 
SUSE Labs, Novell Inc.
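For readers following along, the forget() clobber above can be tried in
isolation. A userspace sketch (gcc-specific inline asm) of its intended
use: the empty asm tells the compiler that the memory backing the variable
may have changed, so a cached register copy must not be reused across it.

```c
#include <assert.h>

/* Satyam's forget(): an empty asm with the variable as both memory
 * output and input. gcc must assume the memory was rewritten. */
#define forget(a) __asm__ __volatile__("" : "=m"(a) : "m"(a))

static int flag;

/* Without the forget(), gcc would be entitled to fold the two loads
 * of 'flag' into one; with it, each read hits memory again.
 * (Note Nick's caveat: this orders only accesses to 'flag' itself,
 * a one-sided affair, not a full barrier.) */
static int read_twice(void)
{
	int a = flag;
	forget(flag);
	int b = flag;
	return a + b;
}
```

The functional result is of course the same either way in single-threaded
code; the difference only matters when another context can modify the
variable between the two reads.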
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
On Thu, 16 Aug 2007, Paul E. McKenney wrote:
> On Fri, Aug 17, 2007 at 07:59:02AM +0800, Herbert Xu wrote:
> > On Thu, Aug 16, 2007 at 09:34:41AM -0700, Paul E. McKenney wrote:
> > >
> > > The compiler can also reorder non-volatile accesses. For an example
> > > patch that cares about this, please see:
> > >
> > > http://lkml.org/lkml/2007/8/7/280
> > >
> > > This patch uses an ORDERED_WRT_IRQ() in rcu_read_lock() and
> > > rcu_read_unlock() to ensure that accesses aren't reordered with
> > > respect to interrupt handlers and NMIs/SMIs running on that same CPU.
> >
> > Good, finally we have some code to discuss (even though it's
> > not actually in the kernel yet).
>
> There was some earlier in this thread as well.

Hmm, I never quite got what all this interrupt/NMI/SMI handling and RCU
business you mentioned earlier was all about, but now that you've pointed
to the actual code and issues with it ...

> > First of all, I think this illustrates that what you want
> > here has nothing to do with atomic ops. The ORDERED_WRT_IRQ
> > macro occurs a lot more times in your patch than atomic
> > reads/sets. So *assuming* that it was necessary at all,
> > then having an ordered variant of the atomic_read/atomic_set
> > ops could do just as well.
>
> Indeed. If I could trust atomic_read()/atomic_set() to cause the compiler
> to maintain ordering, then I could just use them instead of having to
> create an ORDERED_WRT_IRQ(). (Or ACCESS_ONCE(), as it is called in a
> different patch.)

  +#define WHATEVER(x)	(*(volatile typeof(x) *)&(x))

I suppose one could want volatile access semantics for stuff that's a
bit-field too, no?

Also, this gives *zero* of the "re-ordering" guarantees that your code
wants (as you've explained below) -- neither w.r.t. CPU re-ordering (which
probably you don't care about) *nor* w.r.t. compiler re-ordering (which
you definitely _do_ care about).
> > However, I still don't know which atomic_read/atomic_set in
> > your patch would be broken if there were no volatile. Could
> > you please point them out?
>
> Suppose I tried replacing the ORDERED_WRT_IRQ() calls with
> atomic_read() and atomic_set(). Starting with __rcu_read_lock():
>
> o	If "ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])++"
>	was ordered by the compiler after
>	"ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1", then
>	suppose an NMI/SMI happened after the rcu_read_lock_nesting but
>	before the rcu_flipctr.
>
>	Then if there was an rcu_read_lock() in the SMI/NMI
>	handler (which is perfectly legal), the nested rcu_read_lock()
>	would believe that it could take the then-clause of the
>	enclosing "if" statement. But because the rcu_flipctr per-CPU
>	variable had not yet been incremented, an RCU updater would
>	be within its rights to assume that there were no RCU reads
>	in progress, thus possibly yanking a data structure out from
>	under the reader in the SMI/NMI function.
>
>	Fatal outcome. Note that only one CPU is involved here
>	because these are all either per-CPU or per-task variables.

Ok, so you don't care about CPU re-ordering. Still, I should let you know
that your ORDERED_WRT_IRQ() -- bad name, btw -- is still buggy. What you
want is a full compiler optimization barrier().

[ Your code probably works now, and emits correct code, but that's just
  because gcc did what it did. Nothing in any standard, or in any
  documented behaviour of gcc, or anything about the real (or expected)
  semantics of "volatile" is protecting the code here. ]

> o	If "ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1"
>	was ordered by the compiler to follow the
>	"ORDERED_WRT_IRQ(me->rcu_flipctr_idx) = idx", and an NMI/SMI
>	happened between the two, then an __rcu_read_lock() in the NMI/SMI
>	would incorrectly take the "else" clause of the enclosing "if"
>	statement.
>	If some other CPU flipped the rcu_ctrlblk.completed
>	in the meantime, then the __rcu_read_lock() would (correctly)
>	write the new value into rcu_flipctr_idx.
>
>	Well and good so far. But the problem arises in
>	__rcu_read_unlock(), which then decrements the wrong counter.
>	Depending on exactly how subsequent events played out, this could
>	result in either prematurely ending grace periods or never-ending
>	grace periods, both of which are fatal outcomes.
>
> And the following are not needed in the current version of the
> patch, but will be in a future version that either avoids disabling
> irqs or that dispenses with the smp_read_barrier_depends() that I
> have 99% convinced myself is unneeded:
>
> o	nesting = ORDERED_WRT_IRQ(me->rcu_read_lock_nesting);
>
> o	idx = ORDERED_WRT_IRQ(rcu_ctrlblk.completed) & 0x1;
>
> Furthermore, in that future version, irq handlers can cause the same
> mischief that SMI/NMI handlers can in this version.
>
> Next, looking
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Nick Piggin wrote: > I don't know why people would assume volatile of atomics. AFAIK, most > of the documentation is pretty clear that all the atomic stuff can be > reordered etc. except for those that modify and return a value. Which documentation is there? For driver authors, there is LDD3. It doesn't specifically cover effects of optimization on accesses to atomic_t. For architecture port authors, there is Documentation/atomic_ops.txt. Driver authors also can learn something from that document, as it indirectly documents the atomic_t and bitops APIs. Prompted by this thread, I reread this document, and indeed, the sentence "Unlike the above routines, it is required that explicit memory barriers are performed before and after [atomic_{inc,dec}_return]" indicates that atomic_read (one of the "above routines") is very different from all other atomic_t accessors that return values. This is strange. Why is it that atomic_read stands out that way? IMO this API imbalance is quite unexpected by many people. Wouldn't it be beneficial to change the atomic_read API to behave the same like all other atomic_t accessors that return values? OK, it is also different from the other accessors that return data in so far as it doesn't modify the data. But as driver "author", i.e. user of the API, I can't see much use of an atomic_read that can be reordered and, more importantly, can be optimized away by the compiler. Sure, now that I learned of these properties I can start to audit code and insert barriers where I believe they are needed, but this simply means that almost all occurrences of atomic_read will get barriers (unless there already are implicit but more or less obvious barriers like msleep). 
-- Stefan Richter -=-=-=== =--- =---= http://arcgraph.de/sr/
Re: [PATCH/RFC 3/4]Introduce "account modifiers" mechanism
Rusty Russell wrote:
> On Thu, 2007-08-16 at 17:58 +0200, Laurent Vivier wrote:
>> [PATCH 3/3] introduce "account modifiers" mechanism in the kernel
>> allowing a module to modify the collected accounting for a given task.
>> This implementation is based on the "preempt_notifier".
>> "account_system_time()" and "account_user_time()" can call functions
>> registered by a module to modify the cputime value.
>>
>> Signed-off-by: Laurent Vivier <[EMAIL PROTECTED]>
>
> Hi Laurent,

Hi Rusty, how are your puppies? And thank you for your comment.

> This seems a little like overkill. Why not just add an
> "account_guest_time" which subtracts the given amount of time from
> system time (if available) and adds it to guest time? Then kvm (and
> lguest) should just need to call this at the right times.

We can. I did something like this before. Doing it like that, I think
there is a major issue: system time can be decreasing (as we subtract a
value from it), and thus we can have negative values in a tool like top.
That's why I play with the cputime to add to system time and not directly
with the system time.

BUT I'm very open, my only goal is to be able to compute guest time; "how"
is not very important... What we can do:

- keep PATCHES 1 and 2, because we need to store guest time for cpu and
  tasks. It is very generic.

- remove PATCH 3, and add in task_struct a "ktime vtime" where we
  accumulate guest time (by calling something like guest_enter() and
  guest_exit() from the virtualization engine), and then in
  account_system_time(), when we have cputime > vtime, subtract vtime from
  cputime and add vtime to user time and guest time. But doing it like
  this we freeze in kernel/sched.c the link between system time, user time
  and guest time (i.e. system time = system time - vtime, user time = user
  time + vtime and guest time = guest time + vtime).

- modify PATCH 4 to use the new PATCH 3.

Do you agree? Does anybody disagree?
Laurent
-- 
- [EMAIL PROTECTED] -- "Software is hard" - Donald Knuth
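Rusty's account_guest_time() suggestion, and Laurent's concern about
system time going negative, can both be seen in a toy model (hypothetical
simplified types; the real code deals in cputime_t):

```c
#include <assert.h>

/* Simplified per-task CPU accounting (hypothetical fields). */
struct task_times {
	long utime;	/* user time   */
	long stime;	/* system time */
	long gtime;	/* guest time  */
};

/* Move up to 'cputime' from system time to guest time.  Clamping the
 * slice to the available stime ("if available", as Rusty put it) is
 * what keeps stime from going negative -- the scenario Laurent worries
 * would show up as bogus values in top. */
static void account_guest_time(struct task_times *t, long cputime)
{
	long slice = cputime < t->stime ? cputime : t->stime;

	t->stime -= slice;
	t->gtime += slice;
}
```

The sketch side-steps the question of *when* the virtualization engine
calls this; the thread's alternative (accumulating vtime at
guest_enter()/guest_exit() and reconciling in account_system_time()) is a
different placement of the same arithmetic.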
Re: Fork Bombing Patch
Paul Jackson wrote:
> Petr wrote:
>> Please do not add comments inside functions.
>
> I find this advice a bit odd. I am not aware of
> any prohibition of comments inside functions.
>
> As with comments outside functions, they should
> serve a worthwhile purpose, of course. One might
> debate whether this particular comment added by
> Anand was sufficiently valuable to be worth
> having.
>
> But I don't agree to a blanket prohibition on
> comments inside functions.

I'm not saying that comments inside functions should be prohibited, but
comments inside functions often lead to over-commenting. There must be a
good reason for adding such a comment.

See CodingStyle, chapter 8: Commenting:

  Also, try to avoid putting comments inside a function body.

(Before somebody starts arguing with this one sentence, please also read
the rest of the chapter; it is not long and you'll understand the author's
intention better.)

Kind regards,
Petr Tesarik
Re: [Cluster-devel] Re: [gfs2][RFC] readdir caused ls process into D (uninterruptible) state, under testing with Samba 3.0.25
On 8/16/07, Steven Whitehouse <[EMAIL PROTECTED]> wrote:
> Hi,
>
> On Thu, 2007-08-16 at 16:20 +0800, 程任全 wrote:
> > It seems that gfs2 cannot work well with Samba.
> >
> > I'm using gfs2 and the new cluster suite (cman with openais):
> >
> > 1. the testing environment is 1 iscsi target and 2 cluster nodes,
> > 2. the two nodes both use an iscsi initiator to connect to the target,
> > 3. they're using the same physical iscsi disk,
> > 4. LVM2 runs on top of the same iscsi disk,
> > 5. on the same lv (logical volume), I created a gfs2 filesystem,
> > 6. the gfs2 filesystem is mounted on the same path on both nodes,
> > 7. Samba shares the gfs2 mount point on both nodes.
> >
> > Now, testing with Windows clients: with two or more clients connected
> > to Samba everything is still normal; but when heavy writes or reads
> > start, the samba server daemon goes into the D state, that is,
> > uninterruptible in the kernel. I wonder, is that a problem of gfs2?
>
> Which version of gfs2 are you using? GFS2 doesn't support leases which I
> know that Samba uses, however only relatively recent kernels have been
> able to report that fact via the VFS.
>
> > Then I ran a simple ls command on the gfs2 mount point:
> >   $ ls /mnt/gfs2
> > and the ls process also went into the D state.
> >
> > I think it's a problem with the readdir implementation in gfs2, and I
> > want to fix it; could someone give me some pointers?
>
> Can you get a stack trace? echo 't' >/proc/sysrq-trigger
> That should show where Samba is getting stuck,
>
> Steve.
the stack trace of the 'D' state `ls`:
===
ls            D F89B83F8  2200 12018      1          (NOTLB)
 f3eeadd4 0082 f6a425c0 f89b83f8 f3eead9c f6a425d4 f6f32d80 f573a93c
 0001 f89b83f3 c40a2030 c3fa9fa0 c40aaa70 c40aab7c 0e89 b2a4b036
 02e4 c40a2030 f3eeae1c c3f85e98 f8e11e09 f8e11e0e
Call Trace:
 [] gdlm_bast+0x0/0x93 [lock_dlm]
 [] gdlm_ast+0x0/0x5 [lock_dlm]
 [] holder_wait+0x0/0x8 [gfs2]
 [] holder_wait+0x5/0x8 [gfs2]
 [] __wait_on_bit+0x2c/0x51
 [] out_of_line_wait_on_bit+0x6f/0x77
 [] holder_wait+0x0/0x8 [gfs2]
 [] wake_bit_function+0x0/0x3c
 [] wake_bit_function+0x0/0x3c
 [] wait_on_holder+0x3c/0x40 [gfs2]
 [] glock_wait_internal+0x81/0x1a3 [gfs2]
 [] gfs2_glock_nq+0x5e/0x79 [gfs2]
 [] gfs2_getattr+0x72/0xb5 [gfs2]
 [] gfs2_getattr+0x6b/0xb5 [gfs2]
 [] do_path_lookup+0x17a/0x1c3
 [] gfs2_getattr+0x0/0xb5 [gfs2]
 [] vfs_getattr+0x3e/0x51
 [] vfs_lstat_fd+0x2b/0x3d
 [] do_path_lookup+0x17a/0x1c3
 [] mntput_no_expire+0x11/0x6e
 [] sys_lstat64+0xf/0x23
 [] sys_symlinkat+0x81/0xb5
 [] sysenter_past_esp+0x5d/0x81
 [] __ipv6_addr_type+0x88/0xb8

The system is still running, so the normal 'R' and 'S' state processes are
ignored. But this call trace shows that it's not readdir's fault after all;
the problem is with gdlm_bast in the lock_dlm module.

-- 
Denis Cheng
Re: nanosleep() accuracy
2007/8/17, Michal Schmidt <[EMAIL PROTECTED]>:
> GolovaSteek wrote:
> > Hello!
> > I need to use sleep with accurate timing.
> > I use 2.6.21 with the rt-preempt patch,
> > with rt_preempt, dyn_ticks, and local_apic enabled.
> > But
> >
> >   req.tv_nsec = 300000;
> >   req.tv_sec = 0;
> >   nanosleep(&req,NULL)
> >
> > makes a pause of around 310-330 microseconds.
>
> How do you measure this?
> If you want to have something done every 300 microseconds, you must not
> sleep for 300 microseconds in each iteration, because you'd accumulate
> errors. Use a periodic timer or use the current time to compute how long
> to sleep in each iteration. Take a look at how cyclictest does it.

No. I just want my program to go to sleep sometimes and wake up at the
correct time.

> > I tried to understand how nanosleep() works, but it doesn't depend on
> > jiffies or on smp_apic_timer_interrupt.
> >
> > When can accuracy be lost?
> > And how are processes woken up?
> >
> > GolovaSteek
>
> Don't forget the process will always have non-zero wakeup latency. It
> takes some time to process an interrupt, wake up the process and schedule
> it to run on the CPU. 10-30 microseconds is not unreasonable.

But can those 2 operations really be done in 10 microseconds? And why is
there that inconsistency? Why sometimes 10 and sometimes 30? At which
points in the implementation does it happen?

GolovaSteek
Re: [PATCH] Only initialize hvc_console if needed, cleanup Kconfig help
On Fri, 17 Aug 2007 11:25:46 +1000 Rusty Russell <[EMAIL PROTECTED]> wrote:
>
> diff -r 0730da2377be drivers/char/Kconfig
> --- a/drivers/char/Kconfig	Tue Aug 14 12:46:08 2007 +1000
> +++ b/drivers/char/Kconfig	Fri Aug 17 09:05:12 2007 +1000
> @@ -568,10 +568,10 @@ config HVC_DRIVER
>  config HVC_DRIVER
>  	bool
>  	help
> -	  Users of pSeries machines that want to utilize the hvc console front-end
> -	  module for their backend console driver should select this option.
> -	  It will automatically be selected if one of the back-end console drivers
> -	  is selected.
> +	  Generic "hypervisor virtual console" infrastructure for various
> +	  hypervisors (pSeries, Xen, lguest).
	                          ^^^ You left out iSeries.

Probably better to just not enumerate them as they may increase over time.

Otherwise looks good (though I haven't actually built/booted it).

-- 
Cheers,
Stephen Rothwell [EMAIL PROTECTED]
http://www.canb.auug.org.au/~sfr/
Re: Early printk behaviour
Mike Frysinger wrote:
>> Hmm, sort of, although I didn't think about the case of no real console
>> replacing the early console. The intention of the patch is to have a
>> smooth handover from the boot console to the real console. And, yes, if
>> no real console is ever registered the boot console keeps running ...
>
> i think it also occurs in the case where real console != early console

No. At least not if the boot console has the CON_BOOT flag set, as it
should. The last message you'll see on the boot console is the handover
printk, telling you which real console device prints the following
messages. Whether or not the early and real consoles go to the same
physical device doesn't matter.

>> So you can either leave it running and *not* mark it __init, so it can
>> keep on going without breaking. Or you can explicitly unregister your
>> boot console at some point, maybe using a late_initcall.
>
> wouldn't a common kernel late_initcall() be more appropriate? if the
> early console hasn't switched over (for whatever reason), then kill it

Hmm, yes, should be doable in generic code. Check whether the current
console has CON_BOOT set and if so unregister it.

cheers,
  Gerd
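The late_initcall idea sketched above amounts to a little list walk. A
userspace model (toy types standing in for the console list and CON_BOOT
flag; the real kernel list also takes a lock):

```c
#include <assert.h>
#include <stddef.h>

#define CON_BOOT 1	/* toy stand-in for the console flag */

struct console {
	int flags;
	struct console *next;
};

/* Model of a generic late_initcall: walk the registered consoles and
 * unregister any boot console that never got handed over. */
static void disable_boot_consoles(struct console **list)
{
	struct console **pp = list;

	while (*pp) {
		if ((*pp)->flags & CON_BOOT)
			*pp = (*pp)->next;	/* unlink = unregister */
		else
			pp = &(*pp)->next;
	}
}
```

The pointer-to-pointer walk removes matching entries without special-casing
the list head, which is why a single pass suffices even when the boot
console is first in the list.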
Re: nanosleep() accuracy
On Aug 17 2007 11:44, GolovaSteek wrote: >> How do you measure this? >> If you want to have something done every 300 microseconds, you must not >> sleep for 300 microseconds in each iteration, because you'd accumulate >> errors. Use a periodic timer or use the current time to compute how long >> to sleep in each iteration. Take a look how cyclictest does it. > >no. I just want my programm go to sleep sometimes and wake up in correct time. Would it be acceptable to use an optimistic strategy, like the one below? Let's say that the following tasks happen at each time: A at 0, B at 300, C at 600, D at 900, E at 1200, F at 1500. Assume sleeping takes 500 µs. Then B and C could be run at 500, D at 1000 and E,F at 1500. Jan --
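Jan's schedule can be checked with a quick simulation. This sketch
formalizes "optimistic" as a half-period lookahead -- a hypothetical rule,
not anything from the thread itself: on each wakeup, also run any task due
less than half a period later, since running it slightly early beats being
a full sleep late. With a 500 µs sleep cost it reproduces the times listed
above.

```c
#include <assert.h>

#define PERIOD     300	/* task i is due at i * 300 us */
#define SLEEP_COST 500	/* one sleep takes at least 500 us */

/* run_at[i] records when task i actually runs. */
static void simulate(int ntasks, long run_at[])
{
	long now = 0;
	int i = 0;

	for (;;) {
		/* run everything already due, plus anything due less
		 * than half a period from now (the lookahead rule) */
		while (i < ntasks && (long)i * PERIOD <= now + PERIOD / 2)
			run_at[i++] = now;
		if (i >= ntasks)
			break;
		now += SLEEP_COST;	/* sleep, waking 500 us later */
	}
}
```

Three sleeps serve five late-or-early tasks instead of five sleeps, which
is the saving the optimistic strategy is after; the price is that a task
like C runs up to 100 µs before its nominal deadline.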
Re: Need help with modules loading
On 8/17/07, Larry Finger <[EMAIL PROTECTED]> wrote: > A new driver for the Broadcom BCM43xx devices has been written that uses > mac80211, rather than > softmac. The newest versions of the Broadcom firmware does not support all > the BCM devices. > Accordingly, a separate driver is being prepared that will use an older > version of the firmware and > support these legacy devices. Unfortunately, there is not a clean separation > based on PCI id's; > however, the revision level of the 802.11 wireless core can be used to > determine which driver should > be used. The scheme works on most systems, but not mine and I need some help > to discover why. > The 'MODALIAS=ssb:v4243id0812rev0A' line is correct for my device. In fact > issuing a modprobe > "ssb:v4243id0812rev0A" command results in the loading of the module. For some > reason, this does not > happen automatically. > > Initially, I suspected that my version of udev (103-13) was too old; however, > upgrading to version > 114 did not help. My module-init-tools are V 3.2.2 and my distro is the > x86_64 version of openSUSE 10.2. openSUSE 10.2 used a whitelist of buses which trigger module loading. It's in the udev sysconfig rules and /sbin/hwup. The easiest fix is probably to add a rule for that bus: ACTION=="add", SUBSYSTEM=="ssb", ENV{MODALIAS}=="?*", RUN+="/sbin/modprobe $env{MODALIAS}" openSUSE 10.3 will call modprobe directly; the whitelist and the whole hwup logic have been removed in the meantime. Kay
Re: [PATCH] sysfs: don't warn on removal of a nonexistent binary file
On Fri, 17 Aug 2007 10:19:24 +0900, Tejun Heo <[EMAIL PROTECTED]> wrote: > Alan Stern wrote: > > This patch (as960) removes the error message and stack dump logged by > > sysfs_remove_bin_file() when someone tries to remove a nonexistent > > file. The warning doesn't seem to be needed, since none of the other > > file-, symlink-, or directory-removal routines in sysfs complain in a > > comparable way. > > > > Signed-off-by: Alan Stern <[EMAIL PROTECTED]> > > Acked-by: Tejun Heo <[EMAIL PROTECTED]> Acked-by: Cornelia Huck <[EMAIL PROTECTED]>
Re: [PATCH] lockdep: annotate rcu_read_{,un}lock()
On Thu, 2007-08-16 at 09:01 -0700, Paul E. McKenney wrote: > On Thu, Aug 16, 2007 at 04:25:07PM +0200, Peter Zijlstra wrote: > > > > There seem to be some unbalanced rcu_read_{,un}lock() issues of late, > > how about doing something like this: > > This will break when rcu_read_lock() and rcu_read_unlock() are invoked > from NMI/SMI handlers -- the raw_local_irq_save() in lock_acquire() will > not mask NMIs or SMIs. > > One approach would be to check for being in an NMI/SMI handler, and > to avoid calling lock_acquire() and lock_release() in those cases. It seems: #define nmi_enter() do { lockdep_off(); __irq_enter(); } while (0) #define nmi_exit() do { __irq_exit(); lockdep_on(); } while (0) Should make it all work out just fine. (for NMIs at least, /me fully ignorant of the workings of SMIs) > Another approach would be to use sparse, which has checks for > rcu_read_lock()/rcu_read_unlock() nesting. Yeah, but one more method can never hurt, no? :-)
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Stefan Richter wrote: Nick Piggin wrote: I don't know why people would assume volatile of atomics. AFAIK, most of the documentation is pretty clear that all the atomic stuff can be reordered etc. except for those that modify and return a value. Which documentation is there? Documentation/atomic_ops.txt For driver authors, there is LDD3. It doesn't specifically cover effects of optimization on accesses to atomic_t. For architecture port authors, there is Documentation/atomic_ops.txt. Driver authors also can learn something from that document, as it indirectly documents the atomic_t and bitops APIs. "Semantics and Behavior of Atomic and Bitmask Operations" is pretty direct :) Sure, it says that it's for arch maintainers, but there is no reason why users can't make use of it. Prompted by this thread, I reread this document, and indeed, the sentence "Unlike the above routines, it is required that explicit memory barriers are performed before and after [atomic_{inc,dec}_return]" indicates that atomic_read (one of the "above routines") is very different from all other atomic_t accessors that return values. This is strange. Why is it that atomic_read stands out that way? IMO It is not just atomic_read of course. It is atomic_add,sub,inc,dec,set. this API imbalance is quite unexpected by many people. Wouldn't it be beneficial to change the atomic_read API to behave the same like all other atomic_t accessors that return values? It is very consistent and well defined. Operations which both modify the data _and_ return something are defined to have full barriers before and after. What do you want to add to the other atomic accessors? Full memory barriers? Only compiler barriers? It's quite likely that if you think some barriers will fix bugs, then there are other bugs lurking there anyway. Just use spinlocks if you're not absolutely clear about potential races and memory ordering issues -- they're pretty cheap and simple. 
OK, it is also different from the other accessors that return data in so far as it doesn't modify the data. But as driver "author", i.e. user of the API, I can't see much use of an atomic_read that can be reordered and, more importantly, can be optimized away by the compiler. It will return to you an atomic snapshot of the data (loaded from memory at some point since the last compiler barrier). All you have to be aware of is compiler barriers and the Linux SMP memory ordering model, which should be a given if you are writing lockless code. Sure, now that I learned of these properties I can start to audit code and insert barriers where I believe they are needed, but this simply means that almost all occurrences of atomic_read will get barriers (unless there already are implicit but more or less obvious barriers like msleep). You might find that these places that appear to need barriers are buggy for other reasons anyway. Can you point to some in-tree code we can have a look at? -- SUSE Labs, Novell Inc.
Re: [PATCH]: proc: export a process's resource limits via /proc/<pid>/limits
On Thu, 16 Aug 2007 08:35:38 EDT, Neil Horman said: > Hey again- > Andrew requested that I repost this cleanly, after running the patch > through checkpatch. As requested here it is with the changelog. > > Currently, there exists no method for a process to query the resource > limits of another process. They can be inferred via some mechanisms but they > cannot be explicitly determined. Given that this information can be useful to > know during the debugging of an application, I've written this patch which > exports all of a process's limits via /proc/<pid>/limits. > > Tested successfully by myself on x86 on top of 2.6.23-rc2-mm1. I had only one comment the first time around, and Neil addressed it. I've also tested on x86_64 23-rc2-mm1, and it works here too. I saw where this uses units of 'bytes' while the shell 'ulimit' uses 1024-byte units in some places, but (a) this lists the units and (b) it's consistent with setrlimit(). Testing with values >4G shows it's 64-bit clean as well. One question: Are the units milliseconds or seconds here: + [RLIMIT_CPU] = {"Max cpu time", "ms"}, Other than that, feel free to stick either/both of these on: Reviewed-By: Valdis Kletnieks <[EMAIL PROTECTED]> Tested-By: Valdis Kletnieks <[EMAIL PROTECTED]>
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
On Fri, 17 Aug 2007, Paul Mackerras wrote: > Herbert Xu writes: > > > On Fri, Aug 17, 2007 at 03:09:57PM +1000, Paul Mackerras wrote: > > > Herbert Xu writes: > > > > > > > Can you find an actual atomic_read code snippet there that is > > > > broken without the volatile modifier? > > > > > > There are some in arch-specific code, for example line 1073 of > > > arch/mips/kernel/smtc.c. On mips, cpu_relax() is just barrier(), so > > > the empty loop body is ok provided that atomic_read actually does the > > > load each time around the loop. > > > > A barrier() is all you need to force the compiler to reread > > the value. > > > > The people advocating volatile in this thread are talking > > about code that doesn't use barrier()/cpu_relax(). > > Did you look at it? Here it is: > > /* Someone else is initializing in parallel - let 'em finish */ > while (atomic_read(&idle_hook_initialized) < 1000) > ; Honestly, this thread is suffering from HUGE communication gaps. What Herbert (obviously) meant there was that "this loop could've been okay _without_ using volatile-semantics-atomic_read() also, if only it used cpu_relax()". That does work, because cpu_relax() is _at least_ barrier() on all archs (on some it also emits some arch-dependent "pause" kind of instruction). Now, saying that "MIPS does not have such an instruction so I won't use cpu_relax() for arch-dependent-busy-while-loops in arch/mips/" sounds like a wrong argument, because: tomorrow, such arch's _may_ introduce such an instruction, so naturally, at that time we'd change cpu_relax() appropriately (in reality, we would actually *re-define* cpu_relax() and ensure that the correct version gets pulled in depending on whether the callsite code is legacy or only for the newer such CPUs of said arch, whatever), but loops such as this would remain un-changed, because they never used cpu_relax()! 
OTOH an argument that said the following would've made a stronger case: "I don't want to use cpu_relax() because that's a full memory clobber barrier() and I have loop-invariants / other variables around in that code that I *don't* want the compiler to forget just because it used cpu_relax(), and hence I will not use cpu_relax() but instead make my atomic_read() itself have "volatility" semantics. Not just that, but I will introduce a cpu_relax_no_barrier() on MIPS, that would be a no-op #define for now, but which may not be so forever, and continue to use that in such busy loops." In general, please read the thread-summary I've tried to do at: http://lkml.org/lkml/2007/8/17/25 Feel free to continue / comment / correct stuff from there, there's too much confusion and circular-arguments happening on this thread otherwise. [ I might've made an incorrect statement there about "volatile" w.r.t. cache on non-x86 archs, I think. ] Satyam
Re: [PATCH/RFC 3/4]Introduce "account modifiers" mechanism
On Fri, 2007-08-17 at 09:35 +0200, Laurent Vivier wrote: > Rusty Russell wrote: > > Hi Laurent, > > Hi Rusty, > how are your puppies? They're getting a little fat, actually. Too many features ... > - remove PATCH 3, and add in task_struct a "ktime vtime" where we accumulate > guest time (by calling something like guest_enter() and guest_exit() from the > virtualization engine), and when in account_system_time() we have cputime > > vtime we subtract vtime from cputime and add vtime to user time and guest > time. > But doing like this we freeze in kernel/sched.c the link between system time, > user time and guest time (i.e. system time = system time - vtime, user time = > user time + vtime and guest time = guest time + vtime). Actually, I think we can set a per-cpu "in_guest" flag for the scheduler code, which then knows to add the tick to the guest time. That seems the simplest possible solution. lguest or kvm would set the flag before running the guest (which is done with preempt disabled or using preemption hooks), and reset it afterwards. Thoughts? Rusty.
Re: how can I get an account on master.kernel.org machine
Randy, Eric, Thank you very much! There are only diffs in http://ftp.arm.linux.org.uk/pub/armlinux/kernel/git-cur/. What are these diffs based on, Linus' kernel tree? 2007/8/16, eric miao <[EMAIL PROTECTED]>: > And if your purpose is only to download that tree, my suggestion is that > some of the git trees are not public like linux-arm so the owner can do > whatever he likes in the tree, and don't ever try to have access to it. > > for linux-arm, it will be constantly pulled by Linus, unless you really > want the _most_ up-to-date patches, which you might find at > > http://ftp.arm.linux.org.uk/pub/armlinux/kernel/git-cur/ > > that folder will be constantly refreshed by the maintainer. > > - eric > > Randy Dunlap wrote: > > On Thu, 16 Aug 2007 10:58:41 +0800 ye janboe wrote: > > > >> I want to download the linux-arm git repos, but I do not have an account on > >> the master machine. > > > > see http://www.kernel.org/faq/#account > > > > --- > > ~Randy > > *** Remember to use Documentation/SubmitChecklist when testing your code ***
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
On Fri, 17 Aug 2007, Nick Piggin wrote: > Satyam Sharma wrote: > > > #define atomic_read_volatile(v) \ > > ({ \ > > forget((v)->counter); \ > > ((v)->counter); \ > > }) > > > > where: > > *vomit* :) I wonder if this'll generate smaller and better code than _both_ the other atomic_read_volatile() variants. Would need to build allyesconfig on lots of diff arch's etc to test the theory though. > Not only do I hate the keyword volatile, but the barrier is only a > one-sided affair so its probable this is going to have slightly > different allowed reorderings than a real volatile access. True ... > Also, why would you want to make these insane accessors for atomic_t > types? Just make sure everybody knows the basics of barriers, and they > can apply that knowledge to atomic_t and all other lockless memory > accesses as well. Code that looks like: while (!atomic_read(&v)) { ... cpu_relax_no_barrier(); forget(v.counter); } would be uglier. Also think about code such as: a = atomic_read(); if (!a) do_something(); forget(); a = atomic_read(); ... /* some code that depends on value of a, obviously */ forget(); a = atomic_read(); ... So much explicit sprinkling of "forget()" looks ugly. atomic_read_volatile() on the other hand, looks neater. The "_volatile()" suffix makes it also no less explicit than an explicit barrier-like macro that this primitive is something "special", for code clarity purposes. > > #define forget(a) __asm__ __volatile__ ("" :"=m" (a) :"m" (a)) > > I like order(x) better, but it's not the most perfect name either. forget(x) is just a stupid-placeholder-for-a-better-name. order(x) sounds good but we could leave quibbling about function or macro names for later, this thread is noisy as it is :-)
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Satyam Sharma wrote: On Fri, 17 Aug 2007, Herbert Xu wrote: On Fri, Aug 17, 2007 at 01:43:27PM +1000, Paul Mackerras wrote: BTW, the sort of missing barriers that triggered this thread aren't that subtle. It'll result in a simple lock-up if the loop condition holds upon entry. At which point it's fairly straightforward to find the culprit. Not necessarily. A barrier-less buggy code such as below: atomic_set(&v, 0); ... /* some initial code */ while (atomic_read(&v)) ; ... /* code that MUST NOT be executed unless v becomes non-zero */ (where v->counter has no volatile access semantics) could be generated by the compiler to simply *elide* or *do away with* the loop itself, thereby making the: "/* code that MUST NOT be executed unless v becomes non-zero */" be executed even when v is zero! That is subtle indeed, and causes no hard lockups. Then I presume you mean while (!atomic_read(&v)) ; Which is just the same old infinite loop bug solved with cpu_relax(). These are pretty trivial to audit and fix, and also to debug, I would think. Granted, the above IS buggy code. But, the stated objective is to avoid heisenbugs. Anyway, why are you making up code snippets that are buggy in other ways in order to support this assertion being made that lots of kernel code supposedly depends on volatile semantics. Just reference the actual code. And we have driver / subsystem maintainers such as Stefan coming up and admitting that often a lot of code that's written to use atomic_read() does assume the read will not be elided by the compiler. So these are broken on i386 and x86-64? Are they definitely safe on SMP and weakly ordered machines with just a simple compiler barrier there? Because I would not be surprised if there are a lot of developers who don't really know what to assume when it comes to memory ordering issues.
This is not a dig at driver writers: we still have memory ordering problems in the VM too (and probably most of the subtle bugs in lockless VM code are memory ordering ones). Let's not make up a false sense of security and hope that sprinkling volatile around will allow people to write bug-free lockless code. If a writer can't be bothered reading API documentation and learning the Linux memory model, they can still be productive writing safely locked code. -- SUSE Labs, Novell Inc.
[-mm PATCH 1/9] Memory controller resource counters (v6)
From: Pavel Emelianov <[EMAIL PROTECTED]> Introduce generic structures and routines for resource accounting. Each resource accounting container is supposed to aggregate it, container_subsystem_state and its resource-specific members within. Signed-off-by: Pavel Emelianov <[EMAIL PROTECTED]> Signed-off-by: <[EMAIL PROTECTED]> --- include/linux/res_counter.h | 102 + init/Kconfig|7 ++ kernel/Makefile |1 kernel/res_counter.c| 120 4 files changed, 230 insertions(+) diff -puN /dev/null include/linux/res_counter.h --- /dev/null 2007-06-01 20:42:04.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/include/linux/res_counter.h 2007-08-17 13:14:18.0 +0530 @@ -0,0 +1,102 @@ +#ifndef __RES_COUNTER_H__ +#define __RES_COUNTER_H__ + +/* + * Resource Counters + * Contain common data types and routines for resource accounting + * + * Copyright 2007 OpenVZ SWsoft Inc + * + * Author: Pavel Emelianov <[EMAIL PROTECTED]> + * + */ + +#include + +/* + * The core object. the container that wishes to account for some + * resource may include this counter into its structures and use + * the helpers described beyond + */ + +struct res_counter { + /* +* the current resource consumption level +*/ + unsigned long usage; + /* +* the limit that usage cannot exceed +*/ + unsigned long limit; + /* +* the number of unsuccessful attempts to consume the resource +*/ + unsigned long failcnt; + /* +* the lock to protect all of the above. +* the routines below consider this to be IRQ-safe +*/ + spinlock_t lock; +}; + +/* + * Helpers to interact with userspace + * res_counter_read/_write - put/get the specified fields from the + * res_counter struct to/from the user + * + * @counter: the counter in question + * @member: the field to work with (see RES_xxx below) + * @buf: the buffer to opeate on,... + * @nbytes: its size... + * @pos: and the offset. 
+ */ + +ssize_t res_counter_read(struct res_counter *counter, int member, + const char __user *buf, size_t nbytes, loff_t *pos); +ssize_t res_counter_write(struct res_counter *counter, int member, + const char __user *buf, size_t nbytes, loff_t *pos); + +/* + * the field descriptors. one for each member of res_counter + */ + +enum { + RES_USAGE, + RES_LIMIT, + RES_FAILCNT, +}; + +/* + * helpers for accounting + */ + +void res_counter_init(struct res_counter *counter); + +/* + * charge - try to consume more resource. + * + * @counter: the counter + * @val: the amount of the resource. each controller defines its own + * units, e.g. numbers, bytes, Kbytes, etc + * + * returns 0 on success and <0 if the counter->usage will exceed the + * counter->limit _locked call expects the counter->lock to be taken + */ + +int res_counter_charge_locked(struct res_counter *counter, unsigned long val); +int res_counter_charge(struct res_counter *counter, unsigned long val); + +/* + * uncharge - tell that some portion of the resource is released + * + * @counter: the counter + * @val: the amount of the resource + * + * these calls check for usage underflow and show a warning on the console + * _locked call expects the counter->lock to be taken + */ + +void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val); +void res_counter_uncharge(struct res_counter *counter, unsigned long val); + +#endif diff -puN init/Kconfig~res_counters_infra init/Kconfig --- linux-2.6.23-rc2-mm2/init/Kconfig~res_counters_infra2007-08-17 13:14:18.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/init/Kconfig2007-08-17 13:14:18.0 +0530 @@ -337,6 +337,13 @@ config CPUSETS Say N if unsure. 
+config RESOURCE_COUNTERS + bool "Resource counters" + help + This option enables controller independent resource accounting + infrastructure that works with containers + depends on CONTAINERS + config SYSFS_DEPRECATED bool "Create deprecated sysfs files" default y diff -puN kernel/Makefile~res_counters_infra kernel/Makefile --- linux-2.6.23-rc2-mm2/kernel/Makefile~res_counters_infra 2007-08-17 13:14:18.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/kernel/Makefile 2007-08-17 13:14:18.0 +0530 @@ -58,6 +58,7 @@ obj-$(CONFIG_RELAY) += relay.o obj-$(CONFIG_SYSCTL) += utsname_sysctl.o obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o +obj-$(CONFIG_RESOURCE_COUNTERS) += res_counter.o ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y) # According to Alan Modra <[EMAIL PROTECTED]>, the -fno-omit-frame-pointer is diff -puN /dev/null kernel/res_counter.c --- /dev/null 2007-06-01 20:42:04.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/kernel/res_counter.c2007-08-
[-mm PATCH 5/9] Memory controller task migration (v6)
Allow tasks to migrate from one container to the other. We migrate mm_struct's mem_container only when the thread group id migrates. Signed-off-by: <[EMAIL PROTECTED]> --- mm/memcontrol.c | 35 +++ 1 file changed, 35 insertions(+) diff -puN mm/memcontrol.c~mem-control-task-migration mm/memcontrol.c --- linux-2.6.23-rc2-mm2/mm/memcontrol.c~mem-control-task-migration 2007-08-17 13:14:19.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/mm/memcontrol.c 2007-08-17 13:14:19.0 +0530 @@ -326,11 +326,46 @@ static int mem_container_populate(struct ARRAY_SIZE(mem_container_files)); } +static void mem_container_move_task(struct container_subsys *ss, + struct container *cont, + struct container *old_cont, + struct task_struct *p) +{ + struct mm_struct *mm; + struct mem_container *mem, *old_mem; + + mm = get_task_mm(p); + if (mm == NULL) + return; + + mem = mem_container_from_cont(cont); + old_mem = mem_container_from_cont(old_cont); + + if (mem == old_mem) + goto out; + + /* +* Only thread group leaders are allowed to migrate, the mm_struct is +* in effect owned by the leader +*/ + if (p->tgid != p->pid) + goto out; + + css_get(&mem->css); + rcu_assign_pointer(mm->mem_container, mem); + css_put(&old_mem->css); + +out: + mmput(mm); + return; +} + struct container_subsys mem_container_subsys = { .name = "memory", .subsys_id = mem_container_subsys_id, .create = mem_container_create, .destroy = mem_container_destroy, .populate = mem_container_populate, + .attach = mem_container_move_task, .early_init = 1, }; _ -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
[-mm PATCH 4/9] Memory controller memory accounting (v6)
Changelog for v6
1. Do a css_put() in the case of a race in allocating page containers (YAMAMOTO Takashi)

Changelog for v5
1. Rename meta_page to page_container
2. Remove PG_metapage and use the lower bit of the page_container pointer for locking

Changelog for v3
1. Fix a probable leak with meta_page's (pointed out by Paul Menage)
2. Introduce a wrapper around mem_container_uncharge for uncharging pages mem_container_uncharge_page()

Changelog
1. Improved error handling, uncharge on errors and check to see if we are leaking pages (review by YAMAMOTO Takashi)

Add the accounting hooks. The accounting is carried out for RSS and Page Cache (unmapped) pages. There is now a common limit and accounting for both. The RSS accounting is accounted at page_add_*_rmap() and page_remove_rmap() time. Page cache is accounted at add_to_page_cache(), __delete_from_page_cache(). Swap cache is also accounted for. Each page's page_container is protected with the last bit of the page_container pointer, this makes handling of race conditions involving simultaneous mappings of a page easier. A reference count is kept in the page_container to deal with cases where a page might be unmapped from the RSS of all tasks, but still lives in the page cache. Credits go to Vaidyanathan Srinivasan for helping with reference counting work of the page container. Almost all of the page cache accounting code has help from Vaidyanathan Srinivasan.
Signed-off-by: Vaidyanathan Srinivasan <[EMAIL PROTECTED]> Signed-off-by: <[EMAIL PROTECTED]> --- include/linux/memcontrol.h | 20 + mm/filemap.c | 12 ++- mm/memcontrol.c| 166 - mm/memory.c| 43 ++- mm/migrate.c |6 + mm/page_alloc.c|3 mm/rmap.c | 17 mm/swap_state.c| 12 ++- mm/swapfile.c | 41 ++- 9 files changed, 293 insertions(+), 27 deletions(-) diff -puN include/linux/memcontrol.h~mem-control-accounting include/linux/memcontrol.h --- linux-2.6.23-rc2-mm2/include/linux/memcontrol.h~mem-control-accounting 2007-08-17 13:14:19.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/include/linux/memcontrol.h 2007-08-17 13:14:19.0 +0530 @@ -30,6 +30,13 @@ extern void mm_free_container(struct mm_ extern void page_assign_page_container(struct page *page, struct page_container *pc); extern struct page_container *page_get_page_container(struct page *page); +extern int mem_container_charge(struct page *page, struct mm_struct *mm); +extern void mem_container_uncharge(struct page_container *pc); + +static inline void mem_container_uncharge_page(struct page *page) +{ + mem_container_uncharge(page_get_page_container(page)); +} #else /* CONFIG_CONTAINER_MEM_CONT */ static inline void mm_init_container(struct mm_struct *mm, @@ -51,6 +58,19 @@ static inline struct page_container *pag return NULL; } +static inline int mem_container_charge(struct page *page, struct mm_struct *mm) +{ + return 0; +} + +static inline void mem_container_uncharge(struct page_container *pc) +{ +} + +static inline void mem_container_uncharge_page(struct page *page) +{ +} + #endif /* CONFIG_CONTAINER_MEM_CONT */ #endif /* _LINUX_MEMCONTROL_H */ diff -puN include/linux/page-flags.h~mem-control-accounting include/linux/page-flags.h diff -puN mm/filemap.c~mem-control-accounting mm/filemap.c --- linux-2.6.23-rc2-mm2/mm/filemap.c~mem-control-accounting2007-08-17 13:14:19.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/mm/filemap.c2007-08-17 13:14:19.0 +0530 @@ -31,6 +31,7 @@ #include #include #include /* for BUG_ON(!in_atomic()) only */ 
+#include #include "internal.h" /* @@ -116,6 +117,7 @@ void __remove_from_page_cache(struct pag { struct address_space *mapping = page->mapping; + mem_container_uncharge_page(page); radix_tree_delete(&mapping->page_tree, page->index); page->mapping = NULL; mapping->nrpages--; @@ -442,6 +444,11 @@ int add_to_page_cache(struct page *page, int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM); if (error == 0) { + + error = mem_container_charge(page, current->mm); + if (error) + goto out; + write_lock_irq(&mapping->tree_lock); error = radix_tree_insert(&mapping->page_tree, offset, page); if (!error) { @@ -451,10 +458,13 @@ int add_to_page_cache(struct page *page, page->index = offset; mapping->nrpages++; __inc_zone_page_state(page, NR_FILE_PAGES); - } + } else + mem_container_uncharge_page(page); + write_unlock_irq(&mapping->tree_lock); radix_tree_preload_end();
[-mm PATCH 2/9] Memory controller containers setup (v6)
Changelong 1. use depends instead of select in init/Kconfig 2. Port to v11 3. Clean up the usage of names (container files) for v11 Setup the memory container and add basic hooks and controls to integrate and work with the container. Signed-off-by: <[EMAIL PROTECTED]> --- include/linux/container_subsys.h |6 + include/linux/memcontrol.h | 21 ++ init/Kconfig |7 ++ mm/Makefile |1 mm/memcontrol.c | 127 +++ 5 files changed, 162 insertions(+) diff -puN include/linux/container_subsys.h~mem-control-setup include/linux/container_subsys.h --- linux-2.6.23-rc2-mm2/include/linux/container_subsys.h~mem-control-setup 2007-08-17 13:14:19.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/include/linux/container_subsys.h 2007-08-17 13:14:19.0 +0530 @@ -30,3 +30,9 @@ SUBSYS(ns) #endif /* */ + +#ifdef CONFIG_CONTAINER_MEM_CONT +SUBSYS(mem_container) +#endif + +/* */ diff -puN /dev/null include/linux/memcontrol.h --- /dev/null 2007-06-01 20:42:04.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/include/linux/memcontrol.h 2007-08-17 13:14:19.0 +0530 @@ -0,0 +1,21 @@ +/* memcontrol.h - Memory Controller + * + * Copyright IBM Corporation, 2007 + * Author Balbir Singh <[EMAIL PROTECTED]> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ */ + +#ifndef _LINUX_MEMCONTROL_H +#define _LINUX_MEMCONTROL_H + +#endif /* _LINUX_MEMCONTROL_H */ + diff -puN init/Kconfig~mem-control-setup init/Kconfig --- linux-2.6.23-rc2-mm2/init/Kconfig~mem-control-setup 2007-08-17 13:14:19.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/init/Kconfig2007-08-17 13:14:19.0 +0530 @@ -364,6 +364,13 @@ config SYSFS_DEPRECATED If you are using a distro that was released in 2006 or later, it should be safe to say N here. +config CONTAINER_MEM_CONT + bool "Memory controller for containers" + depends on CONTAINERS && RESOURCE_COUNTERS + help + Provides a memory controller that manages both page cache and + RSS memory. + config PROC_PID_CPUSET bool "Include legacy /proc//cpuset file" depends on CPUSETS diff -puN mm/Makefile~mem-control-setup mm/Makefile --- linux-2.6.23-rc2-mm2/mm/Makefile~mem-control-setup 2007-08-17 13:14:19.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/mm/Makefile 2007-08-17 13:14:19.0 +0530 @@ -31,4 +31,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o obj-$(CONFIG_MIGRATION) += migrate.o obj-$(CONFIG_SMP) += allocpercpu.o obj-$(CONFIG_QUICKLIST) += quicklist.o +obj-$(CONFIG_CONTAINER_MEM_CONT) += memcontrol.o diff -puN /dev/null mm/memcontrol.c --- /dev/null 2007-06-01 20:42:04.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/mm/memcontrol.c 2007-08-17 13:14:19.0 +0530 @@ -0,0 +1,127 @@ +/* memcontrol.c - Memory Controller + * + * Copyright IBM Corporation, 2007 + * Author Balbir Singh <[EMAIL PROTECTED]> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ */ + +#include +#include +#include + +struct container_subsys mem_container_subsys; + +/* + * The memory controller data structure. The memory controller controls both + * page cache and RSS per container. We would eventually like to provide + * statistics based on the statistics developed by Rik Van Riel for clock-pro, + * to help the administrator determine what knobs to tune. + * + * TODO: Add a water mark for the memory controller. Reclaim will begin when + * we hit the water mark. + */ +struct mem_container { + struct container_subsys_state css; + /* +* the counter to account for memory usage +*/ + struct res_counter res; +}; + +/* + * A page_container page is associated with every page descriptor. The + * page_container helps us identify information about the container + */ +struct page_container { + struct list_head lru; /* per container LRU list */ + struct page *page; + struct mem_container *mem_container; +}; + + +static inline +st
[-mm PATCH 0/9] Memory controller introduction (v6)
Here's version 6 of the memory controller (against 2.6.23-rc2-mm2). The tests that were run have been included in the _Test Results section_ below.

Changelog since version 5
1. Ported to 2.6.23-rc2-mm2
2. Added a css_put() in the case of a race between allocation of page_containers (YAMAMOTO Takashi)

Changelog since version 4
1. Renamed meta_page to page_container (Nick Piggin)
2. Moved locking from page flags to the last bit of the page_container pointer (Nick Piggin)
3. Fixed a rare race in mem_container_isolate_pages (YAMAMOTO Takashi)

Changelog since version 3
1. Ported to v11 of the containers patchset (2.6.23-rc1-mm1). Paul Menage helped immensely with a detailed review of v3
2. Reclaim is retried to allow reclaim of pages coming in as a result of mapped-pages reclaim (swap cache growing as a result of RSS reclaim)
3. page_referenced() is now container aware. During container reclaim, references from other containers do not prevent a page from being reclaimed from a non-referencing container
4. Fixed a possible race condition spotted by YAMAMOTO Takashi

Changelog since version 2
1. Improved error handling in mm/memory.c (spotted by YAMAMOTO Takashi)
2. Test results included
3. try_to_free_mem_container_pages() bug fix (sc->may_writepage is now set to !laptop_mode)

Changelog since version 1
1. Fixed some compile-time errors (in mm/migrate.c, from Vaidyanathan S)
2. Fixed a panic seen when LIST_DEBUG is enabled
3. Added a mechanism to control whether we track page cache or both page cache and mapped pages (as requested by Pavel)
4. Dave Hansen provided detailed review comments on the code

This patchset implements another version of the memory controller. These patches have been through a big churn; the first set of patches was posted last year and earlier this year at http://lkml.org/lkml/2007/2/19/10

This patchset draws from the patches listed above and from some of the contents of the patches posted by Vaidyanathan for page cache control.
http://lkml.org/lkml/2007/6/20/92

At OLS, at the resource management BOF, it was discussed that we need to manage RSS and unmapped page cache together. This patchset is a step towards that.

TODO's
1. Add memory controller water mark support. Reclaim on high water mark
2. Add support for shrinking on limit change
3. Add per-zone per-container LRU lists (this is being actively worked on by Pavel Emelianov)
4. Figure out a better CLUI for the controller
5. Add better statistics
6. Explore using read_uint64() as recommended by Paul Menage (NOTE: read_ulong() would also be nice to have)

In case you have been using/testing the RSS controller, you'll find that this controller works more slowly than the RSS controller. The reason is that both swap cache and page cache are accounted for, so pages do go out to swap upon reclaim (they cannot live in the swap cache).

Any test output, feedback, comments and suggestions are welcome! I am committed to fixing any bugs and improving the performance of the memory controller. Do not hesitate to send any fixes, or requests for fixes, that are required.

Using the patches
1. Enable the Memory controller configuration option
2. Compile and boot the new kernel
3. mount -t container container -o memory /container will mount the memory controller at the /container mount point
4. mkdir /container/a
5. echo $$ > /container/a/tasks (add tasks to the new container)
6. echo -n <limit in pages> > /container/a/memory.limit; for example, echo -n 204800 > /container/a/memory.limit sets the limit to 800 MB on a system with 4K page size
7. Run tasks and see the memory controller at work
8. Report results, provide feedback
9. Develop/use new patches and go to step 1

Test Results

Results for version 3 of the patch were posted at http://lwn.net/Articles/242554/

The code was also tested on a power box with regular machine-usage scenarios, with the config disabled, and with a stress suite that touched all the memory in the system and was limited in a container.
Dhaval ran several tests on v6 and gave his thumbs up to the controller (a hard-to-achieve goal :-) ).

Ran kernbench and stress, three simultaneously, with one inside a container of 800 MB.

Kernbench results running within the container of 800 MB
Thu Aug 16 22:34:59 IST 2007
2.6.23-rc2-mm2-mem-v6

Average Half load -j 4 Run (std deviation):
Elapsed Time 466.548 (47.6014)
User Time 876.598 (10.5273)
System Time 223.136 (1.29247)
Percent CPU 237.2 (23.2744)
Context Switches 146351 (6539.91)
Sleeps 174003 (5031.94)

Average Optimal load -j 32 Run (std deviation):
Elapsed Time 423.496 (60.625)
User Time 897.285 (23.0391)
System Time 228.836 (6.11205)
Percent CPU 257.1 (40.9022)
Context Switches 262134 (123397)
Sleeps 270815 (103597)

Kernbench results running within the default container
--
Thu Aug 16 22:34:33 IST 2007
2.6.23-rc2-mm2-mem-v6
Average Half load -
[-mm PATCH 7/9] Memory controller OOM handling (v6)
From: Pavel Emelianov <[EMAIL PROTECTED]> Out of memory handling for containers over their limit. A task from the container over limit is chosen using the existing OOM logic and killed. TODO: 1. As discussed in the OLS BOF session, consider implementing a user space policy for OOM handling. Signed-off-by: Pavel Emelianov <[EMAIL PROTECTED]> Signed-off-by: <[EMAIL PROTECTED]> --- include/linux/memcontrol.h |1 + mm/memcontrol.c|1 + mm/oom_kill.c | 42 ++ 3 files changed, 40 insertions(+), 4 deletions(-) diff -puN include/linux/memcontrol.h~mem-control-out-of-memory include/linux/memcontrol.h --- linux-2.6.23-rc2-mm2/include/linux/memcontrol.h~mem-control-out-of-memory 2007-08-17 13:14:20.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/include/linux/memcontrol.h 2007-08-17 13:14:20.0 +0530 @@ -39,6 +39,7 @@ extern unsigned long mem_container_isola int mode, struct zone *z, struct mem_container *mem_cont, int active); +extern void mem_container_out_of_memory(struct mem_container *mem); static inline void mem_container_uncharge_page(struct page *page) { diff -puN mm/memcontrol.c~mem-control-out-of-memory mm/memcontrol.c --- linux-2.6.23-rc2-mm2/mm/memcontrol.c~mem-control-out-of-memory 2007-08-17 13:14:20.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/mm/memcontrol.c 2007-08-17 13:14:20.0 +0530 @@ -322,6 +322,7 @@ int mem_container_charge(struct page *pa } css_put(&mem->css); + mem_container_out_of_memory(mem); goto free_pc; } diff -puN mm/oom_kill.c~mem-control-out-of-memory mm/oom_kill.c --- linux-2.6.23-rc2-mm2/mm/oom_kill.c~mem-control-out-of-memory 2007-08-17 13:14:20.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/mm/oom_kill.c 2007-08-17 13:14:20.0 +0530 @@ -25,6 +25,7 @@ #include #include #include +#include int sysctl_panic_on_oom; /* #define DEBUG */ @@ -48,7 +49,8 @@ int sysctl_panic_on_oom; *of least surprise ... 
(be careful when you change it) */ -unsigned long badness(struct task_struct *p, unsigned long uptime) +unsigned long badness(struct task_struct *p, unsigned long uptime, + struct mem_container *mem) { unsigned long points, cpu_time, run_time, s; struct mm_struct *mm; @@ -61,6 +63,13 @@ unsigned long badness(struct task_struct return 0; } +#ifdef CONFIG_CONTAINER_MEM_CONT + if (mem != NULL && mm->mem_container != mem) { + task_unlock(p); + return 0; + } +#endif + /* * The memory size of the process is the basis for the badness. */ @@ -198,7 +207,8 @@ static inline int constrained_alloc(stru * * (not docbooked, we don't want this one cluttering up the manual) */ -static struct task_struct *select_bad_process(unsigned long *ppoints) +static struct task_struct *select_bad_process(unsigned long *ppoints, + struct mem_container *mem) { struct task_struct *g, *p; struct task_struct *chosen = NULL; @@ -252,7 +262,7 @@ static struct task_struct *select_bad_pr if (p->oomkilladj == OOM_DISABLE) continue; - points = badness(p, uptime.tv_sec); + points = badness(p, uptime.tv_sec, mem); if (points > *ppoints || !chosen) { chosen = p; *ppoints = points; @@ -364,6 +374,30 @@ static int oom_kill_process(struct task_ return oom_kill_task(p); } +#ifdef CONFIG_CONTAINER_MEM_CONT +void mem_container_out_of_memory(struct mem_container *mem) +{ + unsigned long points = 0; + struct task_struct *p; + + container_lock(); + rcu_read_lock(); +retry: + p = select_bad_process(&points, mem); + if (PTR_ERR(p) == -1UL) + goto out; + + if (!p) + p = current; + + if (oom_kill_process(p, points, "Memory container out of memory")) + goto retry; +out: + rcu_read_unlock(); + container_unlock(); +} +#endif + static BLOCKING_NOTIFIER_HEAD(oom_notify_list); int register_oom_notifier(struct notifier_block *nb) @@ -436,7 +470,7 @@ retry: * Rambo mode: Shoot down a process and hope it solves whatever * issues we may have. 
*/ - p = select_bad_process(&points); + p = select_bad_process(&points, NULL); if (PTR_ERR(p) == -1UL) goto out; _ -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a mes
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
On Fri, 17 Aug 2007, Nick Piggin wrote:
> Stefan Richter wrote:
> [...]
> Just use spinlocks if you're not absolutely clear about potential
> races and memory ordering issues -- they're pretty cheap and simple.

I fully agree with this. As Paul Mackerras mentioned elsewhere, a lot of authors sprinkle atomic_t in code thinking they're somehow done with *locking*. This is sad, and I wonder if it's time for a Documentation/atomic-considered-dodgy.txt kind of document :-)

> > Sure, now
> > that I learned of these properties I can start to audit code and insert
> > barriers where I believe they are needed, but this simply means that
> > almost all occurrences of atomic_read will get barriers (unless there
> > already are implicit but more or less obvious barriers like msleep).
>
> You might find that these places that appear to need barriers are
> buggy for other reasons anyway. Can you point to some in-tree code
> we can have a look at?

Such code was mentioned elsewhere (query nodemgr_host_thread in cscope) that managed to escape the requirement for a barrier only because of some completely un-obvious compilation-unit-scope thing. But I find such a non-explicit barrier quite bad taste. Stefan, do consider plunking an explicit call to barrier() there.

Satyam
[-mm PATCH 9/9] Memory controller make page_referenced() container aware (v6)
Make page_referenced() container aware. Without this patch, page_referenced() can cause a page to be skipped while reclaiming pages. This patch ensures that other containers do not hold pages in a particular container hostage. It is required to ensure that shared pages are freed from a container when they are not actively referenced from the container that brought them in Signed-off-by: <[EMAIL PROTECTED]> --- include/linux/memcontrol.h |6 ++ include/linux/rmap.h |5 +++-- mm/memcontrol.c|5 + mm/rmap.c | 30 -- mm/vmscan.c|4 ++-- 5 files changed, 40 insertions(+), 10 deletions(-) diff -puN include/linux/rmap.h~mem-control-per-container-page-referenced include/linux/rmap.h --- linux-2.6.23-rc2-mm2/include/linux/rmap.h~mem-control-per-container-page-referenced 2007-08-17 13:14:20.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/include/linux/rmap.h2007-08-17 13:14:20.0 +0530 @@ -8,6 +8,7 @@ #include #include #include +#include /* * The anon_vma heads a list of private "related" vmas, to scan if @@ -86,7 +87,7 @@ static inline void page_dup_rmap(struct /* * Called from mm/vmscan.c to handle paging out */ -int page_referenced(struct page *, int is_locked); +int page_referenced(struct page *, int is_locked, struct mem_container *cnt); int try_to_unmap(struct page *, int ignore_refs); /* @@ -114,7 +115,7 @@ int page_mkclean(struct page *); #define anon_vma_prepare(vma) (0) #define anon_vma_link(vma) do {} while (0) -#define page_referenced(page,l) TestClearPageReferenced(page) +#define page_referenced(page,l,cnt) TestClearPageReferenced(page) #define try_to_unmap(page, refs) SWAP_FAIL static inline int page_mkclean(struct page *page) diff -puN mm/rmap.c~mem-control-per-container-page-referenced mm/rmap.c --- linux-2.6.23-rc2-mm2/mm/rmap.c~mem-control-per-container-page-referenced 2007-08-17 13:14:20.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/mm/rmap.c 2007-08-17 13:14:20.0 +0530 @@ -299,7 +299,8 @@ out: return referenced; } -static int page_referenced_anon(struct page *page) +static 
int page_referenced_anon(struct page *page, + struct mem_container *mem_cont) { unsigned int mapcount; struct anon_vma *anon_vma; @@ -312,6 +313,13 @@ static int page_referenced_anon(struct p mapcount = page_mapcount(page); list_for_each_entry(vma, &anon_vma->head, anon_vma_node) { + /* +* If we are reclaiming on behalf of a container, skip +* counting on behalf of references from different +* containers +*/ + if (mem_cont && (mm_container(vma->vm_mm) != mem_cont)) + continue; referenced += page_referenced_one(page, vma, &mapcount); if (!mapcount) break; @@ -332,7 +340,8 @@ static int page_referenced_anon(struct p * * This function is only called from page_referenced for object-based pages. */ -static int page_referenced_file(struct page *page) +static int page_referenced_file(struct page *page, + struct mem_container *mem_cont) { unsigned int mapcount; struct address_space *mapping = page->mapping; @@ -365,6 +374,13 @@ static int page_referenced_file(struct p mapcount = page_mapcount(page); vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { + /* +* If we are reclaiming on behalf of a container, skip +* counting on behalf of references from different +* containers +*/ + if (mem_cont && (mm_container(vma->vm_mm) != mem_cont)) + continue; if ((vma->vm_flags & (VM_LOCKED|VM_MAYSHARE)) == (VM_LOCKED|VM_MAYSHARE)) { referenced++; @@ -387,7 +403,8 @@ static int page_referenced_file(struct p * Quick test_and_clear_referenced for all mappings to a page, * returns the number of ptes which referenced the page. 
*/ -int page_referenced(struct page *page, int is_locked) +int page_referenced(struct page *page, int is_locked, + struct mem_container *mem_cont) { int referenced = 0; @@ -399,14 +416,15 @@ int page_referenced(struct page *page, i if (page_mapped(page) && page->mapping) { if (PageAnon(page)) - referenced += page_referenced_anon(page); + referenced += page_referenced_anon(page, mem_cont); else if (is_locked) - referenced += page_referenced_file(page); + referenced += page_referenced_file(page, mem_cont); else if (TestSetPageLocked(page))
[-mm PATCH 8/9] Memory controller add switch to control what type of pages to limit (v6)
Choose if we want cached pages to be accounted or not. By default both are accounted for. A new set of tunables are added. echo -n 1 > mem_control_type switches the accounting to account for only mapped pages echo -n 3 > mem_control_type switches the behaviour back Signed-off-by: <[EMAIL PROTECTED]> --- include/linux/memcontrol.h |9 mm/filemap.c |2 mm/memcontrol.c| 92 + mm/swap_state.c|2 4 files changed, 103 insertions(+), 2 deletions(-) diff -puN include/linux/memcontrol.h~mem-control-choose-rss-vs-rss-and-pagecache include/linux/memcontrol.h --- linux-2.6.23-rc2-mm2/include/linux/memcontrol.h~mem-control-choose-rss-vs-rss-and-pagecache 2007-08-17 13:14:20.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/include/linux/memcontrol.h 2007-08-17 13:14:20.0 +0530 @@ -20,6 +20,8 @@ #ifndef _LINUX_MEMCONTROL_H #define _LINUX_MEMCONTROL_H +#include + struct mem_container; struct page_container; @@ -40,6 +42,7 @@ extern unsigned long mem_container_isola struct mem_container *mem_cont, int active); extern void mem_container_out_of_memory(struct mem_container *mem); +extern int mem_container_cache_charge(struct page *page, struct mm_struct *mm); static inline void mem_container_uncharge_page(struct page *page) { @@ -84,6 +87,12 @@ static inline void mem_container_move_li { } +static inline int mem_container_cache_charge(struct page *page, + struct mm_struct *mm) +{ + return 0; +} + #endif /* CONFIG_CONTAINER_MEM_CONT */ #endif /* _LINUX_MEMCONTROL_H */ diff -puN mm/filemap.c~mem-control-choose-rss-vs-rss-and-pagecache mm/filemap.c --- linux-2.6.23-rc2-mm2/mm/filemap.c~mem-control-choose-rss-vs-rss-and-pagecache 2007-08-17 13:14:20.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/mm/filemap.c2007-08-17 13:14:20.0 +0530 @@ -445,7 +445,7 @@ int add_to_page_cache(struct page *page, if (error == 0) { - error = mem_container_charge(page, current->mm); + error = mem_container_cache_charge(page, current->mm); if (error) goto out; diff -puN mm/memcontrol.c~mem-control-choose-rss-vs-rss-and-pagecache 
mm/memcontrol.c --- linux-2.6.23-rc2-mm2/mm/memcontrol.c~mem-control-choose-rss-vs-rss-and-pagecache 2007-08-17 13:14:20.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/mm/memcontrol.c 2007-08-17 13:14:20.0 +0530 @@ -28,6 +28,8 @@ #include #include +#include + struct container_subsys mem_container_subsys; static const int MEM_CONTAINER_RECLAIM_RETRIES = 5; @@ -59,6 +61,7 @@ struct mem_container { * spin_lock to protect the per container LRU */ spinlock_t lru_lock; + unsigned long control_type; /* control RSS or RSS+Pagecache */ }; /* @@ -81,6 +84,15 @@ struct page_container { /* mapped and cached states */ }; +enum { + MEM_CONTAINER_TYPE_UNSPEC = 0, + MEM_CONTAINER_TYPE_MAPPED, + MEM_CONTAINER_TYPE_CACHED, + MEM_CONTAINER_TYPE_ALL, + MEM_CONTAINER_TYPE_MAX, +} mem_control_type; + +static struct mem_container init_mem_container; static inline struct mem_container *mem_container_from_cont(struct container *cont) @@ -361,6 +373,22 @@ err: } /* + * See if the cached pages should be charged at all? + */ +int mem_container_cache_charge(struct page *page, struct mm_struct *mm) +{ + struct mem_container *mem; + if (!mm) + mm = &init_mm; + + mem = rcu_dereference(mm->mem_container); + if (mem->control_type == MEM_CONTAINER_TYPE_ALL) + return mem_container_charge(page, mm); + else + return 0; +} + +/* * Uncharging is always a welcome operation, we never complain, simply * uncharge. */ @@ -370,6 +398,10 @@ void mem_container_uncharge(struct page_ struct page *page; unsigned long flags; + /* +* This can handle cases when a page is not charged at all and we +* are switching between handling the control_type. 
+*/ if (!pc) return; @@ -405,6 +437,60 @@ static ssize_t mem_container_write(struc cft->private, userbuf, nbytes, ppos); } +static ssize_t mem_control_type_write(struct container *cont, + struct cftype *cft, struct file *file, + const char __user *userbuf, + size_t nbytes, loff_t *pos) +{ + int ret; + char *buf, *end; + unsigned long tmp; + struct mem_container *mem; + + mem = mem_container_from_cont(cont); + buf = kmalloc(nbytes + 1, GFP_KERNEL); + ret = -ENOMEM; + if (buf == NULL) + goto out; + +
[-mm PATCH 6/9] Memory controller add per container LRU and reclaim (v6)
Changelog since v3 1. Added reclaim retry logic to avoid being OOM'ed due to pages from swap cache (coming in due to reclaim) don't overwhelm the container. Changelog 1. Fix probable NULL pointer dereference based on review comments by YAMAMOTO Takashi Add the page_container to the per container LRU. The reclaim algorithm has been modified to make the isolate_lru_pages() as a pluggable component. The scan_control data structure now accepts the container on behalf of which reclaims are carried out. try_to_free_pages() has been extended to become container aware. Signed-off-by: Pavel Emelianov <[EMAIL PROTECTED]> Signed-off-by: <[EMAIL PROTECTED]> --- include/linux/memcontrol.h | 12 +++ include/linux/res_counter.h | 23 +++ include/linux/swap.h|3 mm/memcontrol.c | 135 +++- mm/swap.c |2 mm/vmscan.c | 126 - 6 files changed, 275 insertions(+), 26 deletions(-) diff -puN include/linux/memcontrol.h~mem-control-lru-and-reclaim include/linux/memcontrol.h --- linux-2.6.23-rc2-mm2/include/linux/memcontrol.h~mem-control-lru-and-reclaim 2007-08-17 13:14:20.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/include/linux/memcontrol.h 2007-08-17 13:14:20.0 +0530 @@ -32,6 +32,13 @@ extern void page_assign_page_container(s extern struct page_container *page_get_page_container(struct page *page); extern int mem_container_charge(struct page *page, struct mm_struct *mm); extern void mem_container_uncharge(struct page_container *pc); +extern void mem_container_move_lists(struct page_container *pc, bool active); +extern unsigned long mem_container_isolate_pages(unsigned long nr_to_scan, + struct list_head *dst, + unsigned long *scanned, int order, + int mode, struct zone *z, + struct mem_container *mem_cont, + int active); static inline void mem_container_uncharge_page(struct page *page) { @@ -71,6 +78,11 @@ static inline void mem_container_uncharg { } +static inline void mem_container_move_lists(struct page_container *pc, + bool active) +{ +} + #endif /* CONFIG_CONTAINER_MEM_CONT */ #endif /* 
_LINUX_MEMCONTROL_H */ diff -puN include/linux/res_counter.h~mem-control-lru-and-reclaim include/linux/res_counter.h --- linux-2.6.23-rc2-mm2/include/linux/res_counter.h~mem-control-lru-and-reclaim 2007-08-17 13:14:20.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/include/linux/res_counter.h 2007-08-17 13:14:20.0 +0530 @@ -99,4 +99,27 @@ int res_counter_charge(struct res_counte void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val); void res_counter_uncharge(struct res_counter *counter, unsigned long val); +static inline bool res_counter_limit_check_locked(struct res_counter *cnt) +{ + if (cnt->usage < cnt->limit) + return true; + + return false; +} + +/* + * Helper function to detect if the container is within it's limit or + * not. It's currently called from container_rss_prepare() + */ +static inline bool res_counter_check_under_limit(struct res_counter *cnt) +{ + bool ret; + unsigned long flags; + + spin_lock_irqsave(&cnt->lock, flags); + ret = res_counter_limit_check_locked(cnt); + spin_unlock_irqrestore(&cnt->lock, flags); + return ret; +} + #endif diff -puN include/linux/swap.h~mem-control-lru-and-reclaim include/linux/swap.h --- linux-2.6.23-rc2-mm2/include/linux/swap.h~mem-control-lru-and-reclaim 2007-08-17 13:14:20.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/include/linux/swap.h2007-08-17 13:14:20.0 +0530 @@ -6,6 +6,7 @@ #include #include #include +#include #include #include @@ -191,6 +192,8 @@ extern void swap_setup(void); /* linux/mm/vmscan.c */ extern unsigned long try_to_free_pages(struct zone **zones, int order, gfp_t gfp_mask); +extern unsigned long try_to_free_mem_container_pages(struct mem_container *mem); +extern int __isolate_lru_page(struct page *page, int mode); extern unsigned long shrink_all_memory(unsigned long nr_pages); extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); diff -puN mm/memcontrol.c~mem-control-lru-and-reclaim mm/memcontrol.c --- 
linux-2.6.23-rc2-mm2/mm/memcontrol.c~mem-control-lru-and-reclaim 2007-08-17 13:14:20.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/mm/memcontrol.c 2007-08-17 13:14:20.0 +0530 @@ -24,8 +24,12 @@ #include #include #include +#include +#include +#include struct container_subsys mem_container_subsys; +static const int MEM_CONTAINER_RECLAIM_RETRIES = 5; /* * The memory co
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
On Friday 17 August 2007 05:42, Linus Torvalds wrote:
> On Fri, 17 Aug 2007, Paul Mackerras wrote:
> > I'm really surprised it's as much as a few K. I tried it on powerpc
> > and it only saved 40 bytes (10 instructions) for a G5 config.
>
> One of the things that "volatile" generally screws up is a simple
>
> volatile int i;
>
> i++;

But for atomic_t people use atomic_inc() anyway, which does this correctly. It shouldn't really matter for atomic_t.

I'm worrying a bit that the volatile atomic_t change caused subtle code breakage, like these delay-read loops people here pointed out. Wouldn't it be safer to just re-add the volatile to atomic_read() for 2.6.23? Or alternatively make it asm(), but volatile seems more proven.

-Andi
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
On Fri, 17 Aug 2007, Nick Piggin wrote:
> Satyam Sharma wrote:
> [...]
> > Granted, the above IS buggy code. But, the stated objective is to avoid
> > heisenbugs.
>
> Anyway, why are you making up code snippets that are buggy in other
> ways in order to support this assertion being made that lots of kernel
> code supposedly depends on volatile semantics. Just reference the
> actual code.

Because the point is *not* about existing bugs in kernel code. At some point Chris Snook (who started this thread) did write that "If I knew of the existing bugs in the kernel, I would be sending patches for them, not this series" or something to that effect. The point is about *author expectations*. If people do expect atomic_read() (or a variant thereof) to have volatile semantics, why not give them such a variant?

And by the way, the point is *also* about the fact that cpu_relax(), as of today, implies a full memory clobber, which is not what a lot of such loops want. (due to stuff mentioned elsewhere, summarized in that summary)

> > And we have driver / subsystem maintainers such as Stefan
> > coming up and admitting that often a lot of code that's written to use
> > atomic_read() does assume the read will not be elided by the compiler.

(so it's about compiler barrier expectations only, though I fully agree that those who're using atomic_t as if it were some magic thing that lets them write lockless code are sorely mistaken.)

> So these are broken on i386 and x86-64?

Possibly, but the point is not about existing bugs, as mentioned above. Some such bugs have been found nonetheless -- reminds me, can somebody please apply http://www.gossamer-threads.com/lists/linux/kernel/810674 ?

Satyam
Re: Fork Bombing Patch
I agree that (1) one risks overdoing comments and (2) one should minimize comments inside a function body. But I read your earlier statement:

	Please do not add comments inside functions.

as simply requesting no comments inside function bodies, without exception. That seems to me to be too strong a statement. The "try to avoid" in the CodingStyle I read as meaning roughly "minimize", which is consistent with what it further states, allowing for "small comments" in certain cases.

Admittedly, I probably agree with you that the comment that Anand added in his patch was one of those comments that didn't add enough value to be justified. I was just quibbling over the wording of your objections to it.

... I'll shut up now.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401
Re: [PATCH/RFC 3/4]Introduce "account modifiers" mechanism
Rusty Russell wrote:
> On Fri, 2007-08-17 at 09:35 +0200, Laurent Vivier wrote:
>> Rusty Russell wrote:
>>> Hi Laurent,
>>
>> Hi Rusty,
>> how are your puppies ?
>
> They're getting a little fat, actually. Too many features ...
>
>> - remove PATCH 3, and add in task_struct a "ktime vtime" where we accumulate
>> guest time (by calling something like guest_enter() and guest_exit() from the
>> virtualization engine), and when in account_system_time() we have cputime >
>> vtime we subtract vtime from cputime and add vtime to user time and guest
>> time.
>> But doing like this we freeze in kernel/sched.c the link between system time,
>> user time and guest time (i.e. system time = system time - vtime, user time =
>> user time + vtime and guest time = guest time + vtime).
>
> Actually, I think we can set a per-cpu "in_guest" flag for the scheduler
> code, which then knows to add the tick to the guest time. That seems
> the simplest possible solution.
>
> lguest or kvm would set the flag before running the guest (which is done
> with preempt disabled or using preemption hooks), and reset it
> afterwards.
>
> Thoughts?

It was my first attempt (except I didn't have a per-cpu flag, but a per-task flag); it's not visible, but I love simplicity... ;-)

A KVM VCPU is stopped by preemption, so when we enter the scheduler we have exited from the VCPU and thus this flag is off (so we account 0 to the guest). What I did then is set the flag on when we enter the VCPU, and account_system_time() sets the flag off when it adds this timeslice to cpustat (and correctly computes guest, user and system time). But I didn't like this idea, because all code executed after we entered the VCPU is accounted to the guest until we reach an account_system_time(), and I suppose we can have real system time in this part. And I guess a VCPU can run for less than 1 ms (the unit of cputime) in a timeslice.

So? What's best?
Laurent
--
- [EMAIL PROTECTED] -- "Software is hard" - Donald Knuth
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Satyam Sharma wrote:
> On Fri, 17 Aug 2007, Nick Piggin wrote:
> > > Sure, now that I learned of these properties I can start to audit code
> > > and insert barriers where I believe they are needed, but this simply
> > > means that almost all occurrences of atomic_read will get barriers
> > > (unless there already are implicit but more or less obvious barriers
> > > like msleep).
> >
> > You might find that these places that appear to need barriers are buggy
> > for other reasons anyway. Can you point to some in-tree code we can have
> > a look at?
>
> Such code was mentioned elsewhere (query nodemgr_host_thread in cscope)
> that managed to escape the requirement for a barrier only because of some
> completely un-obvious compilation-unit-scope thing. But I find such a
> non-explicit barrier quite bad taste. Stefan, do consider plunking an
> explicit call to barrier() there.

It is very obvious. msleep calls schedule() (ie. sleeps), which is always a barrier. The "unobvious" thing is that you wanted to know how the compiler knows a function is a barrier -- the answer is that if it does not *know* it is not a barrier, it must assume it is a barrier. If the whole msleep call chain including the scheduler were defined static in the current compilation unit, then it would still be a barrier, because it would actually be able to see the barriers in schedule(void), if nothing else.
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Satyam Sharma wrote: On Fri, 17 Aug 2007, Nick Piggin wrote: Also, why would you want to make these insane accessors for atomic_t types? Just make sure everybody knows the basics of barriers, and they can apply that knowledge to atomic_t and all other lockless memory accesses as well. Code that looks like: while (!atomic_read(&v)) { ... cpu_relax_no_barrier(); forget(v.counter); } would be uglier. Also think about code such as: I think they would both be equally ugly, but the atomic_read_volatile variant would be more prone to subtle bugs because of the weird implementation. And it would be more ugly than introducing an order(x) statement for all memory operations, and adding an order_atomic() wrapper for it for atomic types. a = atomic_read(); if (!a) do_something(); forget(); a = atomic_read(); ... /* some code that depends on value of a, obviously */ forget(); a = atomic_read(); ... So much explicit sprinkling of "forget()" looks ugly. Firstly, why is it ugly? It's nice because of those nice explicit statements there that give us a good heads up and would have some comments attached to them (also, lack of the word "volatile" is always a plus). Secondly, what sort of code would do such a thing? In most cases, it is probably riddled with bugs anyway (unless it is doing a really specific sequence of interrupts or something, but in that case it is very likely to either require locking or busy waits anyway -> ie. barriers). on the other hand, looks neater. The "_volatile()" suffix makes it also no less explicit than an explicit barrier-like macro that this primitive is something "special", for code clarity purposes. Just don't use the word volatile, and have barriers both before and after the memory operation, and I'm OK with it. I don't see the point though, when you could just have a single barrier(x) barrier function defined for all memory locations, rather than this odd thing that only works for atomics (and would have to be duplicated for atomic_set. 
[-mm PATCH 3/9] Memory controller accounting setup (v6)
From: Pavel Emelianov <[EMAIL PROTECTED]> Changelog for v5 1. Remove inclusion of memcontrol.h from mm_types.h Changelog As per Paul's review comments 1. Drop css_get() for the root memory container 2. Use mem_container_from_task() as an optimization instead of using mem_container_from_cont() along with task_container. Basic setup routines, the mm_struct has a pointer to the container that it belongs to and the the page has a page_container associated with it. Signed-off-by: Pavel Emelianov <[EMAIL PROTECTED]> Signed-off-by: <[EMAIL PROTECTED]> --- include/linux/memcontrol.h | 36 include/linux/mm_types.h |6 include/linux/sched.h |1 kernel/fork.c | 11 ++-- mm/memcontrol.c| 57 + 5 files changed, 104 insertions(+), 7 deletions(-) diff -puN include/linux/memcontrol.h~mem-control-accounting-setup include/linux/memcontrol.h --- linux-2.6.23-rc2-mm2/include/linux/memcontrol.h~mem-control-accounting-setup 2007-08-17 13:14:19.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/include/linux/memcontrol.h 2007-08-17 13:14:19.0 +0530 @@ -3,6 +3,9 @@ * Copyright IBM Corporation, 2007 * Author Balbir Singh <[EMAIL PROTECTED]> * + * Copyright 2007 OpenVZ SWsoft Inc + * Author: Pavel Emelianov <[EMAIL PROTECTED]> + * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or @@ -17,5 +20,38 @@ #ifndef _LINUX_MEMCONTROL_H #define _LINUX_MEMCONTROL_H +struct mem_container; +struct page_container; + +#ifdef CONFIG_CONTAINER_MEM_CONT + +extern void mm_init_container(struct mm_struct *mm, struct task_struct *p); +extern void mm_free_container(struct mm_struct *mm); +extern void page_assign_page_container(struct page *page, + struct page_container *pc); +extern struct page_container *page_get_page_container(struct page *page); + +#else /* CONFIG_CONTAINER_MEM_CONT */ +static inline void mm_init_container(struct mm_struct *mm, + struct task_struct *p) 
+{ +} + +static inline void mm_free_container(struct mm_struct *mm) +{ +} + +static inline void page_assign_page_container(struct page *page, + struct page_container *pc) +{ +} + +static inline struct page_container *page_get_page_container(struct page *page) +{ + return NULL; +} + +#endif /* CONFIG_CONTAINER_MEM_CONT */ + #endif /* _LINUX_MEMCONTROL_H */ diff -puN include/linux/mm_types.h~mem-control-accounting-setup include/linux/mm_types.h --- linux-2.6.23-rc2-mm2/include/linux/mm_types.h~mem-control-accounting-setup 2007-08-17 13:14:19.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/include/linux/mm_types.h2007-08-17 13:14:19.0 +0530 @@ -95,6 +95,9 @@ struct page { unsigned int gfp_mask; unsigned long trace[8]; #endif +#ifdef CONFIG_CONTAINER_MEM_CONT + unsigned long page_container; +#endif }; /* @@ -226,6 +229,9 @@ struct mm_struct { /* aio bits */ rwlock_tioctx_list_lock; struct kioctx *ioctx_list; +#ifdef CONFIG_CONTAINER_MEM_CONT + struct mem_container *mem_container; +#endif }; #endif /* _LINUX_MM_TYPES_H */ diff -puN include/linux/sched.h~mem-control-accounting-setup include/linux/sched.h --- linux-2.6.23-rc2-mm2/include/linux/sched.h~mem-control-accounting-setup 2007-08-17 13:14:19.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/include/linux/sched.h 2007-08-17 13:14:19.0 +0530 @@ -87,6 +87,7 @@ struct sched_param { #include +struct mem_container; struct exec_domain; struct futex_pi_state; struct bio; diff -puN kernel/fork.c~mem-control-accounting-setup kernel/fork.c --- linux-2.6.23-rc2-mm2/kernel/fork.c~mem-control-accounting-setup 2007-08-17 13:14:19.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/kernel/fork.c 2007-08-17 13:14:19.0 +0530 @@ -50,6 +50,7 @@ #include #include #include +#include #include #include @@ -328,7 +329,7 @@ __cacheline_aligned_in_smp DEFINE_SPINLO #include -static struct mm_struct * mm_init(struct mm_struct * mm) +static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p) { atomic_set(&mm->mm_users, 1); atomic_set(&mm->mm_count, 
1); @@ -345,11 +346,14 @@ static struct mm_struct * mm_init(struct mm->ioctx_list = NULL; mm->free_area_cache = TASK_UNMAPPED_BASE; mm->cached_hole_size = ~0UL; + mm_init_container(mm, p); if (likely(!mm_alloc_pgd(mm))) { mm->def_flags = 0; return mm; } + + mm_free_container(mm); free_mm(mm); return NULL; } @@ -364,7 +368,7 @@ struct mm_struct * mm_alloc(void)
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Satyam Sharma wrote: On Fri, 17 Aug 2007, Nick Piggin wrote: Satyam Sharma wrote: [...] Granted, the above IS buggy code. But, the stated objective is to avoid heisenbugs. ^^ Anyway, why are you making up code snippets that are buggy in other ways in order to support this assertion being made that lots of kernel code supposedly depends on volatile semantics. Just reference the actual code. Because the point is *not* about existing bugs in kernel code. At some point Chris Snook (who started this thread) did write that "If I knew of the existing bugs in the kernel, I would be sending patches for them, not this series" or something to that effect. The point is about *author expecations*. If people do expect atomic_read() (or a variant thereof) to have volatile semantics, why not give them such a variant? Because they should be thinking about them in terms of barriers, over which the compiler / CPU is not to reorder accesses or cache memory operations, rather than "special" "volatile" accesses. Linux's whole memory ordering and locking model is completely geared around the former. And by the way, the point is *also* about the fact that cpu_relax(), as of today, implies a full memory clobber, which is not what a lot of such loops want. (due to stuff mentioned elsewhere, summarized in that summary) That's not the point, because as I also mentioned, the logical extention to Linux's barrier API to handle this is the order(x) macro. Again, not special volatile accessors. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Fix the return value of sys_set_tid_address()
This call should return the virtual pid to the caller, just like sys_getpid()/sys_gettid() do, not the global one. Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]> Signed-off-by: Alexey Dobriyan <[EMAIL PROTECTED]> Cc: Sukadev Bhattiprolu <[EMAIL PROTECTED]> Cc: Oleg Nesterov <[EMAIL PROTECTED]> --- kernel/fork.c |2 +- 1 files changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/fork.c b/kernel/fork.c index 645c614..d7cff48 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -938,7 +938,7 @@ asmlinkage long sys_set_tid_address(int { current->clear_child_tid = tidptr; - return current->pid; + return task_pid_vnr(current); } static inline void rt_mutex_init_task(struct task_struct *p)
[PATCH] Use find_task_by_pid_ns() in places that operate with virtual pids
When the pid comes from userspace, find_task_by_pid_ns() should be used to find the task by pid in a particular (usually the current) namespace. These places were lost in earlier patches. Think over: all these places work like this: if (pid == 0) task = current; else task = find_task_by_pid_ns(pid); the question is: is it worth introducing a common helper for such a case and (if it is) what should its name be? Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]> Signed-off-by: Alexey Dobriyan <[EMAIL PROTECTED]> Cc: Sukadev Bhattiprolu <[EMAIL PROTECTED]> Cc: Oleg Nesterov <[EMAIL PROTECTED]> --- These are lost hunks of the -mm patch named pid-namespaces-changes-to-show-virtual-ids-to-user.patch fs/ioprio.c |6 -- kernel/futex.c|6 -- kernel/futex_compat.c |3 ++- kernel/sched.c|3 ++- mm/mempolicy.c|3 ++- mm/migrate.c |3 ++- 6 files changed, 16 insertions(+), 8 deletions(-) diff --git a/fs/ioprio.c b/fs/ioprio.c index c99fe7b..0a615f8 100644 --- a/fs/ioprio.c +++ b/fs/ioprio.c @@ -94,7 +94,8 @@ asmlinkage long sys_ioprio_set(int which if (!who) p = current; else - p = find_task_by_pid(who); + p = find_task_by_pid_ns(who, + current->nsproxy->pid_ns); if (p) ret = set_task_ioprio(p, ioprio); break; @@ -181,7 +182,8 @@ asmlinkage long sys_ioprio_get(int which if (!who) p = current; else - p = find_task_by_pid(who); + p = find_task_by_pid_ns(who, + current->nsproxy->pid_ns); if (p) ret = get_task_ioprio(p); break; diff --git a/kernel/futex.c b/kernel/futex.c index c61634a..0c1b777 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -444,7 +444,8 @@ static struct task_struct * futex_find_g struct task_struct *p; rcu_read_lock(); - p = find_task_by_pid(pid); + p = find_task_by_pid_ns(pid, + current->nsproxy->pid_ns); if (!p || ((current->euid != p->euid) && (current->euid != p->uid))) p = ERR_PTR(-ESRCH); @@ -1855,7 +1856,8 @@ sys_get_robust_list(int pid, struct robu ret = -ESRCH; rcu_read_lock(); - p = find_task_by_pid(pid); + p = find_task_by_pid_ns(pid, + 
current->nsproxy->pid_ns); if (!p) goto err_unlock; ret = -EPERM; diff --git a/kernel/futex_compat.c b/kernel/futex_compat.c index f792136..e7563bf 100644 --- a/kernel/futex_compat.c +++ b/kernel/futex_compat.c @@ -116,7 +116,8 @@ compat_sys_get_robust_list(int pid, comp ret = -ESRCH; read_lock(&tasklist_lock); - p = find_task_by_pid(pid); + p = find_task_by_pid_ns(pid, + current->nsproxy->pid_ns); if (!p) goto err_unlock; ret = -EPERM; diff --git a/kernel/sched.c b/kernel/sched.c index 4359b6b..1f81803 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -4108,7 +4108,8 @@ struct task_struct *idle_task(int cpu) */ static inline struct task_struct *find_process_by_pid(pid_t pid) { - return pid ? find_task_by_pid(pid) : current; + return pid ? + find_task_by_pid_ns(pid, current->nsproxy->pid_ns) : current; } /* Actually do priority change: must hold rq lock. */ diff --git a/mm/mempolicy.c b/mm/mempolicy.c index e7eb2bb..934abc9 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -930,7 +930,8 @@ asmlinkage long sys_migrate_pages(pid_t /* Find the mm_struct */ read_lock(&tasklist_lock); - task = pid ? find_task_by_pid(pid) : current; + task = pid ? + find_task_by_pid_ns(pid, current->nsproxy->pid_ns) : current; if (!task) { read_unlock(&tasklist_lock); return -ESRCH; diff --git a/mm/migrate.c b/mm/migrate.c index 477b894..7f43c5f 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -917,7 +917,8 @@ asmlinkage long sys_move_pages(pid_t pid /* Find the mm_struct */ read_lock(&tasklist_lock); - task = pid ? find_task_by_pid(pid) : current; + task = pid ? + find_task_by_pid_ns(pid, current->nsproxy->pid_ns) : current; if (!task) { read_unlock(&tasklist_lock); return -ESRCH; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of
[RFC][PATCH] maps: show swapped out pages in smaps
Show the amount of swapped out pages in /proc/<pid>/smaps. Currently there's no way to know who is using the swap file. A possibly better way to support it is to add a new counter to struct mm_struct. Or maybe it's not that important at all? Signed-off-by: Fengguang Wu <[EMAIL PROTECTED]> --- fs/proc/task_mmu.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) --- linux-2.6.23-rc2-mm2.orig/fs/proc/task_mmu.c +++ linux-2.6.23-rc2-mm2/fs/proc/task_mmu.c @@ -324,6 +324,7 @@ struct mem_size_stats unsigned long private_clean; unsigned long private_dirty; unsigned long referenced; + unsigned long swapped; /* * Proportional Set Size(PSS): my share of RSS. @@ -367,8 +368,11 @@ static int smaps_pte_range(pmd_t *pmd, u pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); for (; addr != end; pte++, addr += PAGE_SIZE) { ptent = *pte; - if (!pte_present(ptent)) + if (!pte_present(ptent)) { + if (!pte_none(ptent) && !pte_file(ptent)) + mss->swapped += PAGE_SIZE; continue; + } mss->resident += PAGE_SIZE; @@ -425,7 +429,8 @@ static int show_smap(struct seq_file *m, "Shared_Dirty: %8lu kB\n" "Private_Clean: %8lu kB\n" "Private_Dirty: %8lu kB\n" - "Referenced: %8lu kB\n", + "Referenced: %8lu kB\n" + "Swapped:%8lu kB\n", (vma->vm_end - vma->vm_start) >> 10, sarg.mss.resident >> 10, (unsigned long)(sarg.mss.pss >> (10 + PSS_ERROR_BITS)), @@ -433,7 +438,8 @@ static int show_smap(struct seq_file *m, sarg.mss.shared_dirty >> 10, sarg.mss.private_clean >> 10, sarg.mss.private_dirty >> 10, - sarg.mss.referenced >> 10); + sarg.mss.referenced>> 10, + sarg.mss.swapped >> 10); return ret; }
[PATCH 2.6.22.3] ppp: fix output buffer size in ppp_decompress_frame
This patch addresses the issue with "osize too small" errors in mppe encryption. The patch fixes the wrong output buffer size being passed to the ppp decompression routine. Signed-off-by: Konstantin Sharlaimov <[EMAIL PROTECTED]> --- As pointed out by Suresh Mahalingam, the issue addressed by the ppp-fix-osize-too-small-errors-when-decoding patch is not fully resolved yet. The size of the allocated output buffer is correct; however, the size passed to ppp->rcomp->decompress in ppp_generic.c is wrong. The patch fixes that. --- linux-2.6.21.3/drivers/net/ppp_generic.c.orig 2007-08-17 20:35:03.0 +1100 +++ linux-2.6.21.3/drivers/net/ppp_generic.c2007-08-17 20:35:45.0 +1100 @@ -1726,7 +1726,7 @@ ppp_decompress_frame(struct ppp *ppp, st } /* the decompressor still expects the A/C bytes in the hdr */ len = ppp->rcomp->decompress(ppp->rc_state, skb->data - 2, - skb->len + 2, ns->data, ppp->mru + PPP_HDRLEN); + skb->len + 2, ns->data, obuff_size); if (len < 0) { /* Pass the compressed frame to pppd as an error indication. */
Re: [PATCH] Smack: Simplified Mandatory Access Control Kernel
Btw, at: diff -uprN -X linux-2.6.22-base/Documentation/dontdiff linux-2.6.22-base/security/smack/Kconfig linux-2.6.22/security/smack/Kconfig --- linux-2.6.22-base/security/smack/Kconfig1969-12-31 16:00:00.0 -0800 +++ linux-2.6.22/security/smack/Kconfig 2007-07-10 01:08:05.0 -0700 @@ -0,0 +1,10 @@ +config SECURITY_SMACK + bool "Simplified Mandatory Access Control Kernel Support" + depends on NETLABEL && SECURITY_NETWORK + default n + help + This selects the Simplified Mandatory Access Control Kernel. + SMACK is useful for sensitivity, integrity, and a variety + of other madatory security schemes. + If you are unsure how to answer this question, answer N. + change: + of other madatory security schemes. to: + of other mandatory security schemes. -- Miguel Ojeda http://maxextreme.googlepages.com/index.htm - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Satyam Sharma writes: > I wonder if this'll generate smaller and better code than _both_ the > other atomic_read_volatile() variants. Would need to build allyesconfig > on lots of diff arch's etc to test the theory though. I'm sure it would be a tiny effect. This whole thread is arguing about effects that are quite insignificant. On the one hand we have the non-volatile proponents, who want to let the compiler do extra optimizations - which amounts to letting it elide maybe a dozen loads in the whole kernel, loads which would almost always be L1 cache hits. On the other hand we have the volatile proponents, who are concerned that some code somewhere in the kernel might be buggy without the volatile behaviour, and who also want to be able to remove some barriers and thus save a few bytes of code and a few loads here and there (and possibly some stores too). Either way the effect on code size and execution time is miniscule. In the end the strongest argument is actually that gcc generates unnecessarily verbose code on x86[-64] for volatile accesses. Even then we're only talking about ~2000 bytes, or less than 1 byte per instance of atomic_read on average, about 0.06% of the kernel text size. The x86[-64] developers seem to be willing to bear the debugging cost involved in having the non-volatile behaviour for atomic_read, in order to save the 2kB. That's fine with me. Either way I think somebody should audit all the uses of atomic_read, not just for missing barriers, but also to find the places where it's used in a racy manner. Then we can work out where the races matter and fix them if they do. Paul. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
On Fri, 17 Aug 2007, Nick Piggin wrote: > Satyam Sharma wrote: > > > > On Fri, 17 Aug 2007, Nick Piggin wrote: > > > > > Sure, now > > > > that I learned of these properties I can start to audit code and insert > > > > barriers where I believe they are needed, but this simply means that > > > > almost all occurrences of atomic_read will get barriers (unless there > > > > already are implicit but more or less obvious barriers like msleep). > > > > > > You might find that these places that appear to need barriers are > > > buggy for other reasons anyway. Can you point to some in-tree code > > > we can have a look at? > > > > > > Such code was mentioned elsewhere (query nodemgr_host_thread in cscope) > > that managed to escape the requirement for a barrier only because of > > some completely un-obvious compilation-unit-scope thing. But I find such > > an non-explicit barrier quite bad taste. Stefan, do consider plunking an > > explicit call to barrier() there. > > It is very obvious. msleep calls schedule() (ie. sleeps), which is > always a barrier. Probably you didn't mean that, but no, schedule() is not a barrier because it sleeps. It's a barrier because it's invisible. > The "unobvious" thing is that you wanted to know how the compiler knows > a function is a barrier -- answer is that if it does not *know* it is not > a barrier, it must assume it is a barrier. True, that's clearly what happens here. But you're definitely joking that this is "obvious" in terms of code clarity, right? Just 5 minutes back you mentioned elsewhere you like seeing lots of explicit calls to barrier() (with comments, no less, hmm? :-)
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
On Fri, 17 Aug 2007, Andi Kleen wrote: > On Friday 17 August 2007 05:42, Linus Torvalds wrote: > > On Fri, 17 Aug 2007, Paul Mackerras wrote: > > > I'm really surprised it's as much as a few K. I tried it on powerpc > > > and it only saved 40 bytes (10 instructions) for a G5 config. > > > > One of the things that "volatile" generally screws up is a simple > > > > volatile int i; > > > > i++; > > But for atomic_t people use atomic_inc() anyways which does this correctly. > It shouldn't really matter for atomic_t. > > I'm worrying a bit that the volatile atomic_t change caused subtle code > breakage like these delay read loops people here pointed out. Umm, I followed most of the thread, but which breakage is this? > Wouldn't it be safer to just re-add the volatile to atomic_read() > for 2.6.23? Or alternatively make it asm(), but volatile seems more > proven. The problem with volatile is not just trashy code generation (which also definitely is a major problem), but definition holes, and implementation inconsistencies. Making it asm() is not the only other alternative to volatile either (read another reply to this mail), but considering most of the thread has been about people not wanting even a atomic_read_volatile() variant, making atomic_read() itself have volatile semantics sounds ... strange :-) PS: http://lkml.org/lkml/2007/8/15/407 was submitted a couple days back, any word if you saw that? I have another one for you: [PATCH] i386, x86_64: __const_udelay() should not be marked inline Because it can never get inlined in any callsite (each translation unit is compiled separately for the kernel and so the implementation of __const_udelay() would be invisible to all other callsites). 
In fact it turns out, the correctness of callsites at arch/x86_64/kernel/crash.c:97 and arch/i386/kernel/crash.c:101 explicitly _depends_ upon it not being inlined, and also it's an exported symbol (modules may want to call mdelay() and udelay() that often becomes __const_udelay() after some macro-ing in various headers). So let's not mark it as "inline" either. Signed-off-by: Satyam Sharma <[EMAIL PROTECTED]> --- arch/i386/lib/delay.c |2 +- arch/x86_64/lib/delay.c |2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/i386/lib/delay.c b/arch/i386/lib/delay.c index f6edb11..0082c99 100644 --- a/arch/i386/lib/delay.c +++ b/arch/i386/lib/delay.c @@ -74,7 +74,7 @@ void __delay(unsigned long loops) delay_fn(loops); } -inline void __const_udelay(unsigned long xloops) +void __const_udelay(unsigned long xloops) { int d0; diff --git a/arch/x86_64/lib/delay.c b/arch/x86_64/lib/delay.c index 2dbebd3..d0cd9cd 100644 --- a/arch/x86_64/lib/delay.c +++ b/arch/x86_64/lib/delay.c @@ -38,7 +38,7 @@ void __delay(unsigned long loops) } EXPORT_SYMBOL(__delay); -inline void __const_udelay(unsigned long xloops) +void __const_udelay(unsigned long xloops) { __delay(((xloops * HZ * cpu_data[raw_smp_processor_id()].loops_per_jiffy) >> 32) + 1); } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Fix check for return value of create_pid_namespace
The function in question returns ERR_PTR-s to indicate the error, while the caller checks for return value to be NULL. Found during testing the OpenVZ kernel with pid namespaces. Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]> Signed-off-by: Alexey Dobriyan <[EMAIL PROTECTED]> Cc: Sukadev Bhattiprolu <[EMAIL PROTECTED]> Cc: Oleg Nesterov <[EMAIL PROTECTED]> --- kernel/pid.c |2 +- 1 files changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/pid.c b/kernel/pid.c index faad017..cb1f3cb 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -543,7 +543,7 @@ struct pid_namespace *copy_pid_ns(unsign goto out_put; new_ns = create_pid_namespace(old_ns->level + 1); - if (new_ns != NULL) + if (!IS_ERR(new_ns)) new_ns->parent = get_pid_ns(old_ns); out_put: - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
On Fri, 17 Aug 2007, Nick Piggin wrote: > Satyam Sharma wrote: > > > > On Fri, 17 Aug 2007, Nick Piggin wrote: > > > > Also, why would you want to make these insane accessors for atomic_t > > > types? Just make sure everybody knows the basics of barriers, and they > > > can apply that knowledge to atomic_t and all other lockless memory > > > accesses as well. > > > > > > Code that looks like: > > > > while (!atomic_read(&v)) { > > ... > > cpu_relax_no_barrier(); > > forget(v.counter); > > > > } > > > > would be uglier. Also think about code such as: > > I think they would both be equally ugly, You think both these are equivalent in terms of "looks": | while (!atomic_read(&v)) { | while (!atomic_read_xxx(&v)) { ... | ... cpu_relax_no_barrier(); | cpu_relax_no_barrier(); order_atomic(&v); | } } | (where order_atomic() is an atomic_t specific wrapper as you mentioned below) ? Well, taste varies, but ... > but the atomic_read_volatile > variant would be more prone to subtle bugs because of the weird > implementation. What bugs? > And it would be more ugly than introducing an order(x) statement for > all memory operations, and adding an order_atomic() wrapper for it > for atomic types. Oh, that order() / forget() macro [forget() was named such by Chuck Ebbert earlier in this thread where he first mentioned it, btw] could definitely be generically introduced for any memory operations. > > a = atomic_read(); > > if (!a) > > do_something(); > > > > forget(); > > a = atomic_read(); > > ... /* some code that depends on value of a, obviously */ > > > > forget(); > > a = atomic_read(); > > ... > > > > So much explicit sprinkling of "forget()" looks ugly. > > Firstly, why is it ugly? It's nice because of those nice explicit > statements there that give us a good heads up and would have some > comments attached to them atomic_read_xxx (where xxx = whatever naming sounds nice to you) would obviously also give a heads up, and could also have some comments attached to it. 
> (also, lack of the word "volatile" is always a plus). Ok, xxx != volatile. > Secondly, what sort of code would do such a thing? See the nodemgr_host_thread() that does something similar, though not exactly same. > > on the other hand, looks neater. The "_volatile()" suffix makes it also > > no less explicit than an explicit barrier-like macro that this primitive > > is something "special", for code clarity purposes. > > Just don't use the word volatile, That sounds amazingly frivolous, but hey, why not. As I said, ok, xxx != volatile. > and have barriers both before and after the memory operation, How could that lead to bugs? (if you can point to existing code, but just some testcase / sample code would be fine as well). > [...] I don't see > the point though, when you could just have a single barrier(x) > barrier function defined for all memory locations, As I said, barrier() is too heavy-handed. > rather than > this odd thing that only works for atomics Why would it work only for atomics? You could use that generic macro for anything you well damn please. > (and would have to > be duplicated for atomic_set. #define atomic_set_xxx for something similar. Big deal ... NOT. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
On Fri, 17 Aug 2007, Paul Mackerras wrote: > Satyam Sharma writes: > > > I wonder if this'll generate smaller and better code than _both_ the > > other atomic_read_volatile() variants. Would need to build allyesconfig > > on lots of diff arch's etc to test the theory though. > > I'm sure it would be a tiny effect. > > This whole thread is arguing about effects that are quite > insignificant. Hmm, the fact that this thread became what it did, probably means that most developers on this list do not mind thinking/arguing about effects or optimizations that are otherwise "tiny". But yeah, they are tiny nonetheless. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: nf_conntrack_ipv4 must be loaded explicitly
On Aug 2 2007 20:33, Patrick McHardy wrote: >> End result: >> >> After loading nf_conntrack_ipv4.ko, everything works again (also with the >> "bad" ff09b7). But I have to load it explicitly, and I think that >> unfortunately breaks a lot of setups (such as mine) which assume ipv4 >> connection tracking is always there. > > I already have a patch for this queued. I'll push it upstream once > I get a new power supply for the box I keep that tree on, hopefully > tommorrow. What's that patch looking like? thanks, Jan -- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
On Fri, 17 Aug 2007, Nick Piggin wrote: > Satyam Sharma wrote: > > [...] > > The point is about *author expecations*. If people do expect atomic_read() > > (or a variant thereof) to have volatile semantics, why not give them such > > a variant? > > Because they should be thinking about them in terms of barriers, over > which the compiler / CPU is not to reorder accesses or cache memory > operations, rather than "special" "volatile" accesses. This is obviously just a taste thing. Whether to have that forget(x) barrier as something author should explicitly sprinkle appropriately in appropriate places in the code by himself or use a primitive that includes it itself. I'm not saying "taste matters aren't important" (they are), but I'm really skeptical if most folks would find the former tasteful. > > And by the way, the point is *also* about the fact that cpu_relax(), as > > of today, implies a full memory clobber, which is not what a lot of such > > loops want. (due to stuff mentioned elsewhere, summarized in that summary) > > That's not the point, That's definitely the point, why not. This is why "barrier()", being heavy-handed, is not the best option. > because as I also mentioned, the logical extention > to Linux's barrier API to handle this is the order(x) macro. Again, not > special volatile accessors. Sure, that forget(x) macro _is_ proposed to be made part of the generic API. Doesn't explain why not to define/use primitives that has volatility semantics in itself, though (taste matters apart). - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
recursive use of bust_spinlocks()
Various architectures may call bust_spinlocks() recursively (when calling die() in the context of an unresolved page fault); the function itself, however, doesn't appear to be meant to be called in this manner. Nevertheless, this doesn't appear to be a problem as long as bust_spinlocks(0) doesn't get called twice in a row (otherwise, unblank_screen() may enter the scheduler). However, at least on i386 die() has been capable of returning (and on other architectures this should really be that way, too) when notify_die() returns NOTIFY_STOP. The question now is: Should bust_spinlocks() increment/decrement oops_in_progress (and kick klogd only when the count drops to zero), or should we just avoid entering the scheduler by forcibly setting oops_in_progress to 1 prior to calling unblank_screen(), or should all architectures currently doing so avoid calling bust_spinlocks() recursively? Further, many (if not all) architectures seem to have adopted the recursive die() invocation protection. However, the logic there unconditionally calls spin_unlock_irqrestore() (besides also allowing for bust_spinlocks() to be used recursively), instead of undoing just what had been done for the current function invocation. Suggestions? Thanks, Jan
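The first option (counting) can be modelled in user space as follows. This is only a sketch of the suggested semantics, not the kernel implementation; unblank_calls stands in for the unblank_screen()/klogd work:

```c
#include <assert.h>

/* Counting variant of bust_spinlocks() as suggested above (sketch):
 * bust_spinlocks(1) increments oops_in_progress, bust_spinlocks(0)
 * decrements it, and the scheduler-visible work (unblank_screen(),
 * kicking klogd) runs only when the count drops back to zero.  That
 * makes recursive die() -> bust_spinlocks() nesting safe. */
static int oops_in_progress;
static int unblank_calls;	/* stands in for unblank_screen() + klogd */

static void bust_spinlocks(int yes)
{
	if (yes) {
		++oops_in_progress;
	} else if (oops_in_progress && --oops_in_progress == 0) {
		/* only the outermost exit may enter the scheduler */
		++unblank_calls;
	}
}
```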
Re: [another git patch] move USB net drivers to drivers/net
Hi, On May 10 2007 18:12, Jan Engelhardt wrote: >Subject: Re: [another git patch] move USB net drivers to drivers/net > >Hi Jeff, > > >On May 9 2007 21:38, Jeff Garzik wrote: >>diff --git a/drivers/net/Makefile b/drivers/net/Makefile >>index 59c0459..c5d8423 100644 >>--- a/drivers/net/Makefile >>+++ b/drivers/net/Makefile >>@@ -206,6 +206,14 @@ obj-$(CONFIG_TR) += tokenring/ >> obj-$(CONFIG_WAN) += wan/ >> obj-$(CONFIG_ARCNET) += arcnet/ >> obj-$(CONFIG_NET_PCMCIA) += pcmcia/ >>+ >>+obj-$(CONFIG_USB_CATC) += usb/ >>+obj-$(CONFIG_USB_KAWETH)+= usb/ >>+obj-$(CONFIG_USB_PEGASUS) += usb/ >>+obj-$(CONFIG_USB_RTL8150) += usb/ >>+obj-$(CONFIG_USB_USBNET)+= usb/ >>+obj-$(CONFIG_USB_ZD1201)+= usb/ >>+ > >This looks, well, a bit ugly. Since we seem to be going in the direction >of using "menuconfig"s (Kconfig language object) anyway, I propose the >patch below. Please consider, thank you. > >diff --git a/drivers/net/usb/Kconfig b/drivers/net/usb/Kconfig >new file mode 100644 >index 000..3de564b >--- /dev/null >+++ b/drivers/net/usb/Kconfig >@@ -0,0 +1,338 @@ >+# >+# USB Network devices configuration >+# >+comment "Networking support is needed for USB Network Adapter support" >+ depends on USB && !NET > >This comment is superfluous. If !NET, then NETDEVICES is not even available. >Since you just moved/renamed it, you are not to blame. Though, I remove >it in below's patch (applies on top of >commit 5b2fc499917e5897a13add780e181b4cef197072). 
> > >Signed-off-by: Jan Engelhardt <[EMAIL PROTECTED]> > >diff --git a/drivers/net/Makefile b/drivers/net/Makefile >index c5d8423..5f6112d 100644 >--- a/drivers/net/Makefile >+++ b/drivers/net/Makefile >@@ -207,12 +207,7 @@ obj-$(CONFIG_WAN) += wan/ > obj-$(CONFIG_ARCNET) += arcnet/ > obj-$(CONFIG_NET_PCMCIA) += pcmcia/ > >-obj-$(CONFIG_USB_CATC) += usb/ >-obj-$(CONFIG_USB_KAWETH)+= usb/ >-obj-$(CONFIG_USB_PEGASUS) += usb/ >-obj-$(CONFIG_USB_RTL8150) += usb/ >-obj-$(CONFIG_USB_USBNET)+= usb/ >-obj-$(CONFIG_USB_ZD1201)+= usb/ >+obj-$(CONFIG_NETDEV_USB) += usb/ > > obj-y += wireless/ > obj-$(CONFIG_NET_TULIP) += tulip/ >diff --git a/drivers/net/usb/Kconfig b/drivers/net/usb/Kconfig >index 3de564b..2458f99 100644 >--- a/drivers/net/usb/Kconfig >+++ b/drivers/net/usb/Kconfig >@@ -1,11 +1,12 @@ > # > # USB Network devices configuration > # >-comment "Networking support is needed for USB Network Adapter support" >- depends on USB && !NET >- >-menu "USB Network Adapters" >+menuconfig NETDEVICES_USB >+ bool "USB Network Adapters" > depends on USB && NET >+ default y >+ >+if NETDEVICES_USB > > config USB_CATC > tristate "USB CATC NetMate-based Ethernet device support (EXPERIMENTAL)" >@@ -334,5 +335,4 @@ config USB_NET_ZAURUS > really need this non-conformant variant of CDC Ethernet (or in > some cases CDC MDLM) protocol, not "g_ether". > >- >-endmenu >+endif # NETDEVICES_USB ># > > Jan >-- Jan -- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] add kdump_after_notifier
On Thu, Aug 16, 2007 at 06:26:35PM +0900, Takenori Nagano wrote: > Vivek Goyal wrote: > > So for the time being I think we can put RAS tools on die notifier list > > and if it runs into issues we can always think of creating a separate list. > > > > Few things come to mind. > > > > - Why there is a separate panic_notifier_list? Can't it be merged with > > die_chain? die_val already got one of the event type as PANIC. If there > > are no specific reasons then we should merge the two lists. Registering > > RAS tools on a single list is easier. > > I think it is difficult, because die_chain is defined by each architecture. > I think die_chain is arch independent definition (kernel/die_notifier.c)? But anyway, to begin with it can be done only for panic_notifier. > > - Modify Kdump to register on die_chain list. > > - Modify Kdb to register on die_chain list. > > - Export all the registered members of die_chain through sysfs along with > > their priorities. Priorities should be modifiable. Most likely one > > shall have to introduce additional field in struct notifier_block. This > > field will be a string as an identifier of the user registerd. e.g > > "Kdump", "Kdb" etc. > > > > Now user will be able to view all the die_chain users through sysfs and > > be able to modify the order in which these should run by modifying their > > priority. Hence all the RAS tools can co-exist. > > This is my image of your proposal. > > - Print current order > > # cat /sys/class/misc/debug/panic_notifier_list > priority name > 1 IPMI > 2 watchdog > 3 Kdb > 4 Kdump > I think Bernhard's suggestion looks better here. I noticed that /sys/kernel/debug is already present. So how about following. /sys/kernel/debug/kdump/priority /sys/kernel/debug/kdb/priority /sys/kernel/debug/IPMI/priority I think at some point of time we shall have to create another file say description. /sys/kernel/debug/IPMI/description Which can tell what does this tool do? 
Otherwise a user might not have any clue how to prioritize various things. Thanks Vivek
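The registration-plus-adjustable-priority scheme being discussed can be modelled like this. All names and fields here are illustrative, not the real struct notifier_block layout; writing a new priority via sysfs would amount to a chain_set_priority() followed by a re-sort:

```c
#include <assert.h>
#include <string.h>

#define MAX_TOOLS 8

/* Sketch of a die-chain entry carrying the proposed identifier
 * string ("Kdump", "Kdb", "IPMI", ...) and a user-modifiable
 * priority.  Callbacks run in ascending priority order. */
struct ras_notifier {
	const char *name;
	int priority;		/* lower value runs first */
	void (*fn)(void);
};

static struct ras_notifier chain[MAX_TOOLS];
static int nr_tools;

static void chain_register(const char *name, int prio, void (*fn)(void))
{
	chain[nr_tools].name = name;
	chain[nr_tools].priority = prio;
	chain[nr_tools].fn = fn;
	nr_tools++;
}

/* simple insertion sort by priority; rerun after a priority write */
static void chain_sort(void)
{
	for (int i = 1; i < nr_tools; i++)
		for (int j = i; j > 0 && chain[j].priority < chain[j - 1].priority; j--) {
			struct ras_notifier tmp = chain[j];
			chain[j] = chain[j - 1];
			chain[j - 1] = tmp;
		}
}

/* what a write to /sys/kernel/debug/<tool>/priority would do */
static void chain_set_priority(const char *name, int prio)
{
	for (int i = 0; i < nr_tools; i++)
		if (strcmp(chain[i].name, name) == 0)
			chain[i].priority = prio;
	chain_sort();
}
```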
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
I wrote: > Nick Piggin wrote: >> You might find that these places that appear to need barriers are >> buggy for other reasons anyway. Can you point to some in-tree code >> we can have a look at? > > I could, or could not, if I were through with auditing the code. I > remembered one case and posted it (nodemgr_host_thread) which was safe > because msleep_interruptible provided the necessary barrier there, and > this implicit barrier is not in danger to be removed by future patches. PS, just in case anybody holds his breath for more example code from me, I don't plan to continue with an actual audit of the drivers I maintain. It's an important issue, but my current time budget will restrict me to look at it ad hoc, per case. (Open bugs have higher priority than potential bugs.) -- Stefan Richter -=-=-=== =--- =---= http://arcgraph.de/sr/
Re: [PATCH] rtc: Make rtc-ds1742 driver hotplug-aware
On Aug 17 2007 01:06, Atsushi Nemoto wrote: >Add an MODULE_ALIAS() to make this platform driver hotplug-aware. > >Signed-off-by: Atsushi Nemoto <[EMAIL PROTECTED]> >--- >diff --git a/drivers/rtc/rtc-ds1742.c b/drivers/rtc/rtc-ds1742.c >index b2e5481..4bd22dc 100644 >--- a/drivers/rtc/rtc-ds1742.c >+++ b/drivers/rtc/rtc-ds1742.c >@@ -273,3 +273,4 @@ MODULE_AUTHOR("Atsushi Nemoto <[EMAIL PROTECTED]>"); > MODULE_DESCRIPTION("Dallas DS1742 RTC driver"); > MODULE_LICENSE("GPL"); > MODULE_VERSION(DRV_VERSION); >+MODULE_ALIAS("ds1742"); Why exactly is this needed? What script refers to the module as ds1742 instead of rtc-ds1742? Regular hotplug (e.g. udev/modprobe) also goes by PCI ID or whatever is applicable and loads the module which provides support for said ID. Jan --
Re: [PATCH]: proc: export a processes resource limits via proc/
On Fri, Aug 17, 2007 at 04:09:26AM -0400, [EMAIL PROTECTED] wrote: > On Thu, 16 Aug 2007 08:35:38 EDT, Neil Horman said: > > Hey again- > > Andrew requested that I repost this cleanly, after running the patch > > through checkpatch. As requested here it is with the changelog. > > > > Currently, there exists no method for a process to query the resource > > limits of another process. They can be inferred via some mechanisms but > > they > > cannot be explicitly determined. Given that this information can be > > useful > to > > know during the debugging of an application, I've written this patch which > > exports all of a process's limits via /proc//limits. > > > > Tested successfully by myself on x86 on top of 2.6.23-rc2-mm1. > > I had only one comment the first time around, and Neil addressed it. > > I've also tested on x86_64 23-rc2-mm1, and it works here too. I saw where > this > uses units of 'bytes' while the shell 'ulimit' uses 1024-byte units in some > places, but (a) this lists the units and (b) it's consistent with setrlimit(). > Testing with values >4G shows it's 64-bit clean as well. > > One question: Are the units milliseconds, or seconds here: > > + [RLIMIT_CPU] = {"Max cpu time", "ms"}, > > Other than that, feel free to stick either/both of these on: > > Reviewed-By: Valdis Kletnieks <[EMAIL PROTECTED]> > Tested-By: Valdis Kletnieks <[EMAIL PROTECTED]> Oops, you're right, thanks. New patch Currently, there exists no method for a process to query the resource limits of another process. They can be inferred via some mechanisms but they cannot be explicitly determined. Given that this information can be useful to know during the debugging of an application, I've written this patch which exports all of a process's limits via /proc//limits. 
Thanks Neil Signed-off-by: Neil Horman <[EMAIL PROTECTED]> base.c | 69 + 1 file changed, 69 insertions(+) commit 95e91e3102d7963e63d68969b8db5694b83a6684 Author: Neil Horman <[EMAIL PROTECTED]> Date: Thu Aug 16 08:25:59 2007 -0400 Readding proc_pid_limits patch with checkpatch.pl run diff --git a/fs/proc/base.c b/fs/proc/base.c index ed2b224..4fb34a5 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -74,6 +74,7 @@ #include #include #include +#include #include "internal.h" /* NOTE: @@ -323,6 +324,72 @@ static int proc_oom_score(struct task_struct *task, char *buffer) return sprintf(buffer, "%lu\n", points); } +struct limit_names { + char *name; + char *unit; +}; + +static const struct limit_names lnames[RLIM_NLIMITS] = { + [RLIMIT_CPU] = {"Max cpu time", "seconds"}, + [RLIMIT_FSIZE] = {"Max file size", "bytes"}, + [RLIMIT_DATA] = {"Max data size", "bytes"}, + [RLIMIT_STACK] = {"Max stack size", "bytes"}, + [RLIMIT_CORE] = {"Max core file size", "bytes"}, + [RLIMIT_RSS] = {"Max resident set", "bytes"}, + [RLIMIT_NPROC] = {"Max processes", "processes"}, + [RLIMIT_NOFILE] = {"Max open files", "files"}, + [RLIMIT_MEMLOCK] = {"Max locked memory", "bytes"}, + [RLIMIT_AS] = {"Max address space", "bytes"}, + [RLIMIT_LOCKS] = {"Max file locks", "locks"}, + [RLIMIT_SIGPENDING] = {"Max pending signals", "signals"}, + [RLIMIT_MSGQUEUE] = {"Max msgqueue size", "bytes"}, + [RLIMIT_NICE] = {"Max nice priority", NULL}, + [RLIMIT_RTPRIO] = {"Max realtime priority", NULL}, +}; + +/* Display limits for a process */ +static int proc_pid_limits(struct task_struct *task, char *buffer) +{ + unsigned int i; + int count = 0; + char *bufptr = buffer; + + struct rlimit rlim[RLIM_NLIMITS]; + + read_lock(&tasklist_lock); + memcpy(rlim, task->signal->rlim, sizeof(struct rlimit) * RLIM_NLIMITS); + read_unlock(&tasklist_lock); + + /* +* print the file header +*/ + count += sprintf(&bufptr[count], "%-25s %-20s %-20s %-10s\n", + "Limit", "Soft Limit", "Hard Limit", "Units"); + + for (i = 0; i < 
RLIM_NLIMITS; i++) { + if (rlim[i].rlim_cur == RLIM_INFINITY) + count += sprintf(&bufptr[count], "%-25s %-20s ", +lnames[i].name, "unlimited"); + else + count += sprintf(&bufptr[count], "%-25s %-20lu ", +lnames[i].name, rlim[i].rlim_cur); + + if (rlim[i].rlim_max == RLIM_INFINITY) + count += sprintf(&bufptr[count], "%-20s ", "unlimited"); + else + count += sprintf(&bufptr[count], "%-20lu ", +rlim[i].rlim_max); + + if (lnames[i].unit) + count += sprintf(&bufptr[count], "%-10s\n", +lnames[i].unit); + else +
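The per-limit formatting logic in the patch can be exercised in user space like this. The helper below is an illustrative restatement of the sprintf sequence above (same "%-25s %-20s %-20s %-10s" column layout), with the infinity checks passed in as flags so it compiles outside the kernel:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* User-space restatement of one iteration of the proc_pid_limits()
 * loop above: name, soft limit, hard limit, optional unit, with
 * "unlimited" substituted when a limit is RLIM_INFINITY.  The
 * function name and the *_inf flags are ours, not the kernel's. */
static int format_limit_line(char *buf, const char *name,
			     unsigned long cur, unsigned long max,
			     int cur_inf, int max_inf, const char *unit)
{
	int count = 0;

	if (cur_inf)
		count += sprintf(&buf[count], "%-25s %-20s ", name, "unlimited");
	else
		count += sprintf(&buf[count], "%-25s %-20lu ", name, cur);

	if (max_inf)
		count += sprintf(&buf[count], "%-20s ", "unlimited");
	else
		count += sprintf(&buf[count], "%-20lu ", max);

	if (unit)
		count += sprintf(&buf[count], "%-10s\n", unit);
	else
		count += sprintf(&buf[count], "\n");

	return count;
}
```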
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Nick Piggin wrote: > Stefan Richter wrote: >> For architecture port authors, there is Documentation/atomic_ops.txt. >> Driver authors also can learn something from that document, as it >> indirectly documents the atomic_t and bitops APIs. > > "Semantics and Behavior of Atomic and Bitmask Operations" is > pretty direct :) "Indirect", "pretty direct"... It's subjective. (It is not API documentation; it is an implementation specification.) > Sure, it says that it's for arch maintainers, but there is no > reason why users can't make use of it. > > >> Prompted by this thread, I reread this document, and indeed, the >> sentence "Unlike the above routines, it is required that explicit memory >> barriers are performed before and after [atomic_{inc,dec}_return]" >> indicates that atomic_read (one of the "above routines") is very >> different from all other atomic_t accessors that return values. >> >> This is strange. Why is it that atomic_read stands out that way? IMO > > It is not just atomic_read of course. It is atomic_add,sub,inc,dec,set. Yes, but unlike these, atomic_read returns a value. Without me (the API user) providing extra barriers, that value may become something else whenever someone touches code in the vicinity of the atomic_read. >> this API imbalance is quite unexpected by many people. Wouldn't it be >> beneficial to change the atomic_read API to behave the same like all >> other atomic_t accessors that return values? > > It is very consistent and well defined. Operations which both modify > the data _and_ return something are defined to have full barriers > before and after. You are right, atomic_read is not only different from accessors that don't return values, it is also different from all other accessors that return values (because they all also modify the value). There is just no actual API documentation, which contributes to the issue that some people (or at least one: me) learn a little bit late how special atomic_read is. 
> What do you want to add to the other atomic accessors? Full memory > barriers? Only compiler barriers? It's quite likely that if you think > some barriers will fix bugs, then there are other bugs lurking there > anyway. A lot of different though related issues are discussed in this thread, but I personally am only occupied by one particular thing: What kind of return values do I get from atomic_read. > Just use spinlocks if you're not absolutely clear about potential > races and memory ordering issues -- they're pretty cheap and simple. Probably good advice, like generally if driver guys consider lockless algorithms. >> OK, it is also different from the other accessors that return data in so >> far as it doesn't modify the data. But as driver "author", i.e. user of >> the API, I can't see much use of an atomic_read that can be reordered >> and, more importantly, can be optimized away by the compiler. > > It will return to you an atomic snapshot of the data (loaded from > memory at some point since the last compiler barrier). All you have > to be aware of compiler barriers and the Linux SMP memory ordering > model, which should be a given if you are writing lockless code. OK, that's what I slowly realized during this discussion, and I appreciate the explanations that were given here. >> Sure, now >> that I learned of these properties I can start to audit code and insert >> barriers where I believe they are needed, but this simply means that >> almost all occurrences of atomic_read will get barriers (unless there >> already are implicit but more or less obvious barriers like msleep). > > You might find that these places that appear to need barriers are > buggy for other reasons anyway. Can you point to some in-tree code > we can have a look at? I could, or could not, if I were through with auditing the code. 
I remembered one case and posted it (nodemgr_host_thread) which was safe because msleep_interruptible provided the necessary barrier there, and this implicit barrier is not in danger of being removed by future patches. -- Stefan Richter -=-=-=== =--- =---= http://arcgraph.de/sr/
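The two atomic_read flavours being argued over can be stated concretely. This is a user-space sketch, with atomic_t standing in for the kernel type; the plain version models the non-volatile i386/x86-64 definition whose loads the compiler may cache or elide, while the volatile-cast version models the proposed alternative that forces a fresh load on every call:

```c
#include <assert.h>

/* Stand-in for the kernel's atomic_t for illustration. */
typedef struct { int counter; } atomic_t;

/* Plain read: the compiler may hoist this load out of a loop or
 * reuse a previously loaded value. */
static inline int atomic_read_plain(const atomic_t *v)
{
	return v->counter;
}

/* Volatile semantics: the cast forces a real load from memory on
 * every call, so the value cannot be elided by the compiler.  (This
 * is the style of access later standardized by the ACCESS_ONCE /
 * READ_ONCE helpers.) */
static inline int atomic_read_volatile(const atomic_t *v)
{
	return *(volatile const int *)&v->counter;
}
```

Neither variant implies any CPU memory ordering; both are purely about what the compiler is allowed to do with the load.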
Re: nanosleep() accuracy
GolovaSteek wrote: > 2007/8/17, Michal Schmidt <[EMAIL PROTECTED]>: > >> GolovaSteek skrev: >> >>> Hello! >>> I need to use sleep with accurate timing. >>> I use 2.6.21 with the rt-preempt patch, >>> with rt_preempt, dyn_ticks, and local_apic enabled. >>> But >>> >>> req.tv_nsec = 300000; >>> req.tv_sec = 0; >>> nanosleep(&req,NULL) >>> >>> makes a pause of around 310-330 microseconds. >>> >> How do you measure this? >> If you want to have something done every 300 microseconds, you must not >> sleep for 300 microseconds in each iteration, because you'd accumulate >> errors. Use a periodic timer or use the current time to compute how long >> to sleep in each iteration. Take a look how cyclictest does it. >> > > no. I just want my program to go to sleep sometimes and wake up at the correct time. > What does your program do that it has such a strict requirement on the exact length of sleeping? >>> I tried to understand how nanosleep() works, but it does not depend on >>> jiffies or on smp_apic_timer_interrupt. >>> >>> Where can accuracy be lost? >>> And how are processes woken up? >>> >>> >>> GolovaSteek >>> >> Don't forget the process will always have non-zero wakeup latency. It >> takes some time to process an interrupt, wake up the process and schedule >> it to run on the CPU. 10-30 microseconds is not unreasonable. >> > > But can 2 operations be done in 10 microseconds? > And why is there that inconsistency? Why sometimes 10 and sometimes 30? > At which points of the implementation does it happen? > > GolovaSteek > If a jitter of 20 microseconds is unacceptable for your application, don't use PC hardware. Consider using a microcontroller. Michal
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Nick Piggin wrote: > Satyam Sharma wrote: >> And we have driver / subsystem maintainers such as Stefan >> coming up and admitting that often a lot of code that's written to use >> atomic_read() does assume the read will not be elided by the compiler. > > So these are broken on i386 and x86-64? The ieee1394 and firewire subsystems have open, undiagnosed bugs, also on i386 and x86-64. But whether there is any bug because of wrong assumptions about atomic_read among them, I don't know. I don't know which assumptions the authors made, I only know that I wasn't aware of all the properties of atomic_read until now. > Are they definitely safe on SMP and weakly ordered machines with > just a simple compiler barrier there? Because I would not be > surprised if there are a lot of developers who don't really know > what to assume when it comes to memory ordering issues. > > This is not a dig at driver writers: we still have memory ordering > problems in the VM too (and probably most of the subtle bugs in > lockless VM code are memory ordering ones). Let's not make up a > false sense of security and hope that sprinkling volatile around > will allow people to write bug-free lockless code. If a writer > can't be bothered reading API documentation ...or, if there is none, the implementation specification (as in case of the atomic ops), or, if there is none, the implementation (as in case of a some infrastructure code here and there)... > and learning the Linux memory model, they can still be productive > writing safely locked code. Provided they are aware that they might not have the full picture of the lockless primitives. :-) -- Stefan Richter -=-=-=== =--- =---= http://arcgraph.de/sr/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Unable to handle kernel NULL pointer dereference at 0000000000000001 RIP
Hello, our SLES production system 2.6.5-7.252-smp #1 SMP Tue Feb 14 11:11:04 UTC 2006 x86_64 x86_64 x86_64 stopped with the following messages in the log Is this a known bug in this version and is it solved in the next versions. As I wrote it is a production server so I can't change away from reiserfs. Thanks Andreas Moroder Aug 16 22:00:07 falcon1 su: pam_unix2: session started for user oracle, service su Aug 16 22:00:37 falcon1 kernel: Unable to handle kernel NULL pointer dereference at 0001 RIP: Aug 16 22:00:37 falcon1 kernel: {:reiserfs:reiserfs_clear_inode+22} Aug 16 22:00:37 falcon1 kernel: PML4 12c07a067 PGD 103721067 PMD 0 Aug 16 22:00:37 falcon1 kernel: Oops: 0002 [1] SMP Aug 16 22:00:37 falcon1 kernel: CPU 2 Aug 16 22:00:37 falcon1 kernel: Pid: 224, comm: kswapd0 Tainted: P UM (2.6.5-7.252-smp SLES9_SP3_BRANCH-2006021404) Aug 16 22:00:37 falcon1 kernel: RIP: 0010:[] {:reiserfs:reiserfs_clear_inode+22} Aug 16 22:00:37 falcon1 kernel: RSP: :01013fdfdc08 EFLAGS: 00010213 Aug 16 22:00:37 falcon1 kernel: RAX: a00b3d10 RBX: 0100b8fd49f8 RCX: 8053fa80 Aug 16 22:00:37 falcon1 kernel: RDX: 01013fdfdc78 RSI: 0100b8fd4680 RDI: 0001 Aug 16 22:00:37 falcon1 kernel: RBP: 0100b8fd49f8 R08: 01000a75c810 R09: 0001 Aug 16 22:00:37 falcon1 kernel: R10: R11: fb11 R12: 01013fdfdc78 Aug 16 22:00:37 falcon1 kernel: R13: 0031 R14: 803e6b20 R15: 01013fdfdc78 Aug 16 22:00:37 falcon1 kernel: FS: () GS:8057d000() knlGS: Aug 16 22:00:37 falcon1 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b Aug 16 22:00:37 falcon1 kernel: CR2: 0001 CR3: 0a6b1000 CR4: 06e0 Aug 16 22:00:37 falcon1 kernel: Process kswapd0 (pid: 224, threadinfo 01013fdfc000, task 01013fe2aa40) Aug 16 22:00:37 falcon1 kernel: Stack: 0100b8fd49f8 801aa7eb 0100b8fd4a08 801aa898 Aug 16 22:00:37 falcon1 kernel:010005147530 010046a47d08 010046a47cf8 0080 Aug 16 22:00:37 falcon1 kernel:0080 801aab17 Aug 16 22:00:37 falcon1 kernel: Call Trace:{clear_inode+155} {dispose_list+104} Aug 16 22:00:37 falcon1 kernel: 
{shrink_icache_memory+471} {shrink_slab+245} Aug 16 22:00:37 falcon1 kernel: {balance_pgdat+477} {kswapd+390} Aug 16 22:00:37 falcon1 kernel:{do_exit+3505} {autoremove_wake_function+0} Aug 16 22:00:37 falcon1 kernel: {autoremove_wake_function+0} {child_rip+8} Aug 16 22:00:37 falcon1 kernel:{kswapd+0} {child_rip+0} Aug 16 22:00:37 falcon1 kernel: Aug 16 22:00:37 falcon1 kernel: Aug 16 22:00:37 falcon1 kernel: Code: f0 ff 0f 0f 94 c0 84 c0 74 05 e8 0b a4 0b e0 48 8b 7b e0 48 Aug 16 22:00:37 falcon1 kernel: RIP {:reiserfs:reiserfs_clear_inode+22} RSP <01013fdfdc08> Aug 16 22:00:37 falcon1 kernel: CR2: 0001 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Unable to handle kernel NULL pointer dereference at 0000000000000001 RIP
On Fri, 17 Aug 2007, Andreas Moroder wrote: > our SLES production system 2.6.5-7.252-smp #1 SMP Tue Feb 14 11:11:04 > UTC 2006 x86_64 x86_64 x86_64 stopped with the following messages in the > log Is this a known bug in this version and is it solved in the next > versions. Hi, this is a vendor-provided kernel, so this is pretty much offtopic on lkml -- please file a bug report in Novell bugzilla, which is the proper place for this. Thanks, -- Jiri Kosina
Re: [2.6.23-rc3] NFSv4 client oops
On Wed, 2007-08-15 at 18:08 -0400, Jeff Garzik wrote: > While upgrading nfs-utils on my NFSv4 file server (F7 x86-64 > 2.6.23-rc3), I got an oops on the NFSv4 client (FC6 x86-64 2.6.23-rc3), > and communications stopped. > > I rebooted the client, and everything was fine again. Then, on the > server, I removed an extra nfs-utils package left over from FC6, causing > the nfs daemons to be restarted. Another oops. > > So it seems like whatever Fedora does during an nfs-utils upgrade __on > the server__ is causing a remote oops __on the client__. > > NFSv4 client kernel config also attached. > > Never seen this problem before, but then again, never did a remote OS > upgrade of the fileserver while running NFSv4 before. > > Jeff > > > plain text document attachment (log.txt) > Aug 15 15:44:23 core kernel: nfs: server pretzel not responding, timed out > Aug 15 15:44:23 core last message repeated 8 times > Aug 15 15:44:23 core kernel: [ cut here ] > Aug 15 15:44:23 core kernel: kernel BUG at fs/nfs/nfs4xdr.c:1040! 
> Aug 15 15:44:23 core kernel: invalid opcode: [1] SMP > Aug 15 15:44:23 core kernel: CPU 0 > Aug 15 15:44:23 core kernel: Modules linked in: nfs lockd sunrpc af_packet > ipv6 cpufreq_ondemand acpi_cpufreq battery floppy nvram sg snd_hda_intel > ata_generic snd_pcm_oss snd_mixer_oss snd_pcm i2c_i801 snd_page_alloc e1000 > firewire_ohci ata_piix i2c_core sr_mod cdrom sata_sil ahci libata sd_mod > scsi_mod ext3 jbd ehci_hcd uhci_hcd > Aug 15 15:44:23 core kernel: Pid: 16353, comm: 10.10.10.1-recl Not tainted > 2.6.23-rc3 #1 > Aug 15 15:44:23 core kernel: RIP: 0010:[] > [] :nfs:encode_open+0x1c0/0x330 > Aug 15 15:44:23 core kernel: RSP: 0018:8100467c5c60 EFLAGS: 00010202 > Aug 15 15:44:23 core kernel: RAX: 81000f89b8b8 RBX: 697a6f6d RCX: > 81000f89b8b8 > Aug 15 15:44:23 core kernel: RDX: 0004 RSI: 0004 RDI: > 8100467c5c80 > Aug 15 15:44:23 core kernel: RBP: 8100467c5c80 R08: 81000f89bc30 R09: > 81000f89b83f > Aug 15 15:44:23 core kernel: R10: 0001 R11: 881e79e0 R12: > 81003cbd1808 > Aug 15 15:44:23 core kernel: R13: 81000f89b860 R14: 81005fc984e0 R15: > 88240af0 > Aug 15 15:44:23 core kernel: FS: () > GS:8052a000() knlGS: > Aug 15 15:44:23 core kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b > Aug 15 15:44:23 core kernel: CR2: 2adb9e51a030 CR3: 7ea7e000 CR4: > 06e0 > Aug 15 15:44:23 core kernel: DR0: DR1: DR2: > > Aug 15 15:44:23 core kernel: DR3: DR6: 0ff0 DR7: > 0400 > Aug 15 15:44:23 core kernel: Process 10.10.10.1-recl (pid: 16353, threadinfo > 8100467c4000, task 8100038ce780) > Aug 15 15:44:23 core kernel: Stack: 81004aeb6a40 81003cbd1808 > 81003cbd1808 88240b5d > Aug 15 15:44:23 core kernel: 81000f89b8bc 81005fc984e8 > 81000f89bc30 81005fc984e8 > Aug 15 15:44:23 core kernel: 0003 > 81003cbd1800 > Aug 15 15:44:23 core kernel: Call Trace: > Aug 15 15:44:23 core kernel: [] > :nfs:nfs4_xdr_enc_open_noattr+0x6d/0x90 > Aug 15 15:44:23 core kernel: [] > :sunrpc:rpcauth_wrap_req+0x97/0xf0 > Aug 15 15:44:23 core kernel: [] > :nfs:nfs4_xdr_enc_open_noattr+0x0/0x90 > Aug 
15 15:44:23 core kernel: [] > :sunrpc:call_transmit+0x18a/0x290 > Aug 15 15:44:23 core kernel: [] > :sunrpc:__rpc_execute+0x6b/0x290 > Aug 15 15:44:23 core kernel: [] > :sunrpc:rpc_do_run_task+0x76/0xd0 > Aug 15 15:44:23 core kernel: [] > :nfs:_nfs4_proc_open+0x76/0x230 > Aug 15 15:44:23 core kernel: [] > :nfs:nfs4_open_recover_helper+0x5e/0xc0 > Aug 15 15:44:23 core kernel: [] > :nfs:nfs4_open_recover+0xe4/0x120 > Aug 15 15:44:23 core kernel: [] > :nfs:nfs4_open_reclaim+0xa4/0xf0 > Aug 15 15:44:23 core kernel: [] > :nfs:nfs4_reclaim_open_state+0x55/0x1b0 > Aug 15 15:44:23 core kernel: [] :nfs:reclaimer+0x2ca/0x390 > Aug 15 15:44:23 core kernel: [] :nfs:reclaimer+0x0/0x390 > Aug 15 15:44:23 core kernel: [] kthread+0x4b/0x80 > Aug 15 15:44:23 core kernel: [] child_rip+0xa/0x12 > Aug 15 15:44:23 core kernel: [] kthread+0x0/0x80 > Aug 15 15:44:23 core kernel: [] child_rip+0x0/0x12 > Aug 15 15:44:23 core kernel: > Aug 15 15:44:23 core kernel: > Aug 15 15:44:23 core kernel: Code: 0f 0b eb fe 48 89 ef c7 00 00 00 00 02 be > 08 00 00 00 e8 79 > Aug 15 15:44:23 core kernel: RIP [] > :nfs:encode_open+0x1c0/0x330 > Aug 15 15:44:23 core kernel: RSP > Aug 15 15:44:23 core kernel: nfs: server pretzel OK > Aug 15 15:45:02 core shutdown[16390]: shutting down for system reboot > > [...] OK. That probably indicates an error in the calculation of NFS4_enc_open_noattr_sz (it that the resulting value is too small). I'll have a look this weekend... Cheers Trond - To unsubscribe from thi
Re: [PATCH/RFC 4/4, second shot]KVM uses "account_guest_time()"
KVM updates vtime in task_struct to allow account_guest_time() to modify user, system and guest time in cpustat accordingly.

Index: kvm/drivers/kvm/Kconfig
===
--- kvm.orig/drivers/kvm/Kconfig	2007-08-17 10:24:46.0 +0200
+++ kvm/drivers/kvm/Kconfig	2007-08-17 10:25:25.0 +0200
@@ -41,4 +41,10 @@
 	  Provides support for KVM on AMD processors equipped with the AMD-V
 	  (SVM) extensions.
 
+config GUEST_ACCOUNTING
+	bool "Virtual Machine accounting support"
+	depends on KVM
+	---help---
+	  Allows to account CPU time used by the Virtual Machines.
+
 endif # VIRTUALIZATION

Index: kvm/drivers/kvm/kvm.h
===
--- kvm.orig/drivers/kvm/kvm.h	2007-08-17 10:19:30.0 +0200
+++ kvm/drivers/kvm/kvm.h	2007-08-17 10:21:56.0 +0200
@@ -589,6 +589,19 @@
 int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run);
 
+#ifdef CONFIG_GUEST_ACCOUNTING
+static inline ktime_t kvm_guest_enter(void)
+{
+	return ktime_get();
+}
+
+static inline void kvm_guest_exit(ktime_t enter)
+{
+	ktime_t delta = ktime_sub(ktime_get(), enter);
+	current->vtime = ktime_add(current->vtime, delta);
+}
+#endif
+
 static inline int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t gva, u32 error_code)
 {

Index: kvm/drivers/kvm/svm.c
===
--- kvm.orig/drivers/kvm/svm.c	2007-08-17 10:22:07.0 +0200
+++ kvm/drivers/kvm/svm.c	2007-08-17 10:23:32.0 +0200
@@ -1392,6 +1392,9 @@
 	u16 gs_selector;
 	u16 ldt_selector;
 	int r;
+#ifdef CONFIG_GUEST_ACCOUNTING
+	ktime_t now;
+#endif
 
 again:
 	r = kvm_mmu_reload(vcpu);
@@ -1404,6 +1407,9 @@
 	clgi();
 
 	vcpu->guest_mode = 1;
+#ifdef CONFIG_GUEST_ACCOUNTING
+	now = kvm_guest_enter();
+#endif
 	if (vcpu->requests)
 		if (test_and_clear_bit(KVM_TLB_FLUSH, &vcpu->requests))
 			svm_flush_tlb(vcpu);
@@ -1536,6 +1542,9 @@
 #endif
 		: "cc", "memory" );
 
+#ifdef CONFIG_GUEST_ACCOUNTING
+	kvm_guest_exit(now);
+#endif
 	vcpu->guest_mode = 0;
 
 	if (vcpu->fpu_active) {

Index: kvm/drivers/kvm/vmx.c
===
--- kvm.orig/drivers/kvm/vmx.c	2007-08-17 10:23:36.0 +0200
+++ kvm/drivers/kvm/vmx.c	2007-08-17 10:24:37.0 +0200
@@ -2052,6 +2052,9 @@
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u8 fail;
 	int r;
+#ifdef CONFIG_GUEST_ACCOUNTING
+	ktime_t now;
+#endif
 
 preempted:
 	if (vcpu->guest_debug.enabled)
@@ -2078,6 +2081,9 @@
 	local_irq_disable();
 
 	vcpu->guest_mode = 1;
+#ifdef CONFIG_GUEST_ACCOUNTING
+	now = kvm_guest_enter();
+#endif
 	if (vcpu->requests)
 		if (test_and_clear_bit(KVM_TLB_FLUSH, &vcpu->requests))
 			vmx_flush_tlb(vcpu);
@@ -2198,6 +2204,9 @@
 	      [cr2]"i"(offsetof(struct kvm_vcpu, cr2))
 	      : "cc", "memory" );
 
+#ifdef CONFIG_GUEST_ACCOUNTING
+	kvm_guest_exit(now);
+#endif
 	vcpu->guest_mode = 0;
 	local_irq_enable();
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Satyam Sharma wrote:
On Fri, 17 Aug 2007, Nick Piggin wrote:
Satyam Sharma wrote:

It is very obvious. msleep calls schedule() (ie. sleeps), which is always a barrier.

Probably you didn't mean that, but no, schedule() is not a barrier because it sleeps. It's a barrier because it's invisible.

Where did I say it is a barrier because it sleeps? It is always a barrier because, at the lowest level, schedule() (and thus anything that sleeps) is defined to always be a barrier. Regardless of whatever obscure means the compiler might need to infer the barrier. In other words, you can ignore those obscure details because schedule() is always going to have an explicit barrier in it. The "unobvious" thing is that you wanted to know how the compiler knows a function is a barrier -- answer is that if it does not *know* it is not a barrier, it must assume it is a barrier.

True, that's clearly what happens here. But you're definitely joking that this is "obvious" in terms of code-clarity, right?

No. If you accept that barrier() is implemented correctly, and you know that sleeping is defined to be a barrier, then it's perfectly clear. You don't have to know how the compiler "knows" that some function contains a barrier.

Just 5 minutes back you mentioned elsewhere you like seeing lots of explicit calls to barrier() (with comments, no less, hmm? :-)

Sure, but there are well known primitives which contain barriers, and trivial recognisable code sequences for which you don't need comments. waiting-loops using sleeps or cpu_relax() are prime examples.

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] NOMMU: Separate out VMAs
David Howells wrote: David Howells <[EMAIL PROTECTED]> wrote: Yes. I found the major leak this morning. There may be a minor leak, but I'm not convinced it's in the mmap stuff. See revised patch. Oops. That was the old patch. Try this one instead. Here are some changes to make it more stable on my system. In do_mmap_private, I've commented out the logic to free excess pages, as it fragments terribly and causes a simple while true; do cat /proc/buddyinfo; done loop to go oom. Also, I think you're freeing high-order pages unaligned to their order? In shrink_vma, we must save the mm across calls to remove_vma_from_mm (oops when telnetting into the box). In do_munmap, we can deal with freeing more than one vma. I've not touched the rb-tree logic in the shared file case, as I have no idea what it's trying to do given that only exact matches are allowed. It still does not survive my mmap stress-tester, so I'll keep looking. Why do we need vm_regions for anonymous memory? Wouldn't it be enough to just have a VMA? Bernd -- This footer brought to you by insane German lawmakers. Analog Devices GmbH Wilhelm-Wagenfeld-Str. 6 80807 Muenchen Sitz der Gesellschaft Muenchen, Registergericht Muenchen HRB 40368 Geschaeftsfuehrer Thomas Wessel, William A. 
Martin, Margaret Seif

diff --width=180 -x '*~' -x '.*' -x 'Sys*' -dru linux-2.6.x-daves-baseline/mm/nommu.c linux-2.6.x/mm/nommu.c
--- linux-2.6.x-daves-baseline/mm/nommu.c	2007-08-14 21:20:43.0 +0200
+++ linux-2.6.x/mm/nommu.c	2007-08-16 19:11:29.0 +0200
@@ -922,19 +931,23 @@
 	total = 1 << order;
 	atomic_add(total, &mmap_pages_allocated);
 
-	point = len >> PAGE_SHIFT;
-	while (point < total) {
-		order = ilog2(total - point);
-		_debug("shave %u/%lu", 1 << order, total - point);
-		atomic_sub(1 << order, &mmap_pages_allocated);
-		__free_pages(pages + point, order);
-		point += 1 << order;
+	len = PAGE_SIZE << order;
+#if 0
+	while (total > (len >> PAGE_SHIFT)) {
+		unsigned long to_free;
+		order = ilog2(total - (len >> PAGE_SHIFT));
+		to_free = 1 << order;
+		_debug("shave %u/%lu", to_free, total - (len >> PAGE_SHIFT));
+		atomic_sub(to_free, &mmap_pages_allocated);
+		__free_pages(pages + total - to_free, order);
+		total -= to_free;
 	}
 
 	total = len >> PAGE_SHIFT;
 	for (point = 1; point < total; point++)
 		set_page_refcounted(&pages[point]);
-
+#endif
+	split_page(pages, order);
 	base = page_address(pages);
 	region->vm_start = vma->vm_start = (unsigned long) base;
 	region->vm_end = vma->vm_end = vma->vm_start + len;
@@ -1294,6 +1307,7 @@
 			unsigned long from, unsigned long to)
 {
 	struct vm_region *region;
+	struct mm_struct *mm = vma->vm_mm;
 
 	_enter("");
 
@@ -1304,7 +1318,7 @@
 		vma->vm_end = from;
 	else
 		vma->vm_start = to;
-	add_vma_to_mm(vma->vm_mm, vma);
+	add_vma_to_mm(mm, vma);
 
 	/* cut the region down to size */
 	region = vma->vm_region;
@@ -1337,44 +1351,59 @@
 	_enter(",%lx,%zx", start, len);
 
-	/* find the first potentially overlapping VMA */
-	vma = find_vma(mm, start);
-	if (!vma) {
-		_leave(" = -EINVAL [no]");
-		return -EINVAL;
-	}
-
-	/* we're allowed to split an anonymous VMA but not a file-backed one */
-	if (vma->vm_file) {
-		do {
-			if (start > vma->vm_start) {
-				_leave(" = -EINVAL [miss]");
-				return -EINVAL;
-			}
-			if (end == vma->vm_end)
-				goto erase_whole_vma;
-			rb = rb_next(&vma->vm_rb);
-			vma = rb_entry(rb, struct vm_area_struct, vm_rb);
-		} while (rb);
-		_leave(" = -EINVAL [split file]");
-		return -EINVAL;
-	} else {
-		/* the region must be a subset of the VMA found */
-		if (start == vma->vm_start && end == vma->vm_end)
-			goto erase_whole_vma;
-		if (start < vma->vm_start || end > vma->vm_end) {
+	while (start < end) {
+		unsigned long this_end;
+
+		/* find the first potentially overlapping VMA */
+		vma = find_vma(mm, start);
+		if (!vma) {
+			_leave(" = -EINVAL [no]");
+			return -EINVAL;
+		}
+		if (start < vma->vm_start) {
 			_leave(" = -EINVAL [superset]");
 			return -EINVAL;
 		}
-		if (start != vma->vm_start && end != vma->vm_end
[PATCH/RFC 3/4, second shot]Introduce "account_guest_time"
This is another way to compute guest time... I remove the "account modifiers" mechanism and call account_guest_time() directly from account_system_time(). account_system_time() computes user, system and guest times according to the value accumulated in vtime (a ktime_t) in task_struct by the virtual machine.

--
- [EMAIL PROTECTED]
--
"Software is hard" - Donald Knuth

Index: kvm/include/linux/sched.h
===
--- kvm.orig/include/linux/sched.h	2007-08-17 10:18:53.0 +0200
+++ kvm/include/linux/sched.h	2007-08-17 12:33:22.0 +0200
@@ -1192,6 +1192,9 @@
 #ifdef CONFIG_FAULT_INJECTION
 	int make_it_fail;
 #endif
+#ifdef CONFIG_GUEST_ACCOUNTING
+	ktime_t vtime;
+#endif
 };
 
 /*

Index: kvm/kernel/sched.c
===
--- kvm.orig/kernel/sched.c	2007-08-17 10:18:53.0 +0200
+++ kvm/kernel/sched.c	2007-08-17 12:33:07.0 +0200
@@ -3233,6 +3233,37 @@
 	cpustat->user = cputime64_add(cpustat->user, tmp);
 }
 
+#ifdef CONFIG_GUEST_ACCOUNTING
+/*
+ * Account guest time to a process
+ * @p: the process that the cpu time gets accounted to
+ * @cputime: the cpu time spent in kernel space since the last update
+ */
+static cputime_t account_guest_time(struct task_struct *p, cputime_t cputime)
+{
+	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+	ktime_t kmsec = ktime_set(0, NSEC_PER_MSEC);
+	cputime_t cmsec = msecs_to_cputime(1);
+
+	while ((ktime_to_ns(p->vtime) >= NSEC_PER_MSEC) &&
+	       (cputime_to_msecs(cputime) >= 1)) {
+		p->vtime = ktime_sub(p->vtime, kmsec);
+		p->utime = cputime_add(p->utime, cmsec);
+		p->gtime = cputime_add(p->gtime, cmsec);
+
+		cpustat->guest = cputime64_add(cpustat->guest,
+					       cputime_to_cputime64(cmsec));
+		cpustat->user = cputime64_add(cpustat->user,
+					      cputime_to_cputime64(cmsec));
+
+		cputime = cputime_sub(cputime, cmsec);
+	}
+
+	return cputime;
+}
+#endif
+
 /*
  * Account system cpu time to a process.
  * @p: the process that the cpu time gets accounted to
@@ -3246,6 +3277,10 @@
 	struct rq *rq = this_rq();
 	cputime64_t tmp;
 
+#ifdef CONFIG_GUEST_ACCOUNTING
+	cputime = account_guest_time(p, cputime);
+#endif
+
 	p->stime = cputime_add(p->stime, cputime);
 
 	/* Add system time to cpustat. */
Re: [PATCH] Make checkpatch rant about trailing ; at the end of "if" expr
On Aug 16 2007 10:21, Andy Whitcroft wrote:
>> +		if ($line =~ /\bif\s*\([^\)]*\)\s*\;/) {
>
>Heh, you are the second person to suggest this check today, do I detect
>some ripped out hair due to one of these!
>
>I've taken this idea and expanded it to cover if, for and while which
>can all suffer from this. Using the relative indent to work out which
>are valid combinations:

But. The above regex does not seem to handle

	if ((a = b));
		oops;

I have tried to come up with a superduper regex that handles multiple (), but my regex fu seems to stop above two pairs of ().

#!/usr/bin/perl
@check = (
	q"if ((ptr = malloc(bong, GFP)) == NULL) ; (oopsie) ;",
	q"if ((ptr = malloc(bong, GFP)) == NULL)(alright);",
	q"if ( ({ bool evil = (true); evil; }) ) ; (oopsie) ;",
	q"if ( ({ bool evil = (true); evil; }) ) (alright);",
	q"if((()));",
);
my $r = qr/\s*\(\s*(??{$r})?\s*\)\s*|\s*\(\s*[^()]+\s*\)\s*/;
foreach (@check) {
	if ($_ =~ /(if|for|while)$r;/) {
		print "ok $_\n";
	}
}

Jan
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Satyam Sharma wrote:
On Fri, 17 Aug 2007, Nick Piggin wrote:

I think they would both be equally ugly,

You think both these are equivalent in terms of "looks":

	while (!atomic_read(&v)) {		|	while (!atomic_read_xxx(&v)) {
		...				|		...
		cpu_relax_no_barrier();		|		cpu_relax_no_barrier();
		order_atomic(&v);		|	}
	}					|

(where order_atomic() is an atomic_t specific wrapper as you mentioned below) ?

I think the LHS is better if your atomic_read_xxx primitive is using the crazy one-sided barrier, because with the LHS code you immediately know what barriers are happening, and with the RHS you have to look at the atomic_read_xxx definition. If your atomic_read_xxx implementation was more intuitive, then both are pretty well equal. More lines != ugly code.

but the atomic_read_volatile variant would be more prone to subtle bugs because of the weird implementation.

What bugs?

You can't think for yourself? Your atomic_read_volatile contains a compiler barrier to the atomic variable before the load. 2 such reads from different locations look like this:

	asm volatile("" : "+m" (v1));
	atomic_read(&v1);
	asm volatile("" : "+m" (v2));
	atomic_read(&v2);

Which implies that the load of v1 can be reordered to occur after the load of v2. Bet you didn't expect that?

Secondly, what sort of code would do such a thing?

See the nodemgr_host_thread() that does something similar, though not exactly the same.

I'm sorry, all this waffling about made up code which might do this and that is just a waste of time. Seriously, the thread is bloated enough and never going to get anywhere with all this handwaving. If someone is saving up all the really real and actually good arguments for why we must have a volatile here, now is the time to use them.

and have barriers both before and after the memory operation,

How could that lead to bugs? (if you can point to existing code, but just some testcase / sample code would be fine as well).

See above. As I said, barrier() is too heavy-handed.

Typo.
I meant: defined for a single memory location (ie. order(x)).
Re: RFHelp: Splitting MAINTAINERS into maintainers/* and Makefile/Kconfig support
On Fri, 2007-08-17 at 02:25 -0400, Chris Snook wrote:
> The whole point of MAINTAINERS is to have one central repository for this
> information, instead of scattering it throughout the various source files. If
> that file is getting too unwieldy (and I don't think it is) then I could
> understand splitting it up hierarchically, for example having a
> drivers/net/MAINTAINERS that listed the info for all the net drivers.

The individual MAINTAINERS files eliminate what Linus described as their "hotness". No shared updates by multiple parties.

> What you're suggesting is a less efficient equivalent to putting the info
> directly into the source files.

I believe that's wrong. Maintainer patterns frequently look like:

F: arch/i386/kernel/cpu/cpufreq/
F: drivers/cpufreq/
F: include/linux/cpufreq.h

If in source, this would currently require 21 + 12 + 1 modifications instead of 1.

> If that approach was enough to make people
> happy, we wouldn't have MAINTAINERS to begin with.

I think the insertion of maintainers into source itself is wrong. It's freeform, error prone and requires significant modifications to source files as maintainers come and go.

> Perhaps with a little automation it could be revived,

Which is the help I'm looking for. Can someone please help here on ideas or implementation adding a makefile target for MAINTAINERS from files in a specific subdirectory?

> though I think that adding a path pattern
> removes the need, while keeping it easier to parse by scripts.
> I appreciate the effort to make MAINTAINERS more useful,
> but please don't add another 600 files to the tree.

In the distributed form, you'll still end up with ~400 new files spread all over the tree. Either way, you'll have hundreds of files.
$ grep "^F:" MAINTAINERS | sed -e "s/[A-Za-z0-9\_\*\.\-]*$//" | sort | uniq | wc -l
415

Centralized as maintainers/* or distributed as ../../../Maintainers. Pick one, I don't much care, but I'm still looking for Makefile/KConfig help reassembling it into a single MAINTAINERS block similar to the current form. Help?

cheers, Joe
Re: Corrupted filesystem with new Firewire stack
> "Stefan" == Stefan Richter <[EMAIL PROTECTED]> writes:

Stefan> There were some similar reports involving that "status write
Stefan> for unknown orb". I haven't found a way to reproduce it; I
Stefan> noticed it only once in the logs here so far.

I get those all the time. Just do heavy ext3 I/O to the drive. Happens here on both a G4 and an intel Mini. Both running FC7.

Aug 17 08:24:08 mini kernel: firewire_sbp2: status write for unknown orb
Aug 17 08:25:08 mini kernel: firewire_sbp2: sbp2_scsi_abort
Aug 17 08:26:36 mini kernel: firewire_sbp2: status write for unknown orb
Aug 17 08:27:36 mini kernel: firewire_sbp2: sbp2_scsi_abort
Aug 17 08:33:51 mini kernel: firewire_sbp2: status write for unknown orb
Aug 17 08:34:51 mini kernel: firewire_sbp2: sbp2_scsi_abort

Lacie drive in both cases.

--
Martin K. Petersen	http://mkp.net/
Re: [PATCH 000 of 6] A few block-layer tidy-up patches.
On Fri, Aug 17 2007, Neil Brown wrote:
> On Friday August 17, [EMAIL PROTECTED] wrote:
> > > > Please inspect the #block-2.6.24 branch to see the result.
> > >
> > > I don't know where to look for this. I checked
> > > http://git.kernel.org/?p=linux/kernel/git/axboe/linux-2.6-block.git
> > > but they don't seem to be there.
> > > ??
> >
> > That's where it is, but the kernel.org mirroring is just horrible slow.
> > Just checked now and it's there.
>
> I discovered I was looking in the wrong place - not being very
> familiar with git terminology.
> I found them and it looks right.

Ah, good.

> I had a bit of a look at removing bio_data and ->buffer ... the
> usages outside of drivers/ide are fairly easy to deal with - I might
> send a patch for that. The drivers/ide stuff looks like a fairly
> substantial rewrite is in order.
> e.g. idefloppy_packet_command_s seems to duplicate a lot of
> fields from 'struct request', and it should probably use the request
> struct directly.

We can do it bits at a time, the trivial ones first. I have some experience with drivers/ide/ myself, so I can attempt to tackle some of that myself.

> But a number of times ->buffer points to ->cmd, and there is no bio.
> I guess we should use bio_map_kern to make a bio?

It points to ->cmd?! But yes, generally things just need to be converted to map the data to a bio with bio_map_kern() (or bio_map_user(), where appropriate).

> I'll see if I can come up with something testing it might be
> awkward. I have an ide cdrom I can test on. Maybe an ide disk, but
> not an ide floppy :-)

I can help with that as well :-)

--
Jens Axboe
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Satyam Sharma wrote: On Fri, 17 Aug 2007, Nick Piggin wrote: Because they should be thinking about them in terms of barriers, over which the compiler / CPU is not to reorder accesses or cache memory operations, rather than "special" "volatile" accesses. This is obviously just a taste thing. Whether to have that forget(x) barrier as something author should explicitly sprinkle appropriately in appropriate places in the code by himself or use a primitive that includes it itself. That's not obviously just taste to me. Not when the primitive has many (perhaps, the majority) of uses that do not require said barriers. And this is not solely about the code generation (which, as Paul says, is relatively minor even on x86). I prefer people to think explicitly about barriers in their lockless code. I'm not saying "taste matters aren't important" (they are), but I'm really skeptical if most folks would find the former tasteful. So I /do/ have better taste than most folks? Thanks! :-) And by the way, the point is *also* about the fact that cpu_relax(), as of today, implies a full memory clobber, which is not what a lot of such loops want. (due to stuff mentioned elsewhere, summarized in that summary) That's not the point, That's definitely the point, why not. This is why "barrier()", being heavy-handed, is not the best option. That is _not_ the point (of why a volatile atomic_read is good) because there has already been an alternative posted that better conforms with Linux barrier API and is much more widely useful and more usable. If you are so worried about barrier() being too heavyweight, then you're off to a poor start by wanting to add a few K of kernel text by making atomic_read volatile. because as I also mentioned, the logical extention to Linux's barrier API to handle this is the order(x) macro. Again, not special volatile accessors. Sure, that forget(x) macro _is_ proposed to be made part of the generic API. 
Doesn't explain why not to define/use primitives that have volatility semantics in themselves, though (taste matters apart).

If you follow the discussion... You were thinking of a reason why the semantics *should* be changed or added, and I was rebutting your argument that it must be used when a full barrier() is too heavy (ie. by pointing out that order() has superior semantics anyway). Why do I keep repeating the same things? I'll not continue bloating this thread until a new valid point comes up...
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
On Fri, 17 Aug 2007, Nick Piggin wrote: > Satyam Sharma wrote: > > On Fri, 17 Aug 2007, Nick Piggin wrote: > > > Satyam Sharma wrote: > > > > > > It is very obvious. msleep calls schedule() (ie. sleeps), which is > > > always a barrier. > > > > Probably you didn't mean that, but no, schedule() is not barrier because > > it sleeps. It's a barrier because it's invisible. > > Where did I say it is a barrier because it sleeps? Just below. What you wrote: > It is always a barrier because, at the lowest level, schedule() (and thus > anything that sleeps) is defined to always be a barrier. "It is always a barrier because, at the lowest level, anything that sleeps is defined to always be a barrier". > Regardless of > whatever obscure means the compiler might need to infer the barrier. > > In other words, you can ignore those obscure details because schedule() is > always going to have an explicit barrier in it. I didn't quite understand what you said here, so I'll tell what I think: * foo() is a compiler barrier if the definition of foo() is invisible to the compiler at a callsite. * foo() is also a compiler barrier if the definition of foo() includes a barrier, and it is inlined at the callsite. If the above is wrong, or if there's something else at play as well, do let me know. > > > The "unobvious" thing is that you wanted to know how the compiler knows > > > a function is a barrier -- answer is that if it does not *know* it is not > > > a barrier, it must assume it is a barrier. > > > > True, that's clearly what happens here. But are you're definitely joking > > that this is "obvious" in terms of code-clarity, right? > > No. If you accept that barrier() is implemented correctly, and you know > that sleeping is defined to be a barrier, Curiously, that's the second time you've said "sleeping is defined to be a (compiler) barrier". How does the compiler even know if foo() is a function that "sleeps"? 
Do compilers have some notion of "sleeping" to ensure they automatically assume a compiler barrier whenever such a function is called? Or are you saying that the compiler can see the barrier() inside said function ... nopes, you're saying quite the opposite below.

> then its perfectly clear. You
> don't have to know how the compiler "knows" that some function contains
> a barrier.

I think I do, why not? Would appreciate if you could elaborate on this.

Satyam
Marvell 88E8056 gigabit ethernet controller
Hi all,

I've read where the onboard Marvell lan controller on some Gigabyte boards don't work. I've got two systems using the same Gigabyte board; on one the LAN works, on the other it dies like described by others. Here's the systems:

Working system:
Gigabyte 965P-DS3 rev 3.3 (BIOS F10)
Core2 Q6600
2GB Corsair XMS2 memory
kernel 2.6.22.3
lspci for LAN controller:
04:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8056 PCI-E Gigabit Ethernet Controller (rev 14)

Broken system:
Gigabyte 965P-DS3 rev 3.3 (BIOS F10)
Core2 E4400
2GB Corsair XMS2 memory
kernel 2.6.22.3
lspci for LAN controller:
03:00.0 Ethernet controller: Marvell Technology Group Ltd. Unknown device 4364 (rev 12)

The BIOS for the two systems are setup the same and the config for the kernels are the same too. I've actually tried taking the kernel from the working system and booting it on the broken one but still the LAN dies after a couple of seconds. The working system has one card plugged in (nvidia based PCI-X video card); I've taken that card and plugged it into the broken system, booted the same kernel, and it still dies after a while.

I will gladly provide any info needed if it can help in getting this chipset working on the Gigabyte boards.

Thanks,
Kevin
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
On Fri, 17 Aug 2007, Nick Piggin wrote: > Satyam Sharma wrote: > [...] > > You think both these are equivalent in terms of "looks": > > > > | > > while (!atomic_read(&v)) { | while (!atomic_read_xxx(&v)) { > > ... | ... > > cpu_relax_no_barrier(); | > > cpu_relax_no_barrier(); > > order_atomic(&v); | } > > } | > > > > (where order_atomic() is an atomic_t > > specific wrapper as you mentioned below) > > > > ? > > I think the LHS is better if your atomic_read_xxx primitive is using the > crazy one-sided barrier, ^ I'd say it's purposefully one-sided. > because the LHS code you immediately know what > barriers are happening, and with the RHS you have to look at the > atomic_read_xxx > definition. No. As I said, the _xxx (whatever the heck you want to name it as) should give the same heads-up that your "order_atomic" thing is supposed to give. > If your atomic_read_xxx implementation was more intuitive, then both are > pretty well equal. More lines != ugly code. > > > [...] > > What bugs? > > You can't think for yourself? Your atomic_read_volatile contains a compiler > barrier to the atomic variable before the load. 2 such reads from different > locations look like this: > > asm volatile("" : "+m" (v1)); > atomic_read(&v1); > asm volatile("" : "+m" (v2)); > atomic_read(&v2); > > Which implies that the load of v1 can be reordered to occur after the load > of v2. And how would that be a bug? (sorry, I really can't think for myself) > > > Secondly, what sort of code would do such a thing? > > > > See the nodemgr_host_thread() that does something similar, though not > > exactly same. > > I'm sorry, all this waffling about made up code which might do this and > that is just a waste of time. First, you could try looking at the code. And by the way, as I've already said (why do *require* people to have to repeat things to you?) this isn't even about only existing code. 
Re: [kvm-devel] [PATCH/RFC 3/4]Introduce "account modifiers" mechanism
Laurent Vivier wrote: >> >>> - remove PATCH 3, and add in task_struct a "ktime vtime" where we accumulate >>> guest time (by calling something like guest_enter() and guest_exit() from >>> the >>> virtualization engine), and when in account_system_time() we have cputime > >>> vtime we substrate vtime from cputime and add vtime to user time and guest >>> time. >>> But doing like this we freeze in kernel/sched.c the link between system >>> time, >>> user time and guest time (i.e. system time = system time - vtime, user time >>> = >>> user time + vtime and guest time = guest time + vtime). >>> >> Actually, I think we can set a per-cpu "in_guest" flag for the scheduler >> code, which then knows to add the tick to the guest time. That seems >> the simplest possible solution. >> >> lguest or kvm would set the flag before running the guest (which is done >> with preempt disabled or using preemption hooks), and reset it >> afterwards. >> >> Thoughts? >> > > It was my first attempt (except I didn't have a per-cpu flag, but a per-task > flag), it's not visible but I love simplicity... ;-) > > A KVM VCPU is stopped by preemption, so when we enter in scheduler we have > exited from VCPU and thus this flags is off (so we account 0 to the guest). > What > I did then is "set the flag on when we enter in the VCPU, and > "account_system_time()" sets the flag off when it adds this timeslice to > cpustat > (and compute correctly guest, user, system time). But I didn't like this idea > because all code executed after we entered in the VCPU is accounted to the > guest > until we have an account_system_time() and I suppose we can have real system > time in this part. And I guess a VCPU can be less than 1 ms (unit of cputime) > in > a timeslice. > > So ? What's best ? > The normal user/system accounting has the same issue, no? Whereever we happen to land (kernel or user) gets the whole tick. So I think it is okay to have the same limitation for guest time. 
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Satyam Sharma wrote: On Fri, 17 Aug 2007, Nick Piggin wrote: Satyam Sharma wrote: On Fri, 17 Aug 2007, Nick Piggin wrote: Satyam Sharma wrote: It is very obvious. msleep calls schedule() (ie. sleeps), which is always a barrier. Probably you didn't mean that, but no, schedule() is not barrier because it sleeps. It's a barrier because it's invisible. Where did I say it is a barrier because it sleeps? Just below. What you wrote: It is always a barrier because, at the lowest level, schedule() (and thus anything that sleeps) is defined to always be a barrier. "It is always a barrier because, at the lowest level, anything that sleeps is defined to always be a barrier". ... because it must call schedule and schedule is a barrier. Regardless of whatever obscure means the compiler might need to infer the barrier. In other words, you can ignore those obscure details because schedule() is always going to have an explicit barrier in it. I didn't quite understand what you said here, so I'll tell what I think: * foo() is a compiler barrier if the definition of foo() is invisible to the compiler at a callsite. * foo() is also a compiler barrier if the definition of foo() includes a barrier, and it is inlined at the callsite. If the above is wrong, or if there's something else at play as well, do let me know. Right. The "unobvious" thing is that you wanted to know how the compiler knows a function is a barrier -- answer is that if it does not *know* it is not a barrier, it must assume it is a barrier. True, that's clearly what happens here. But are you're definitely joking that this is "obvious" in terms of code-clarity, right? No. If you accept that barrier() is implemented correctly, and you know that sleeping is defined to be a barrier, Curiously, that's the second time you've said "sleeping is defined to be a (compiler) barrier". _In Linux,_ sleeping is defined to be a compiler barrier. How does the compiler even know if foo() is a function that "sleeps"? 
Do compilers have some notion of "sleeping" to ensure they automatically assume a compiler barrier whenever such a function is called? Or are you saying that the compiler can see the barrier() inside said function ... nopes, you're saying quite the opposite below. You're getting too worried about the compiler implementation. Start by assuming that it does work ;) then its perfectly clear. You don't have to know how the compiler "knows" that some function contains a barrier. I think I do, why not? Would appreciate if you could elaborate on this. If a function is not completely visible to the compiler (so it can't determine whether a barrier could be in it or not), then it must always assume it will contain a barrier so it always does the right thing.
Re: [kvm-devel] [PATCH/RFC 4/4, second shot]KVM uses "account_guest_time()"
Laurent Vivier wrote: > KVM updates vtime in task_struct to allow account_guest_time() to modify user, > system and guest time in cpustat accordingly. > > --- kvm.orig/drivers/kvm/Kconfig 2007-08-17 10:24:46.0 +0200 > +++ kvm/drivers/kvm/Kconfig 2007-08-17 10:25:25.0 +0200 > @@ -41,4 +41,10 @@ > Provides support for KVM on AMD processors equipped with the AMD-V > (SVM) extensions. > > +config GUEST_ACCOUNTING > + bool "Virtual Machine accounting support" > + depends on KVM > + ---help--- > + Allows to account CPU time used by the Virtual Machines. > + Other way round. In the patch that adds account_guest_time(), have a CONFIG_GUEST_ACCOUNTING (defaulting to n, with no description, help, or dependencies). Then, CONFIG_KVM can select GUEST_ACCOUNTING. The advantages of this are: - the puppyvisor can also select this if it so wishes - we don't have core code reference some obscure module CONFIG_PREEMPT_NOTIFIERS does the same thing. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [kvm-devel] [PATCH/RFC 3/4]Introduce "account modifiers" mechanism
Avi Kivity wrote: > Laurent Vivier wrote: >>> - remove PATCH 3, and add in task_struct a "ktime vtime" where we accumulate guest time (by calling something like guest_enter() and guest_exit() from the virtualization engine), and when in account_system_time() we have cputime > vtime we subtract vtime from cputime and add vtime to user time and guest time. But doing like this we freeze in kernel/sched.c the link between system time, user time and guest time (i.e. system time = system time - vtime, user time = user time + vtime and guest time = guest time + vtime). >>> Actually, I think we can set a per-cpu "in_guest" flag for the scheduler >>> code, which then knows to add the tick to the guest time. That seems >>> the simplest possible solution. >>> >>> lguest or kvm would set the flag before running the guest (which is done >>> with preempt disabled or using preemption hooks), and reset it >>> afterwards. >>> >>> Thoughts? >>> >> It was my first attempt (except I didn't have a per-cpu flag, but a per-task >> flag), it's not visible but I love simplicity... ;-) >> >> A KVM VCPU is stopped by preemption, so when we enter the scheduler we have >> exited from the VCPU and thus this flag is off (so we account 0 to the guest). >> What >> I did then is "set the flag on when we enter the VCPU, and >> "account_system_time()" sets the flag off when it adds this timeslice to >> cpustat >> (and compute correctly guest, user, system time). But I didn't like this idea >> because all code executed after we entered the VCPU is accounted to the >> guest >> until we have an account_system_time() and I suppose we can have real system >> time in this part. And I guess a VCPU can be less than 1 ms (unit of >> cputime) in >> a timeslice. >> >> So? What's best? >> > > The normal user/system accounting has the same issue, no? Wherever we > happen to land (kernel or user) gets the whole tick. Yes... 
but perhaps I should rewrite this too ;-) > So I think it is okay to have the same limitation for guest time. OK, so we can go back to my first patch. Who can decide to introduce this into the kernel? Laurent -- - [EMAIL PROTECTED] -- "Software is hard" - Donald Knuth
[PATCH -mm] x86_64: Save registers in saved_context during suspend and hibernation
From: Rafael J. Wysocki <[EMAIL PROTECTED]> During hibernation and suspend on x86_64 save CPU registers in the saved_context structure rather than in a handful of separate variables. Signed-off-by: Rafael J. Wysocki <[EMAIL PROTECTED]> Looks-ok-to: Pavel Machek <[EMAIL PROTECTED]> --- arch/x86_64/kernel/acpi/wakeup.S | 101 --- arch/x86_64/kernel/asm-offsets.c | 28 ++ arch/x86_64/kernel/suspend.c |6 -- arch/x86_64/kernel/suspend_asm.S | 72 ++- include/asm-x86_64/suspend.h | 23 ++-- 5 files changed, 125 insertions(+), 105 deletions(-) Index: linux-2.6.23-rc3/arch/x86_64/kernel/asm-offsets.c === --- linux-2.6.23-rc3.orig/arch/x86_64/kernel/asm-offsets.c +++ linux-2.6.23-rc3/arch/x86_64/kernel/asm-offsets.c @@ -76,6 +76,34 @@ int main(void) DEFINE(pbe_orig_address, offsetof(struct pbe, orig_address)); DEFINE(pbe_next, offsetof(struct pbe, next)); BLANK(); +#define ENTRY(entry) DEFINE(pt_regs_ ## entry, offsetof(struct pt_regs, entry)) + ENTRY(rbx); + ENTRY(rbx); + ENTRY(rcx); + ENTRY(rdx); + ENTRY(rsp); + ENTRY(rbp); + ENTRY(rsi); + ENTRY(rdi); + ENTRY(r8); + ENTRY(r9); + ENTRY(r10); + ENTRY(r11); + ENTRY(r12); + ENTRY(r13); + ENTRY(r14); + ENTRY(r15); + ENTRY(eflags); + BLANK(); +#undef ENTRY +#define ENTRY(entry) DEFINE(saved_context_ ## entry, offsetof(struct saved_context, entry)) + ENTRY(cr0); + ENTRY(cr2); + ENTRY(cr3); + ENTRY(cr4); + ENTRY(cr8); + BLANK(); +#undef ENTRY DEFINE(TSS_ist, offsetof(struct tss_struct, ist)); BLANK(); DEFINE(crypto_tfm_ctx_offset, offsetof(struct crypto_tfm, __crt_ctx)); Index: linux-2.6.23-rc3/include/asm-x86_64/suspend.h === --- linux-2.6.23-rc3.orig/include/asm-x86_64/suspend.h +++ linux-2.6.23-rc3/include/asm-x86_64/suspend.h @@ -3,6 +3,9 @@ * Based on code * Copyright 2001 Patrick Mochel <[EMAIL PROTECTED]> */ +#ifndef __ASM_X86_64_SUSPEND_H +#define __ASM_X86_64_SUSPEND_H + #include #include @@ -12,8 +15,9 @@ arch_prepare_suspend(void) return 0; } -/* Image of the saved processor state. If you touch this, fix acpi_wakeup.S. 
*/ +/* Image of the saved processor state. If you touch this, fix acpi/wakeup.S. */ struct saved_context { + struct pt_regs regs; u16 ds, es, fs, gs, ss; unsigned long gs_base, gs_kernel_base, fs_base; unsigned long cr0, cr2, cr3, cr4, cr8; @@ -29,27 +33,14 @@ struct saved_context { unsigned long tr; unsigned long safety; unsigned long return_address; - unsigned long eflags; } __attribute__((packed)); -/* We'll access these from assembly, so we'd better have them outside struct */ -extern unsigned long saved_context_eax, saved_context_ebx, saved_context_ecx, saved_context_edx; -extern unsigned long saved_context_esp, saved_context_ebp, saved_context_esi, saved_context_edi; -extern unsigned long saved_context_r08, saved_context_r09, saved_context_r10, saved_context_r11; -extern unsigned long saved_context_r12, saved_context_r13, saved_context_r14, saved_context_r15; -extern unsigned long saved_context_eflags; - #define loaddebug(thread,register) \ set_debugreg((thread)->debugreg##register, register) extern void fix_processor_context(void); -extern unsigned long saved_rip; -extern unsigned long saved_rsp; -extern unsigned long saved_rbp; -extern unsigned long saved_rbx; -extern unsigned long saved_rsi; -extern unsigned long saved_rdi; - /* routines for saving/restoring kernel state */ extern int acpi_save_state_mem(void); + +#endif /* __ASM_X86_64_SUSPEND_H */ Index: linux-2.6.23-rc3/arch/x86_64/kernel/suspend.c === --- linux-2.6.23-rc3.orig/arch/x86_64/kernel/suspend.c +++ linux-2.6.23-rc3/arch/x86_64/kernel/suspend.c @@ -19,12 +19,6 @@ extern const void __nosave_begin, __nosa struct saved_context saved_context; -unsigned long saved_context_eax, saved_context_ebx, saved_context_ecx, saved_context_edx; -unsigned long saved_context_esp, saved_context_ebp, saved_context_esi, saved_context_edi; -unsigned long saved_context_r08, saved_context_r09, saved_context_r10, saved_context_r11; -unsigned long saved_context_r12, saved_context_r13, saved_context_r14, 
saved_context_r15; -unsigned long saved_context_eflags; - void __save_processor_state(struct saved_context *ctxt) { kernel_fpu_begin(); Index: linux-2.6.23-rc3/arch/x86_64/kernel/suspend_asm.S === --- linux-2.6.23-rc3.orig/arch/x86_64/kernel/suspend_asm.S +++ linux-2.6.23-rc3/arch/x86_64/kernel/suspend_asm.S @@ -17,24 +17,24 @@ #include
[PATCH -mm] PNP: Make pnpacpi_suspend handle errors
From: Rafael J. Wysocki <[EMAIL PROTECTED]> pnpacpi_suspend() doesn't check the result returned by acpi_pm_device_sleep_state() before passing it to acpi_bus_set_power(), which may not be desirable. Make it select the target power state of the device using its second argument if acpi_pm_device_sleep_state() fails. Signed-off-by: Rafael J. Wysocki <[EMAIL PROTECTED]> Looks-ok-to: Pavel Machek <[EMAIL PROTECTED]> --- drivers/pnp/pnpacpi/core.c | 15 ++- 1 file changed, 10 insertions(+), 5 deletions(-) Index: linux-2.6.23-rc3/drivers/pnp/pnpacpi/core.c === --- linux-2.6.23-rc3.orig/drivers/pnp/pnpacpi/core.c2007-08-15 14:56:58.0 +0200 +++ linux-2.6.23-rc3/drivers/pnp/pnpacpi/core.c 2007-08-15 15:08:36.0 +0200 @@ -130,11 +130,16 @@ static int pnpacpi_disable_resources(str #ifdef CONFIG_ACPI_SLEEP static int pnpacpi_suspend(struct pnp_dev *dev, pm_message_t state) { - return acpi_bus_set_power((acpi_handle) dev->data, - acpi_pm_device_sleep_state(&dev->dev, -device_may_wakeup -(&dev->dev), -NULL)); + int power_state; + + power_state = acpi_pm_device_sleep_state(&dev->dev, + device_may_wakeup(&dev->dev), + NULL); + if (power_state < 0) + power_state = (state.event == PM_EVENT_ON) ? + ACPI_STATE_D0 : ACPI_STATE_D3; + + return acpi_bus_set_power((acpi_handle) dev->data, power_state); } static int pnpacpi_resume(struct pnp_dev *dev) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [kvm-devel] [PATCH/RFC 4/4, second shot]KVM uses "account_guest_time()"
Avi Kivity wrote: > Laurent Vivier wrote: >> KVM updates vtime in task_struct to allow account_guest_time() to modify >> user, >> system and guest time in cpustat accordingly. >> > >> --- kvm.orig/drivers/kvm/Kconfig 2007-08-17 10:24:46.0 +0200 >> +++ kvm/drivers/kvm/Kconfig 2007-08-17 10:25:25.0 +0200 >> @@ -41,4 +41,10 @@ >>Provides support for KVM on AMD processors equipped with the AMD-V >>(SVM) extensions. >> >> +config GUEST_ACCOUNTING >> +bool "Virtual Machine accounting support" >> +depends on KVM >> +---help--- >> + Allows to account CPU time used by the Virtual Machines. >> + > > > Other way round. In the patch that adds account_guest_time(), have a > CONFIG_GUEST_ACCOUNTING (defaulting to n, with no description, help, or > dependencies). Then, CONFIG_KVM can select GUEST_ACCOUNTING. I was wondering in which Kconfig I can put it... > The advantages of this are: > - the puppyvisor can also select this if it so wishes > - we don't have core code reference some obscure module I agree. > CONFIG_PREEMPT_NOTIFIERS does the same thing. I saw Laurent -- - [EMAIL PROTECTED] -- "Software is hard" - Donald Knuth
Re: [patch] encapsulate uevent()/add_uevent_var() buffer handling
On Tue, 14 Aug 2007 15:15:12 +0200, Kay Sievers <[EMAIL PROTECTED]> wrote: > From: Kay Sievers <[EMAIL PROTECTED]> > Subject: Driver core: change add_uevent_var to use a struct This still needs some (trivial) s390 fixes: Signed-off-by: Cornelia Huck <[EMAIL PROTECTED]> --- drivers/s390/cio/device.c|2 -- drivers/s390/crypto/ap_bus.c |2 +- 2 files changed, 1 insertion(+), 3 deletions(-) --- linux-2.6.orig/drivers/s390/crypto/ap_bus.c +++ linux-2.6/drivers/s390/crypto/ap_bus.c @@ -458,7 +458,7 @@ static int ap_bus_match(struct device *d * uevent function for AP devices. It sets up a single environment * variable DEV_TYPE which contains the hardware device type. */ -static int ap_uevent (struct device *dev, struct kobj_uevent_env) +static int ap_uevent(struct device *dev, struct kobj_uevent_env *env) { struct ap_device *ap_dev = to_ap_dev(dev); int retval = 0; --- linux-2.6.orig/drivers/s390/cio/device.c +++ linux-2.6/drivers/s390/cio/device.c @@ -82,8 +82,6 @@ static int ccw_uevent(struct device *dev { struct ccw_device *cdev = to_ccwdev(dev); struct ccw_device_id *id = &(cdev->id); - int i = 0; - int len = 0; int ret; char modalias_buf[30]; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [kvm-devel] [PATCH/RFC 3/4, second shot]Introduce "account_guest_time"
Laurent Vivier wrote: > This is another way to compute guest time... I remove the "account modifiers" > mechanism and call directly account_guest_time() from account_system_time(). > account_system_time() computes user, system and guest times according to the value > accumulated in vtime (a ktime_t) in task_struct by the virtual machine. > > @@ -3246,6 +3277,10 @@ > struct rq *rq = this_rq(); > cputime64_t tmp; > > +#ifdef CONFIG_GUEST_ACCOUNTING > + cputime = account_guest_time(p, cputime); > +#endif > + > p->stime = cputime_add(p->stime, cputime); > > /* Add system time to cpustat. */ In order to reduce the impact on whatever function this is in (use diff -p please), you can always have a definition of account_guest_time: #else static cputime_t account_guest_time(struct task_struct *p, cputime_t cputime) { return cputime; } #endif This way the #ifdef/#endif is not necessary when calling it. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
On Fri, 17 Aug 2007, Nick Piggin wrote: > Satyam Sharma wrote: > > > On Fri, 17 Aug 2007, Nick Piggin wrote: > > > > > Because they should be thinking about them in terms of barriers, over > > > which the compiler / CPU is not to reorder accesses or cache memory > > > operations, rather than "special" "volatile" accesses. > > > > This is obviously just a taste thing. Whether to have that forget(x) > > barrier as something author should explicitly sprinkle appropriately > > in appropriate places in the code by himself or use a primitive that > > includes it itself. > > That's not obviously just taste to me. Not when the primitive has many > (perhaps, the majority) of uses that do not require said barriers. And > this is not solely about the code generation (which, as Paul says, is > relatively minor even on x86). See, you do *require* people to have to repeat the same things to you! As has been written about enough times already, and if you followed the discussion on this thread, I am *not* proposing that atomic_read()'s semantics be changed to have any extra barriers. What is proposed is a different atomic_read_xxx() variant thereof, that those can use who do want that. Now whether to have a kind of barrier ("volatile", whatever) in the atomic_read_xxx() itself, or whether to make the code writer himself to explicitly write the order(x) appropriately in appropriate places in the code _is_ a matter of taste. > > That's definitely the point, why not. This is why "barrier()", being > > heavy-handed, is not the best option. > > That is _not_ the point [...] Again, you're requiring me to repeat things that were already made evident on this thread (if you follow it). This _is_ the point, because a lot of loops out there (too many of them, I WILL NOT bother citing file_name:line_number) end up having to use a barrier just because they're using a loop-exit-condition that depends on a value returned by atomic_read(). 
It would be good for them if they used an atomic_read_xxx() primitive that gave these "volatility" semantics without junking compiler optimizations for other memory references. > because there has already been an alternative posted Whether that alternative (explicitly using forget(x), or wrappers thereof, such as the "order_atomic" you proposed) is better than other alternatives (such as atomic_read_xxx() which includes the volatility behaviour in itself) is still open, and precisely what we started discussing just one mail back. (The above was also mostly stuff I had to repeat for you, sadly.) > that better conforms with Linux barrier > API and is much more widely useful and more usable. I don't think so. (Now *this* _is_ the "taste-dependent matter" that I mentioned earlier.) > If you are so worried > about > barrier() being too heavyweight, then you're off to a poor start by wanting to > add a few K of kernel text by making atomic_read volatile. Repeating myself, for the N'th time, NO, I DON'T want to make atomic_read have "volatile" semantics. > > > because as I also mentioned, the logical extension > > > to Linux's barrier API to handle this is the order(x) macro. Again, not > > > special volatile accessors. > > > > Sure, that forget(x) macro _is_ proposed to be made part of the generic > > API. Doesn't explain why not to define/use primitives that have volatility > > semantics in themselves, though (taste matters apart). > If you follow the discussion You were thinking of a reason why the > semantics *should* be changed or added, and I was rebutting your argument > that it must be used when a full barrier() is too heavy (ie. by pointing > out that order() has superior semantics anyway). Amazing. Either you have reading comprehension problems, or else, please try reading this thread (or at least this sub-thread) again. I don't want _you_ blaming _me_ for having to repeat things to you all over again. 
Re: [PATCH] hostfs: Remove pointless if statement
On Fri, Aug 17, 2007 at 12:24:23PM +0530, Satyam Sharma wrote: > We normally use "comments" for that, not dead code that a compiler > then elides ;-) I'd argue that comments are for when you can't make the code self-explanatory. > [PATCH] hostfs: Remove pointless if statement > > And replace with comment saying we don't handle ctime. > Also some coding-style fixes while I was there. I'll forward this upstream. Jeff -- Work email - jdike at linux dot intel dot com
Re: [kvm-devel] [PATCH/RFC 3/4]Introduce "account modifiers" mechanism
On Friday, 17 August 2007, Laurent Vivier wrote: > > The normal user/system accounting has the same issue, no? Wherever we > > happen to land (kernel or user) gets the whole tick. > > Yes... but perhaps I should rewrite this too ;-) If you look further, you will see that this was actually rewritten in 2.6.12 and that's why we have cputime_t. The infrastructure is currently only used by s390 and ppc64. On s390 we use our cpu timer to get the current time on each syscall/irq/context switch etc to get precise accounting data for system/user/irq/softirq/nice. We also get steal time (this is interesting for the guest: how much of my cpu was actually not available because the hypervisor scheduled me away). I don't know enough about other architectures to say which one could exploit this infrastructure as well. The current git tree is somewhat broken for CONFIG_VIRT_CPU_ACCOUNTING due to the accounting changes introduced by CFS - we work on this. If you are interested in the cputime stuff, you can have a look at arch/s390/kernel/vtime.c (account_system_vtime, account_vtime) as well as: http://git.kernel.org/?p=linux/kernel/git/torvalds/old-2.6-bkcvs.git;a=commit;h=6be7071fdd321c206b1ee7a3e534782f25572830 for the first introduction and http://git.kernel.org/?p=linux/kernel/git/torvalds/old-2.6-bkcvs.git;a=commit;h=db59f519d37f80683f3611927f6b5c3bdfc0f9e6 for the s390 exploitation. Christian
Re: [kvm-devel] [PATCH/RFC 3/4]Introduce "account modifiers" mechanism
Avi Kivity wrote: [...] > > The normal user/system accounting has the same issue, no? Whereever we > happen to land (kernel or user) gets the whole tick. > > So I think it is okay to have the same limitation for guest time. > So this is how it looks like. PATCH 1 and 2 are always a prerequisite. Laurent Index: kvm/include/linux/sched.h === --- kvm.orig/include/linux/sched.h 2007-08-17 15:07:02.0 +0200 +++ kvm/include/linux/sched.h 2007-08-17 15:08:19.0 +0200 @@ -1310,6 +1310,7 @@ #define PF_STARTING0x0002 /* being created */ #define PF_EXITING 0x0004 /* getting shut down */ #define PF_EXITPIDONE 0x0008 /* pi exit done on shut down */ +#define PF_VCPU0x0010 /* I'm a virtual CPU */ #define PF_FORKNOEXEC 0x0040 /* forked but didn't exec */ #define PF_SUPERPRIV 0x0100 /* used super-user privileges */ #define PF_DUMPCORE0x0200 /* dumped core */ Index: kvm/kernel/sched.c === --- kvm.orig/kernel/sched.c 2007-08-17 14:42:43.0 +0200 +++ kvm/kernel/sched.c 2007-08-17 15:16:20.0 +0200 @@ -3246,10 +3246,22 @@ struct rq *rq = this_rq(); cputime64_t tmp; + tmp = cputime_to_cputime64(cputime); + if (p->flags & PF_VCPU) { + p->utime = cputime_add(p->utime, cputime); + p->gtime = cputime_add(p->gtime, cputime); + + cpustat->guest = cputime64_add(cpustat->guest, tmp); + cpustat->user = cputime64_add(cpustat->user, tmp); + + p->flags &= ~PF_VCPU; + + return; + } + p->stime = cputime_add(p->stime, cputime); /* Add system time to cpustat. 
*/ - tmp = cputime_to_cputime64(cputime); if (hardirq_count() - hardirq_offset) cpustat->irq = cputime64_add(cpustat->irq, tmp); else if (softirq_count()) Index: kvm/drivers/kvm/kvm.h === --- kvm.orig/drivers/kvm/kvm.h 2007-08-17 15:26:16.0 +0200 +++ kvm/drivers/kvm/kvm.h 2007-08-17 15:29:46.0 +0200 @@ -589,6 +589,19 @@ int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run); +#ifndef PF_VCPU +#define PF_VCPU0 /* no kernel support */ +#endif + +static inline void kvm_guest_enter(void) +{ + current->flags |= PF_VCPU; +} + +static inline void kvm_guest_exit(void) +{ +} + static inline int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t gva, u32 error_code) { Index: kvm/drivers/kvm/svm.c === --- kvm.orig/drivers/kvm/svm.c 2007-08-17 15:26:16.0 +0200 +++ kvm/drivers/kvm/svm.c 2007-08-17 15:27:03.0 +0200 @@ -1404,6 +1404,7 @@ clgi(); vcpu->guest_mode = 1; + kvm_guest_enter(); if (vcpu->requests) if (test_and_clear_bit(KVM_TLB_FLUSH, &vcpu->requests)) svm_flush_tlb(vcpu); @@ -1536,6 +1537,7 @@ #endif : "cc", "memory" ); + kvm_guest_exit(); vcpu->guest_mode = 0; if (vcpu->fpu_active) { Index: kvm/drivers/kvm/vmx.c === --- kvm.orig/drivers/kvm/vmx.c 2007-08-17 15:26:16.0 +0200 +++ kvm/drivers/kvm/vmx.c 2007-08-17 15:27:45.0 +0200 @@ -2078,6 +2078,7 @@ local_irq_disable(); vcpu->guest_mode = 1; + kvm_guest_enter(); if (vcpu->requests) if (test_and_clear_bit(KVM_TLB_FLUSH, &vcpu->requests)) vmx_flush_tlb(vcpu); @@ -2198,6 +2199,7 @@ [cr2]"i"(offsetof(struct kvm_vcpu, cr2)) : "cc", "memory" ); + kvm_guest_exit(); vcpu->guest_mode = 0; local_irq_enable();
[PATCH] Isolate some explicit usage of task->tgid
With pid namespaces this field is now dangerous to use explicitly, so hide it behind the helpers. Also the pid and pgrp fields o task_struct and signal_struct are to be deprecated. Unfortunately this patch cannot be sent right now as this leads to tons of warnings, so start isolating them, and deprecate later. Actually the p->tgid == pid has to be changed to has_group_leader_pid(), but Oleg pointed out that this is the same and thread_group_leader() is more preferable. Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]> Acked-by: Oleg Nesterov <[EMAIL PROTECTED]> --- fs/exec.c |4 ++-- fs/proc/base.c|4 ++-- include/linux/sched.h |6 ++ kernel/posix-cpu-timers.c | 12 ++-- kernel/posix-timers.c |4 ++-- kernel/ptrace.c |2 +- kernel/signal.c |2 +- mm/oom_kill.c |2 +- 8 files changed, 21 insertions(+), 15 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index f8cebf8..24b669d 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1691,6 +1691,12 @@ static inline int has_group_leader_pid(s return p->pid == p->tgid; } +static inline +int same_thread_group(struct task_struct *p1, struct task_struct *p2) +{ + return p1->tgid == p2->tgid; +} + static inline struct task_struct *next_thread(const struct task_struct *p) { return list_entry(rcu_dereference(p->thread_group.next), diff --git a/fs/exec.c b/fs/exec.c index 35013f2..415cefb 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -865,8 +865,8 @@ static int de_thread(struct task_struct write_lock_irq(&tasklist_lock); - BUG_ON(leader->tgid != tsk->tgid); - BUG_ON(tsk->pid == tsk->tgid); + BUG_ON(!same_thread_group(leader, tsk)); + BUG_ON(thread_group_leader(tsk)); /* * An exec() starts a new thread group with the * TGID of the previous thread group. Rehash the diff --git a/fs/proc/base.c b/fs/proc/base.c index e3009ab..31e7dfe 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -2288,7 +2288,7 @@ retry: * found doesn't happen to be a thread group leader. * As we don't care in the case of readdir. 
*/ - if (!task || !has_group_leader_pid(task)) + if (!task || !thread_group_leader(task)) goto retry; get_task_struct(task); } @@ -2474,7 +2474,7 @@ static struct dentry *proc_task_lookup(s rcu_read_unlock(); if (!task) goto out; - if (leader->tgid != task->tgid) + if (!same_thread_group(leader, task)) goto out_drop_task; result = proc_task_instantiate(dir, dentry, task, NULL); diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c index b53c8fc..68c9637 100644 --- a/kernel/posix-cpu-timers.c +++ b/kernel/posix-cpu-timers.c @@ -21,8 +21,8 @@ static int check_clock(const clockid_t w read_lock(&tasklist_lock); p = find_task_by_pid(pid); - if (!p || (CPUCLOCK_PERTHREAD(which_clock) ? - p->tgid != current->tgid : p->tgid != pid)) { + if (!p || !(CPUCLOCK_PERTHREAD(which_clock) ? + same_thread_group(p, current) : thread_group_leader(p))) { error = -EINVAL; } read_unlock(&tasklist_lock); @@ -308,13 +308,13 @@ int posix_cpu_clock_get(const clockid_t p = find_task_by_pid(pid); if (p) { if (CPUCLOCK_PERTHREAD(which_clock)) { - if (p->tgid == current->tgid) { + if (same_thread_group(p, current)) { error = cpu_clock_sample(which_clock, p, &rtn); } } else { read_lock(&tasklist_lock); - if (p->tgid == pid && p->signal) { + if (thread_group_leader(p) && p->signal) { error = cpu_clock_sample_group(which_clock, p, &rtn); @@ -355,7 +355,7 @@ int posix_cpu_timer_create(struct k_itim p = current; } else { p = find_task_by_pid(pid); - if (p && p->tgid != current->tgid) + if (p && !same_thread_group(p, current)) p = NULL; } } else { @@ -363,7 +363,7 @@ int posix_cpu_timer_create(struct k_itim p = current->group_leader; } else { p = find_task_by_pid(pid); - if
Re: System call interposition/unprotecting the table
On Wed, Aug 15, 2007 at 12:48:35AM +0200, Andi Kleen wrote: > > > In general the .data protection is only considered a debugging > > > feature. I don't know why Fedora enables it in their production > > > kernels. > > > > That would be because we think you are wrong 8) > > Well, it might at best buy you a few weeks/months in > terms of the exploit arms race, but thrash your users' TLBs > forever. Show me a single situation where this matters. When we first enabled it, we tried both benchmarks and real-world loads, and it didn't matter at all. Unless something fundamental has changed since then, the story should still be the same. Dave -- http://www.codemonkey.org.uk
[PATCH] [SCSI] arcmsr: Fix error handling
Fixed error handling in queuecommand(), now all READ_ and WRITE_ commands are aborted in case of RAID is gone. Before only READ_6 and WRITE_6 commands were aborted. Signed-off-by: Maik Hampel <[EMAIL PROTECTED]> --- diff --git a/drivers/scsi/arcmsr/arcmsr_hba.c b/drivers/scsi/arcmsr/arcmsr_hba.c index 0ddfc21..7a001ad 100644 --- a/drivers/scsi/arcmsr/arcmsr_hba.c +++ b/drivers/scsi/arcmsr/arcmsr_hba.c @@ -1133,17 +1133,27 @@ static int arcmsr_queue_command(struct scsi_cmnd *cmd, uint8_t block_cmd; block_cmd = cmd->cmnd[0] & 0x0f; - if (block_cmd == 0x08 || block_cmd == 0x0a) { - printk(KERN_NOTICE - "arcmsr%d: block 'read/write'" - "command with gone raid volume" - " Cmd = %2x, TargetId = %d, Lun = %d \n" - , acb->host->host_no - , cmd->cmnd[0] - , target, lun); - cmd->result = (DID_NO_CONNECT << 16); - cmd->scsi_done(cmd); - return 0; + switch (block_cmd) { + case READ_6: + case READ_10: + case READ_12: + case READ_16: + case WRITE_6: + case WRITE_10: + case WRITE_12: + case WRITE_16: + printk(KERN_NOTICE + "arcmsr%d: block 'read/write'" + "command with gone raid volume" + " Cmd = %2x, TargetId = %d, Lun = %d \n" + , acb->host->host_no + , cmd->cmnd[0] + , target, lun); + cmd->result = (DID_NO_CONNECT << 16); + cmd->scsi_done(cmd); + return 0; + default: + break; } } if (atomic_read(&acb->ccboutstandingcount) >= -- 1.4.4.4 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
On Fri, Aug 17, 2007 at 01:09:08PM +0530, Satyam Sharma wrote: > > > On Thu, 16 Aug 2007, Paul E. McKenney wrote: > > > On Fri, Aug 17, 2007 at 07:59:02AM +0800, Herbert Xu wrote: > > > On Thu, Aug 16, 2007 at 09:34:41AM -0700, Paul E. McKenney wrote: > > > > > > > > The compiler can also reorder non-volatile accesses. For an example > > > > patch that cares about this, please see: > > > > > > > > http://lkml.org/lkml/2007/8/7/280 > > > > > > > > This patch uses an ORDERED_WRT_IRQ() in rcu_read_lock() and > > > > rcu_read_unlock() to ensure that accesses aren't reordered with respect > > > > to interrupt handlers and NMIs/SMIs running on that same CPU. > > > > > > Good, finally we have some code to discuss (even though it's > > > not actually in the kernel yet). > > > > There was some earlier in this thread as well. > > Hmm, I never quite got what all this interrupt/NMI/SMI handling and > RCU business you mentioned earlier was all about, but now that you've > pointed to the actual code and issues with it ... Glad to help... > > > First of all, I think this illustrates that what you want > > > here has nothing to do with atomic ops. The ORDERED_WRT_IRQ > > > macro occurs a lot more times in your patch than atomic > > > reads/sets. So *assuming* that it was necessary at all, > > > then having an ordered variant of the atomic_read/atomic_set > > > ops could do just as well. > > > > Indeed. If I could trust atomic_read()/atomic_set() to cause the compiler > > to maintain ordering, then I could just use them instead of having to > > create an ORDERED_WRT_IRQ(). (Or ACCESS_ONCE(), as it is called in a > > different patch.) > > +#define WHATEVER(x) (*(volatile typeof(x) *)&(x)) > > I suppose one could want volatile access semantics for stuff that's > a bit-field too, no? One could, but this is not supported in general. So if you want that, you need to use the usual bit-mask tricks and (for setting) atomic operations. 
> Also, this gives *zero* "re-ordering" guarantees that your code wants > as you've explained it below) -- neither w.r.t. CPU re-ordering (which > probably you don't care about) *nor* w.r.t. compiler re-ordering > (which you definitely _do_ care about). You are correct about CPU re-ordering (and about the fact that this example doesn't care about it), but not about compiler re-ordering. The compiler is prohibited from moving a volatile access across a sequence point. One example of a sequence point is a statement boundary. Because all of the volatile accesses in this code are separated by statement boundaries, a conforming compiler is prohibited from reordering them. > > > However, I still don't know which atomic_read/atomic_set in > > > your patch would be broken if there were no volatile. Could > > > you please point them out? > > > > Suppose I tried replacing the ORDERED_WRT_IRQ() calls with > > atomic_read() and atomic_set(). Starting with __rcu_read_lock(): > > > > o If "ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])++" > > was ordered by the compiler after > > "ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1", then > > suppose an NMI/SMI happened after the rcu_read_lock_nesting but > > before the rcu_flipctr. > > > > Then if there was an rcu_read_lock() in the SMI/NMI > > handler (which is perfectly legal), the nested rcu_read_lock() > > would believe that it could take the then-clause of the > > enclosing "if" statement. But because the rcu_flipctr per-CPU > > variable had not yet been incremented, an RCU updater would > > be within its rights to assume that there were no RCU reads > > in progress, thus possibly yanking a data structure out from > > under the reader in the SMI/NMI function. > > > > Fatal outcome. Note that only one CPU is involved here > > because these are all either per-CPU or per-task variables. > > Ok, so you don't care about CPU re-ordering. 
Still, I should let you know > that your ORDERED_WRT_IRQ() -- bad name, btw -- is still buggy. What you > want is a full compiler optimization barrier(). No. See above. > [ Your code probably works now, and emits correct code, but that's > just because of gcc did what it did. Nothing in any standard, > or in any documented behaviour of gcc, or anything about the real > (or expected) semantics of "volatile" is protecting the code here. ] Really? Why doesn't the prohibition against moving volatile accesses across sequence points take care of this? > > o If "ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1" > > was ordered by the compiler to follow the > > "ORDERED_WRT_IRQ(me->rcu_flipctr_idx) = idx", and an NMI/SMI > > happened between the two, then an __rcu_read_lock() in the NMI/SMI > > would incorrectly take the "else" clause of the enclosing "if" > > statement. If some other CPU flipped the rcu_ctrlblk.completed > > in the meantime, then the __rcu_read_lock() would (correctl