Re: [PATCH 0/3] posix timers: Extend kernel API to report more info about timers

2013-02-20 Thread Matthew Helsley
On Thu, Feb 14, 2013 at 8:18 AM, Pavel Emelyanov  wrote:
> Hi.
>
> I'm working on the checkpoint-restore project (http://criu.org). Briefly,
> its aim is to collect information about a process' state and save it so
> that later it is possible to recreate the processes in the very same state
> as they were, using the collected information.
>
> One part of the task's state is the posix timers that this task has created.
> Currently the kernel doesn't provide any API for getting information about
> what timers a process has created and what state they are in.
> I'd like to extend the posix timers API to provide more information about
> timers.
>
> Another problem with timers is the timer ID. Currently IDs are generated
> from a global IDR, which makes it impossible to restore a timer from
> the saved state in general, as the required ID may already be busy at the
> time of restore.
>
> That said, I propose to
>
> 1. Change the way timer IDs are generated. This was done some time ago, so
>I'm just re-sending this patch;

Seems fine in principle. Aside: I noticed there were some
important-looking patches to the idr usage in timer id allocation
today...

> 2. Add a system call that will list timer IDs created by the calling process;

If timers were listed in /proc like fds then you wouldn't need this
syscall. If we keep adding new syscalls like this, CRIU will be
needlessly x86-specific when it could have been written more portably.
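
For example, reading a hypothetical /proc/<pid>/timers would need
nothing arch-specific. The path and record format below are made up,
just to illustrate the kind of interface I mean:

/* Sketch: dump a task's timers from a hypothetical /proc/<pid>/timers. */
#include <stdio.h>

int main(int argc, char **argv)
{
	char path[64], line[256];
	FILE *f;

	if (argc != 2)
		return 1;
	snprintf(path, sizeof(path), "/proc/%s/timers", argv[1]);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	while (fgets(line, sizeof(line), f))	/* one record per timer */
		fputs(line, stdout);
	fclose(f);
	return 0;
}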

> 3. Add a system call that will allow getting the sigevent information about
>    a particular timer in a sigaction-like manner.

You mentioned "extending the POSIX timer API". Isn't that something
best left to standards bodies lest your changes conflict with theirs?
Again, if this were a /proc interface you wouldn't have that issue
(you'll have others ;)).

>
> This is actually an RFC to start discussion about how the described problems
> can be addressed. Thus, if the approach with new system calls is not
> acceptable, I'm OK to implement this in any other form.

My preference is for "other form" for the reasons above.

Cheers,
-Matt Helsley


Re: [rfc 5/7] fs, epoll: Add procfs fdinfo helper

2012-07-19 Thread Matthew Helsley
On Wed, Jun 27, 2012 at 4:01 AM, Cyrill Gorcunov  wrote:
> This allows us to print out the eventpoll target file descriptor,
> events and data; the /proc/pid/fdinfo/fd output consists of
>
>  | pos: 0
>  | flags:   02
>  | tfd:5 events:   1d data: 
>
> This feature is CONFIG_CHECKPOINT_RESTORE only.
>
> Signed-off-by: Cyrill Gorcunov 
> CC: Al Viro 
> CC: Alexey Dobriyan 
> CC: Andrew Morton 
> CC: Pavel Emelyanov 
> CC: James Bottomley 
> ---
>  fs/eventpoll.c |   81 +
>  1 file changed, 81 insertions(+)
>
> Index: linux-2.6.git/fs/eventpoll.c
> ===================================================================
> --- linux-2.6.git.orig/fs/eventpoll.c
> +++ linux-2.6.git/fs/eventpoll.c
> @@ -38,6 +38,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>
>  /*
>   * LOCKING:
> @@ -1897,6 +1899,83 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd,
> return error;
>  }
>
> +#if defined(CONFIG_PROC_FS) && defined(CONFIG_CHECKPOINT_RESTORE)
> +
> +struct epitem_fdinfo {
> +   struct epoll_event  ev;
> +   int fd;
> +};
> +
> +static struct epitem_fdinfo *
> +seq_lookup_fdinfo(struct proc_fdinfo_extra *extra, struct eventpoll *ep, loff_t num)
> +{
> +   struct epitem_fdinfo *fdinfo = extra->priv;
> +   struct epitem *epi = NULL;
> +   struct rb_node *rbp;
> +
> +   mutex_lock(&ep->mtx);
> +   for (rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp)) {
> +   if (num-- == 0) {
> +   epi = rb_entry(rbp, struct epitem, rbn);
> +   fdinfo->fd = epi->ffd.fd;
> +   fdinfo->ev = epi->event;
> +   break;

This will be incredibly slow. epoll was designed to scale to tens of
thousands of file descriptors. This algorithm is O(N^2) because each
time we show a new epoll item we walk through the whole rb tree again
(we're not doing a search so it isn't O(NlogN)).

Also, we could miss one or more later items if one of the earlier
items is removed from the epoll set in between "seq_lookup_fdinfo"
calls. This isn't a problem for checkpoint because we assume the task
(and everything with this eventpoll file in its fd table) is frozen.
However it means the file will be worse than useless for almost any
other purpose, because other users are unlikely to realize they need to
freeze all the task(s) to get consistent data.
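
One way to at least kill the O(N^2) behaviour would be to remember where
the walk left off instead of restarting from rb_first() every time.
Roughly (a sketch only; 'ep_iter_cache' is a made-up helper, and the
cached node is only trustworthy under the same freeze-everything
assumption discussed above):

/* Cache the last rb_node and its ordinal position so the common
 * forward iteration costs O(1) per item instead of O(N).
 * Caller must hold ep->mtx. */
struct ep_iter_cache {
	struct rb_node *node;	/* last node returned */
	loff_t pos;		/* its position in the walk */
};

static struct rb_node *ep_next_cached(struct eventpoll *ep,
				      struct ep_iter_cache *c, loff_t num)
{
	struct rb_node *rbp;
	loff_t i;

	if (c->node && num == c->pos + 1)
		rbp = rb_next(c->node);			/* fast path */
	else
		for (i = 0, rbp = rb_first(&ep->rbr);
		     rbp && i < num; i++)
			rbp = rb_next(rbp);		/* slow rewind */

	if (rbp) {
		c->node = rbp;
		c->pos = num;
	}
	return rbp;
}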

Cheers,
-Matt Helsley


Re: [PATCH 1/4] add task handling notifier: base definitions

2008-01-08 Thread Matthew Helsley
On Thu, 2007-12-20 at 13:12 +, Jan Beulich wrote:
> This is the base patch, adding notification for task creation and
> deletion.
> 
> Signed-off-by: Jan Beulich <[EMAIL PROTECTED]>
> ---
>  include/linux/sched.h |8 +++-
>  kernel/fork.c |   11 +++
>  2 files changed, 18 insertions(+), 1 deletion(-)
> 
> --- 2.6.24-rc5-notify-task.orig/include/linux/sched.h
> +++ 2.6.24-rc5-notify-task/include/linux/sched.h
> @@ -80,7 +80,7 @@ struct sched_param {
>  #include 
>  #include 
>  #include 
> -
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -1700,6 +1700,12 @@ extern int do_execve(char *, char __user
>  extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
>  struct task_struct *fork_idle(int);
> 
> +#define TASK_NEW 1
> +#define TASK_DELETE 2
> +
> +extern struct blocking_notifier_head task_notifier_list;
> +extern struct atomic_notifier_head atomic_task_notifier_list;
> +
>  extern void set_task_comm(struct task_struct *tsk, char *from);
>  extern void get_task_comm(char *to, struct task_struct *tsk);
> 
> --- 2.6.24-rc5-notify-task.orig/kernel/fork.c
> +++ 2.6.24-rc5-notify-task/kernel/fork.c
> @@ -46,6 +46,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -71,6 +72,11 @@ DEFINE_PER_CPU(unsigned long, process_co
> 
>  __cacheline_aligned DEFINE_RWLOCK(tasklist_lock);  /* outer */
> 
> +BLOCKING_NOTIFIER_HEAD(task_notifier_list);
> +EXPORT_SYMBOL_GPL(task_notifier_list);
> +ATOMIC_NOTIFIER_HEAD(atomic_task_notifier_list);
> +EXPORT_SYMBOL_GPL(atomic_task_notifier_list);
> +

When these global notifier lists were proposed years ago, folks at SGI
loudly objected, citing anticipated cache-line bouncing on 512+ CPU
machines. Is that no longer a concern?

>  int nr_processes(void)
>  {
>   int cpu;
> @@ -121,6 +127,9 @@ void __put_task_struct(struct task_struc
>   WARN_ON(atomic_read(&tsk->usage));
>   WARN_ON(tsk == current);
> 
> + atomic_notifier_call_chain(&atomic_task_notifier_list,
> +TASK_DELETE, tsk);
> +
>   security_task_free(tsk);
>   free_uid(tsk->user);
>   put_group_info(tsk->group_info);

Would the atomic notifier call chain be necessary if you hooked into an
earlier section of do_exit() instead?
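
E.g. hooking the same task_notifier_list from this patch at a point in
do_exit() where sleeping is still allowed. A sketch, assuming such a
call site exists early enough for the data the listeners need:

/* Called from process context early in do_exit(), so a blocking
 * (sleepable) chain suffices and the atomic chain in
 * __put_task_struct() becomes unnecessary. */
static inline void notify_task_exit(struct task_struct *tsk)
{
	blocking_notifier_call_chain(&task_notifier_list, TASK_DELETE, tsk);
}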

Cheers,
-Matt Helsley



Re: [PATCH 0/4] add task handling notifier

2008-01-08 Thread Matthew Helsley
On Tue, 2008-01-08 at 18:24 -0800, Matt Helsley wrote:
> On Sun, 2007-12-23 at 12:26 +, Christoph Hellwig wrote:
> > On Thu, Dec 20, 2007 at 01:11:24PM +, Jan Beulich wrote:
> > > With more and more sub-systems/sub-components leaving their footprint
> > > in task handling functions, it seems reasonable to add notifiers that
> > > these components can use instead of having them all patch themselves
> > > directly into core files.
> > 
> > I agree that we probably want something like this.  As do some others,
> > so we already had a few attempts at similar things.  The first one
> > is from SGI and called PAGG (http://oss.sgi.com/projects/pagg/) and also
> > includes allocating per-task data for its users.  Then also from SGI
> > there has been a simplified version called pnotify that's also available
> > from the website above.
> > 
> > Later Matt Helsley had something called "Task Watchers" which lwn has
> > an article on: http://lwn.net/Articles/208117/.
> 
> Apologies for the late reply -- I haven't had internet access for the
> last few weeks.
> 
> > For some reason neither ever made a lot of progress (performance
> > problems?).
> 
> Yeah. Some discussion on measuring the performance of Task Watchers:
> http://thread.gmane.org/gmane.linux.lse/4698
> 
> The requirements for Task Watchers were:
> 
> Allow sleeping in most/all notifier functions in these paths:
>   fork
>   exec
>   exit
>   change [re][ug]id
> No performance overhead
> One "chain" per path ("I only care about exec().")
> Easy to use
> Scales to large numbers of CPUs
> Useful to make most in-tree code more readable. Task Watchers took
> direct calls to these pieces of code out of the fork/exec/exit paths:
>   audit
>   semundo
>   cpusets
>   mempolicy
>   trace irqflags
>   lockdep
>   keys (for processes -- not for thread groups)
>   process events connector
> Useful for loadable modules
> 
> Performance overhead in microbenchmarks was measurable at around 1% (see
> the URL above). Overhead on benchmarks like kernbench, on the other hand,
> was within the noise margins (around 1.6%), so I couldn't determine the
> overhead there.
> 
> I never got the loadable module part completely working due to races
> between notifier functions and the module unload path. The solution to
> the races seemed to require adding more overhead to the notifier
> function paths (SRCU-like grace periods).
> 
> I stopped pushing the patch set because I hadn't found any new
> optimizations to offset the overheads while still meeting all the
> requirements and Andrew still felt that the "make it more readable"
> argument was not sufficient to justify its inclusion.

Oops. It's been nearly two years so I've forgotten exactly where Task
Watchers v2 was when I stopped pushing it. After a bit more searching I
found a more recent posting:
http://lkml.org/lkml/2006/12/14/384

And here's why I think the microbenchmark results improved to the point
there was a small performance improvement over mainline:
http://lkml.org/lkml/2006/12/19/124

I seem to recall kernbench was still too noisy to tell.

The patch allowing modules to register Task Watchers still isn't posted
there for the reasons I've already described.

Cheers,
-Matt Helsley



Re: [PATCH] State limits to safety of _safe iterators

2007-09-13 Thread Matthew Helsley
On Wed, 2007-09-12 at 18:01 -0700, Paul E. McKenney wrote:
> The _safe list iterators make a blanket statement about how they are
> safe against removal.  This patch, inspired by private conversations
> with people who unwisely but perhaps understandably took this blanket
> statement at its word, adds comments stating limits to this safety.
> 
> Signed-off-by: Paul E. McKenney <[EMAIL PROTECTED]>
> ---
> 
>  list.h |   42 ++
>  1 file changed, 42 insertions(+)
> 
> diff -urpNa -X dontdiff linux-2.6.22/include/linux/list.h linux-2.6.22-safedoc/include/linux/list.h
> --- linux-2.6.22/include/linux/list.h	2007-07-08 16:32:17.0 -0700
> +++ linux-2.6.22-safedoc/include/linux/list.h	2007-09-12 17:45:38.0 -0700
> @@ -472,6 +472,12 @@ static inline void list_splice_init_rcu(
>   * @pos: the &struct list_head to use as a loop cursor.
>   * @n:   another &struct list_head to use as temporary storage
>   * @head:the head for your list.
> + *
> + * Please note that this is safe only against removal by the code in

I'm not trying to be snarky, but how far should we go before expecting
folks to read the macros? Depending on the answer, you may also want to
mention that without additional code it's safe only against removal of
the list element at pos.
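
A userspace reduction of the pitfall, with a minimal list standing in
for list_head (a sketch, not kernel code): freeing pos inside the loop
is fine, but removing the element cached in n would leave the iterator
pointing at freed memory.

#include <stdlib.h>

struct node { struct node *next; int val; };

/* Analogue of list_for_each_safe: n caches pos->next before the body runs. */
#define for_each_safe(pos, n, head) \
	for ((pos) = (head); (pos) && ((n) = (pos)->next, 1); (pos) = (n))

int main(void)
{
	struct node *head = NULL, *p, *pos, *n;
	int i;

	for (i = 0; i < 3; i++) {	/* build a 3-element list */
		p = malloc(sizeof(*p));
		p->val = i;
		p->next = head;
		head = p;
	}

	for_each_safe(pos, n, head) {
		free(pos);	/* safe: n was saved before the free */
		/* but if the body removed/freed n itself (the next
		 * element), the iterator would advance into freed
		 * memory, exactly the limit Paul is documenting */
	}
	return 0;
}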



Cheers,
-Matt Helsley



Re: [PATCH 5/6] lockstat: human readability tweaks

2007-05-30 Thread Matthew Helsley
On Wed, 2007-05-30 at 14:49 +0200, Peter Zijlstra wrote:
> plain text document attachment (lockstat-output.patch)
> Present all this fancy new lock statistics information:
> 
> *warning, _wide_ output ahead*
> 
> (output edited for purpose of brevity)
> 
>  # cat /proc/lock_stat
> lock_stat version 0.1
> ---------------------------------------------------------------------------------------------------------------------------------------
>   class name    contentions   waittime-min   waittime-max   waittime-total   acquisitions   holdtime-min   holdtime-max   holdtime-total
> ---------------------------------------------------------------------------------------------------------------------------------------



> 'contentions' and 'acquisitions' are the number of such events measured
> (since the last reset). The waittime- and holdtime- (min, max, total)
> numbers are presented in microseconds.

I think it would make sense to actually mention the time scale in the
output header someplace. Then a tool written to analyze this file will
have a way of determining the time scale without using error-prone
heuristics (like "kernel version foo uses microseconds while kernel foo
+ 100 uses nanoseconds").
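
E.g. one extra line in the banner would be enough for tools to key off
(format hypothetical):

lock_stat version 0.1
time units: microseconds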



Cheers,
-Matt Helsley



Re: Any faster and more efficient way to repeatedly access /proc/*

2007-03-08 Thread Matthew Helsley
On Thu, 2007-03-08 at 16:55 -0800, [EMAIL PROTECTED] wrote:
> Hi,
> 
> Is there a faster way to access "/proc/*" other than open it as a file and 
> reading/parsing contents? e.g. fopen("/proc/stat", "r");
> 
> In BSD, there is the kvm method of access, which is relatively fast (light 
> weight)
> 
> In Linux, if I have a daemon that keeps track of these statistics, it's a
> hell of a way to manage.
> 
> Imagine having to probe the stat of each process?

Have you looked at Task Stats (CONFIG_TASKSTATS)?

Cheers,
-Matt Helsley



Re: [RFC][PATCH][1/4] RSS controller setup

2007-02-19 Thread Matthew Helsley
On Mon, 2007-02-19 at 16:43 +0530, Balbir Singh wrote:
> Paul Menage wrote:
> > On 2/19/07, Andrew Morton <[EMAIL PROTECTED]> wrote:



> > Hmm, I don't appear to have documented this yet, but I think a good
> > naming scheme for container files is <subsystem>.<whatever> - i.e.
> > these should be memctlr.usage and memctlr.limit. The existing
> > grandfathered Cpusets names violate this, but I'm not sure there's a
> > lot we can do about that.
> > 
> 
> Why <subsystem>.<whatever>? Dots are harder to parse using regular
> expressions and sound DOS'ish. I'd prefer "_" to separate the
> subsystem and whatever :-)

"_" is useful for names with "spaces". Names like mem_controller. "."
seems reasonable despite its regex nastyness. Alternatively there's
always ":".



Cheers,
-Matt Helsley



Re: [PATCH 0/5] SUBCPUSETS: a resource control functionality using CPUSETS

2005-09-09 Thread Matthew Helsley
On Fri, 2005-09-09 at 13:12 +0900, Magnus Damm wrote:
> On 9/9/05, KUROSAWA Takahiro <[EMAIL PROTECTED]> wrote:
> > On Thu, 8 Sep 2005 05:02:32 -0700
> > Paul Jackson <[EMAIL PROTECTED]> wrote:
> > > One of my passions is to avoid special cases across API boundaries.
> > >
> > > I am proposing that you don't do subcpusets like this.
> > >
> > > Consider the following alternative I will call 'cpuset meters'.
> > >
> > > For each resource named 'R' (cpu and mem, for instance):
> > >  * Add a boolean flag 'meter_R' to each cpuset.  If set, that R is
> > >metered, for the tasks in that cpuset or any descendent cpuset.
> > >  * If a cpuset is metered, then files named meter_R_guar, meter_R_lim
> > >and meter_R_cur appear in that cpuset to manage R's usage by tasks
> > >in that cpuset and descendents.
> > >  * There are no additional rules that restrict the ability to change
> > >various other cpuset properties such as cpus, mems, cpu_exclusive,
> > >mem_exclusive, or notify_on_release, when a cpuset is metered.
> > >  * It might be that some (or by design all) resource controllers do
> > >not allow nesting metered cpusets.  I don't know.  But one should
> > >(if one has permission) be able to make child cpusets of a metered
> > >cpuset, just like one can of any other cpuset.
> > >  * A metered cpuset might well have cpus or mems that are not the
> > >same as its parent, just like an unmetered cpuset ordinarly does.
> > 
> > Jackson-san's idea looks good to me because users don't need
> > to create special cpusets (parents of subcpusets or subcpusets).
> > From the users' point of view, maybe they wouldn't like to create
> > special cpusets.
> 
> Yes, from the user POV it must be good to keep the hierarchical model.
> Ckrm and cpusets both provide a tree with descendents, children and
> parents. This hierarchical model is very nice IMO and provides a
> powerful API for the user.
> 
> > As for the resource controller that I've posted, it assumes
> > that there are groups of tasks that share the same cpumasks/nodemasks,
> > and that there is no hierarchy in those groups, in order to make
> > things easier.  I'll investigate how I can attach the resource
> > controller to the cpuset meters.
> 
> Subcpusets, compared to cpusets and ckrm, give the user a flat model.
> No hierarchy. Limited functionality compared to the hierarchical model.
> 
> But what I think is important to keep in mind here is that cpusets and
> subcpusets do very different things. If I understand cpusets
> correctly, each cpuset may share processors or memory nodes with other
> cpusets. One task running on a shared processor may starve other
> cpusets using the same processor. This design works well with cpusets,
> but for resource controllers that must provide some kind of guarantee,
> this starvation is unsuitable.
> 
> And we already have a hierarchical alternative: ckrm. But look at the
> complexity and the amount of code. I believe that the complexity in
> ckrm mainly comes from the hierarchical model.

I've been trying to find ways to significantly reduce CKRM's kernel
footprint. I recently posted an RFC patch to CKRM-Tech describing a
5000-line reduction:
http://sourceforge.net/mailarchive/forum.php?thread_id=8132624&forum_id=35191
Feedback on the approach presented in the RFC patch would be
appreciated.

> Maybe it is possible to have an hierarchical model and keep the
> framework simple and easy to understand while providing guarantees,
> I'm not sure. But until that happens, I'm quite happy with a simple,
> limited flat model.
> 
> / magnus



Re: [RFC] Cleanup line-wrapping in pgtable.h

2005-08-22 Thread Matthew Helsley
On Wed, 2005-08-17 at 12:45 -0500, Adam Litke wrote:
> The line-wrapping in most of the include/asm/pgtable.h pte test/set
> macros looks horrible in my 80 column terminal.  The following "test the
> waters" patch is how I would like to see them laid out.  I realize that
> the braces don't adhere to CodingStyle but the advantage is (when taking
> wrapping into account) that the code takes up no additional space.  How
> do people feel about making this change?  Any better suggestions?  I
> personally wouldn't like a lone closing brace like normal functions
> because of the extra lines eaten.  I volunteer to patch up the other
> architectures if we reach a consensus.
> 
> Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
> 
>  pgtable.h |   51 ++-
>  1 files changed, 34 insertions(+), 17 deletions(-)
> diff -upN reference/include/asm-i386/pgtable.h current/include/asm-i386/pgtable.h
> --- reference/include/asm-i386/pgtable.h
> +++ current/include/asm-i386/pgtable.h
> @@ -215,28 +215,45 @@ extern unsigned long pg0[];
>   * The following only work if pte_present() is true.
>   * Undefined behaviour if not..
>   */
> -static inline int pte_user(pte_t pte)	{ return (pte).pte_low & _PAGE_USER; }
> -static inline int pte_read(pte_t pte)	{ return (pte).pte_low & _PAGE_USER; }
> -static inline int pte_dirty(pte_t pte)	{ return (pte).pte_low & _PAGE_DIRTY; }
> -static inline int pte_young(pte_t pte)	{ return (pte).pte_low & _PAGE_ACCESSED; }
> -static inline int pte_write(pte_t pte)	{ return (pte).pte_low & _PAGE_RW; }
> +static inline int pte_user(pte_t pte)
> + { return (pte).pte_low & _PAGE_USER; }
> +static inline int pte_read(pte_t pte)
> + { return (pte).pte_low & _PAGE_USER; }
> +static inline int pte_dirty(pte_t pte)
> + { return (pte).pte_low & _PAGE_DIRTY; }
> +static inline int pte_young(pte_t pte)
> + { return (pte).pte_low & _PAGE_ACCESSED; }
> +static inline int pte_write(pte_t pte)
> + { return (pte).pte_low & _PAGE_RW; }

I think removing the whitespace preceding the opening braces is closer
to CodingStyle, allows for longer lines in the future (however gross
they may be), and does not alter the vertical space consumed on your
display.
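
I.e. something like (a sketch of the layout I mean):

static inline int pte_user(pte_t pte)
{ return (pte).pte_low & _PAGE_USER; }

The brace lands in column 0 where CodingStyle expects it, and the code
still takes no extra vertical space.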

>  /*
>   * The following only works if pte_present() is not true.
>   */
> -static inline int pte_file(pte_t pte)	{ return (pte).pte_low & _PAGE_FILE; }
> +static inline int pte_file(pte_t pte)
> + { return (pte).pte_low & _PAGE_FILE; }
>  
> -static inline pte_t pte_rdprotect(pte_t pte)	{ (pte).pte_low &= ~_PAGE_USER; return pte; }
> -static inline pte_t pte_exprotect(pte_t pte)	{ (pte).pte_low &= ~_PAGE_USER; return pte; }
> -static inline pte_t pte_mkclean(pte_t pte)	{ (pte).pte_low &= ~_PAGE_DIRTY; return pte; }
> -static inline pte_t pte_mkold(pte_t pte)	{ (pte).pte_low &= ~_PAGE_ACCESSED; return pte; }
> -static inline pte_t pte_wrprotect(pte_t pte)	{ (pte).pte_low &= ~_PAGE_RW; return pte; }

As long as this file is being polished, you might move pte_wrprotect()
up so it is between pte_rdprotect() and pte_exprotect().

> -static inline pte_t pte_mkread(pte_t pte)	{ (pte).pte_low |= _PAGE_USER; return pte; }
> -static inline pte_t pte_mkexec(pte_t pte)	{ (pte).pte_low |= _PAGE_USER; return pte; }
> -static inline pte_t pte_mkdirty(pte_t pte)	{ (pte).pte_low |= _PAGE_DIRTY; return pte; }
> -static inline pte_t pte_mkyoung(pte_t pte)	{ (pte).pte_low |= _PAGE_ACCESSED; return pte; }
> -static inline pte_t pte_mkwrite(pte_t pte)	{ (pte).pte_low |= _PAGE_RW; return pte; }
> -static inline pte_t pte_mkhuge(pte_t pte)	{ (pte).pte_low |= _PAGE_PRESENT | _PAGE_PSE; return pte; }
> +static inline pte_t pte_rdprotect(pte_t pte)
> + { (pte).pte_low &= ~_PAGE_USER; return pte; }
> +static inline pte_t pte_exprotect(pte_t pte)
> + { (pte).pte_low &= ~_PAGE_USER; return pte; }
> +static inline pte_t pte_mkclean(pte_t pte)
> + { (pte).pte_low &= ~_PAGE_DIRTY; return pte; }
> +static inline pte_t pte_mkold(pte_t pte)
> + { (pte).pte_low &= ~_PAGE_ACCESSED; return pte; }
> +static inline pte_t pte_wrprotect(pte_t pte)
> + { (pte).pte_low &= ~_PAGE_RW; return pte; }
> +static inline pte_t pte_mkread(pte_t pte)
> + { (pte).pte_low |= _PAGE_USER; return pte; }
> +static inline pte_t pte_mkexec(pte_t pte)
> + { (pte).pte_low |= _PAGE_USER; return pte; }
> +static inline pte_t pte_mkdirty(pte_t pte)
> + { (pte).pte_low |= _PAGE_DIRTY; return pte; }
> +static inline pte_t pte_mkyoung(pte_t pte)
> + { (pte).pte_low |= _PAGE_ACCESSED; return pte; }
> +static inline pte_t pte_mkwrite(pte_t pte)
> + { (pte).pte_low |= _PAGE_RW; return pte; }
> +static inline pte_t pte_mkhuge(pte_t pte)
> + { (pte).pte_low |= _PAGE_PRESENT | _PAGE_PSE; return pte; }
>  
>  #ifdef CONFIG_X86_PAE
>  # include 

Cheers,
-Matt

Re: [ckrm-tech] Re: 2.6.13-rc3-mm1 (ckrm)

2005-07-22 Thread Matthew Helsley
On Fri, 2005-07-22 at 20:23 -0400, Mark Hahn wrote:
> > > actually, let me also say that CKRM is on a continuum that includes 
> > > current (global) /proc tuning for various subsystems, ulimits, and 
> > > at the other end, Xen/VMM's.  it's conceivable that CKRM could wind up
> > > being useful and fast enough to subsume the current global and per-proc
> > > tunables.  after all, there are MANY places where the kernel tries to 
> > > maintain some sort of context to allow it to tune/throttle/readahead
> > > based on some process-linked context.  "embracing and extending"
> > > those could make CKRM attractive to people outside the mainframe market.
> > 
> > Seems like an excellent suggestion to me! Yeah, it may be possible to
> > maintain the context the kernel keeps on a per-class basis instead of
> > globally or per-process. 
> 
> right, but are the CKRM people ready to take this on?  for instance,
> I just grepped 'throttle' in kernel/mm and found a per-task RM in 
> page-writeback.c.  it even has a vaguely class-oriented logic, since
> it exempts RT tasks.  if CKRM can become a way to make this stuff 
> cleaner and more effective (again, for normal tasks), then great.
> but bolting on a big new different, intrusive mechanism that slows
> down all normal jobs by 3% just so someone can run 10K mostly-idle
> guests on a giant Power box, well, that's gross.
> 
> > The real question is what constitutes a useful
> > "extension" :).
> 
> if CKRM is just extensions, I think it should be an external patch.
> if it provides a path towards unifying the many disparate RM mechanisms
> already in the kernel, great!

OK, so if it provides a path towards unifying these, what should happen
to the old interfaces when they conflict with those offered by CKRM?

For instance, I'm considering how a per-class (re)nice setting would
work. What should happen when the user (re)nices a process to a
different value than the nice of the process' class? Should CKRM:

a) disable the old interface by
i) removing it
ii) returning an error when CKRM is active
iii) returning an error when CKRM has specified a nice value for the
process via membership in a class
iv) returning an error when the (re)nice value is inconsistent with the
nice value assigned to the class

b) trust the user, ignore the class nice value, and allow the new nice
value

I'd be tempted to do a.iv but it would require some modifications to a
system call. b probably wouldn't require any modifications to non-CKRM
files/dirs. 
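
For concreteness, a.iv might look something like the sketch below. All
the class-side names here (struct ckrm_class, ckrm_task_class(),
ckrm_class_has_nice(), ckrm_class_nice()) are hypothetical, not actual
CKRM API:

/* Hypothetical check for option a.iv, called from the setpriority()
 * path: reject a nice value that contradicts the task's class. */
static int check_class_nice(struct task_struct *p, int niceval)
{
	struct ckrm_class *cls = ckrm_task_class(p);	/* hypothetical */

	if (!cls || !ckrm_class_has_nice(cls))		/* hypothetical */
		return 0;	/* no class nice: ordinary renice rules */

	if (niceval != ckrm_class_nice(cls))		/* hypothetical */
		return -EPERM;	/* a.iv: inconsistent -> error */

	return 0;
}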

This sort of question would probably come up for any other CKRM
"embraced-and-extended" tunables. Should they use the answer to this
one, or would it go on a case-by-case basis?

Thanks,
-Matt Helsley



Re: [ckrm-tech] Re: 2.6.13-rc3-mm1 (ckrm)

2005-07-22 Thread Matthew Helsley
On Fri, 2005-07-22 at 12:35 -0400, Mark Hahn wrote:


> actually, let me also say that CKRM is on a continuum that includes 
> current (global) /proc tuning for various subsystems, ulimits, and 
> at the other end, Xen/VMM's.  it's conceivable that CKRM could wind up
> being useful and fast enough to subsume the current global and per-proc
> tunables.  after all, there are MANY places where the kernel tries to 
> maintain some sort of context to allow it to tune/throttle/readahead
> based on some process-linked context.  "embracing and extending"
> those could make CKRM attractive to people outside the mainframe market.

Seems like an excellent suggestion to me! Yeah, it may be possible to
maintain the context the kernel keeps on a per-class basis instead of
globally or per-process. The real question is what constitutes a useful
"extension" :).

I was thinking that per-class nice values might be a good place to
start as well. One advantage of per-class as opposed to per-process nice
is that the class is less transient than the process, since its lifetime
is determined solely by the system administrator.

CKRM calls this kind of module a "resource controller". There's a small
HOWTO on writing resource controllers here:
http://ckrm.sourceforge.net/ckrm-controller-howto.txt
If anyone wants to investigate writing such a controller please feel
free to ask questions or send HOWTO feedback on the CKRM-Tech mailing
list.

Thanks,
-Matt Helsley



Re: 2.6.13-rc3-mm1 (ckrm)

2005-07-21 Thread Matthew Helsley
On Sun, 2005-07-17 at 08:20 -0700, Paul Jackson wrote:


> It is somewhat intrusive in the areas it controls, such as some large
> ifdef's in kernel/sched.c.

I don't see the large ifdefs you're referring to in -mm's
kernel/sched.c.

> The sched hooks may well impact the cost of maintaining the sched code,
> which is always a hotbed of Linux kernel development.  However others
> who work in that area will have to speak to that concern.

I don't see the hooks you're referring to in the -mm scheduler.

> I tried just now to read through the ckrm hooks in fork, to see
> what sort of impact they might have on scalability on large systems.
> But I gave up after a couple layers of indirection.  I saw several
> atomic counters and a couple of spinlocks that I suspect (not at all
> sure) lay on the fork main code path.  I'd be surprised if this didn't
> impact scalability.  Earlier, according to my notes, I saw mention of
> lmbench results in the OLS 2004 slides, indicating a several percent
> cost of available cpu cycles.

The OLS2004 slides are roughly 1 year old. Have you looked at more
recent benchmarks posted on CKRM-Tech around April 15th 2005? They
should be available in the CKRM-Tech archives on SourceForge at
http://sourceforge.net/mailarchive/forum.php?thread_id=7025751&forum_id=35191

(OLS 2004 Slide 24 of
http://ckrm.sourceforge.net/downloads/ckrm-ols04-slides.pdf )

The OLS slide indicates that the overhead is generally less than
0.5usec compared to a total context switch time of anywhere from 2 to
5.5usec. There appears to be little difference in scalability, since
the overhead oscillates around a constant.



> vendor has a serious middleware software product that provides full
> CKRM support.  Acceptance of CKRM would be easier if multiple competing
> middleware vendors were using it.  It is also a concern that CKRM
> is not really usable for its primary intended purpose except if it
> is accompanied by this corresponding middleware, which I presume is

The Rule-Based Classification Engine (RBCE) makes CKRM useful without
middleware. It uses a table of rules to classify tasks. For example
rules that would classify shells:

echo 'path=/bin/bash,class=/rcfs/taskclass/shells' > /rcfs/ce/rules/classify_bash_shells
echo 'path=/bin/tcsh,class=/rcfs/taskclass/shells' > /rcfs/ce/rules/classify_tcsh_shells
..

And class shares would control the fork rate of those shells:

echo 'res=numtasks,forkrate=1,forkrate_interval=1' > '/rcfs/taskclass/config'
echo 'res=numtasks,guarantee=1000,limit=5000' > '/rcfs/taskclass/shells'

No middleware necessary.

 

> CKRM is in part a generalization and descendent of what I call fair
> share schedulers.  For example, the fork hooks for CKRM include a
> forkrates controller, to slow down the rate of forking of tasks using
> too many resources.
> 
> No doubt the CKRM experts are already familiar with these, but for
> the possible benefit of other readers:
> 
>   UNICOS Resource Administration - Chapter 4. Fair-share Scheduler
>   
> http://oscinfo.osc.edu:8080/dynaweb/all/004-2302-001/@Generic__BookTextView/22883
> 
>   SHARE II -- A User Administration and Resource Control System for UNIX
>   http://www.c-side.com/c/papers/lisa-91.html
> 
>   Solaris Resource Manager White Paper
>   http://wwws.sun.com/software/resourcemgr/wp-mixed/
> 
>   ON THE PERFORMANCE IMPACT OF FAIR SHARE SCHEDULING
>   http://www.cs.umb.edu/~eb/goalmode/cmg2000final.htm
> 
>   A Fair Share Scheduler, J. Kay and P. Lauder
>   Communications of the ACM, January 1988, Volume 31, Number 1, pp 44-55.
> 
> The documentation that I've noticed (likely I've missed something)
> doesn't do an adequate job of making the case - providing the
> motivation and context essential to understanding this patch set.

The choice of algorithm is entirely up to the scheduler, memory
allocator, etc. CKRM currently provides an interface for reading share
values and does not impose any meaning on those shares -- that is the
role of the scheduler.

> Because CKRM provides an infrastructure for multiple controllers
> (limiting forks, memory allocation and network rates) and multiple
> classifiers and policies, its critical interfaces have rather
> generic and abstract names.  This makes it difficult for others to
> approach CKRM, reducing the rate of peer review by other Linux kernel
> developers, which is perhaps the key impediment to acceptance of CKRM.
> If anything, CKRM tends to be a little too abstract.

Generic and abstract names are appropriate for infrastructure that is
not tied to hardware. If you could be more specific I'd be able to
respond in less general and abstract terms.



> My notes from many months ago indicate something about a 128 CPU
> limit in CKRM.  I don't know why, nor if it still applies.  It is
> certainly a smaller limit than the systems I care about.

I haven't seen this limitation in the CKRM patches that went into -mm
and I'd like to look i