Re: [PATCH] Memory Resource Controller Add Boot Option

2008-02-25 Thread Hirokazu Takahashi
Hi,

> >>> I'll send out a prototype for comment.
> > 
> > Something like the patch below. The effects of cgroup_disable=foo are:
> > 
> > - foo doesn't show up in /proc/cgroups
> 
> Or we could print out the disable flag; maybe that would be better,
> because then we can distinguish between "disabled" and "not compiled in"
> from /proc/cgroups.

It would be neat if the disable flag in /proc/cgroups could be cleared/set
on demand. Whether that works will depend on the implementation of each
controller.
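
For example, a controller could check its own flag at the points where it
would otherwise build up state. A minimal sketch (hypothetical function,
not the actual memory-controller code, and assuming the patch below is
applied):

/* Hypothetical illustration only: skip per-page accounting while the
 * subsystem is marked disabled.  Whether flipping the flag back on at
 * run time is safe depends on how much state the controller skipped
 * building while it was off. */
static int mem_cgroup_charge_sketch(struct page *page, struct mm_struct *mm)
{
	if (mem_cgroup_subsys.disabled)
		return 0;	/* behave as if no controller were present */

	/* ... normal path: look up the cgroup and update its counters ... */
	return 0;
}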

> > - foo isn't auto-mounted if you mount all cgroups in a single hierarchy
> > - foo isn't visible as an individually mountable subsystem
> 
> You mentioned in a previous mail that if we mount a disabled subsystem we
> will get an error. Here we just ignore the mount option. Which makes
> more sense?
> 
> > 
> > As a result there will only ever be one call to foo->create(), at init
> > time; all processes will stay in this group, and the group will never be
> > mounted on a visible hierarchy. Any additional effects (e.g. not
> > allocating metadata) are up to the foo subsystem.
> > 
> > This doesn't handle early_init subsystems (their "disabled" bit isn't
> > set), but it could easily be extended to do so if any of the
> > early_init subsystems wanted it - I think it would just involve some
> > nastier parameter processing, since it would occur before the
> > command-line argument parser had been run.
> > 
> > include/linux/cgroup.h |1 +
> > kernel/cgroup.c|   29 +++--
> > 2 files changed, 28 insertions(+), 2 deletions(-)
> > 
> > Index: cgroup_disable-2.6.25-rc2-mm1/include/linux/cgroup.h
> > ===
> > --- cgroup_disable-2.6.25-rc2-mm1.orig/include/linux/cgroup.h
> > +++ cgroup_disable-2.6.25-rc2-mm1/include/linux/cgroup.h
> > @@ -256,6 +256,7 @@ struct cgroup_subsys {
> >  	void (*bind)(struct cgroup_subsys *ss, struct cgroup *root);
> >  	int subsys_id;
> >  	int active;
> > +	int disabled;
> >  	int early_init;
> >  #define MAX_CGROUP_TYPE_NAMELEN 32
> >  	const char *name;
> > Index: cgroup_disable-2.6.25-rc2-mm1/kernel/cgroup.c
> > ===
> > --- cgroup_disable-2.6.25-rc2-mm1.orig/kernel/cgroup.c
> > +++ cgroup_disable-2.6.25-rc2-mm1/kernel/cgroup.c
> > @@ -790,7 +790,14 @@ static int parse_cgroupfs_options(char *
> >  		if (!*token)
> >  			return -EINVAL;
> >  		if (!strcmp(token, "all")) {
> > -			opts->subsys_bits = (1 << CGROUP_SUBSYS_COUNT) - 1;
> > +			/* Add all non-disabled subsystems */
> > +			int i;
> > +			opts->subsys_bits = 0;
> > +			for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
> > +				struct cgroup_subsys *ss = subsys[i];
> > +				if (!ss->disabled)
> > +					opts->subsys_bits |= 1ul << i;
> > +			}
> >  		} else if (!strcmp(token, "noprefix")) {
> >  			set_bit(ROOT_NOPREFIX, &opts->flags);
> >  		} else if (!strncmp(token, "release_agent=", 14)) {
> > @@ -808,7 +815,8 @@ static int parse_cgroupfs_options(char *
> >  			for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
> >  				ss = subsys[i];
> >  				if (!strcmp(token, ss->name)) {
> > -					set_bit(i, &opts->subsys_bits);
> > +					if (!ss->disabled)
> > +						set_bit(i, &opts->subsys_bits);
> >  					break;
> >  				}
> >  			}
> > @@ -2596,6 +2606,8 @@ static int proc_cgroupstats_show(struct
> >  	mutex_lock(&cgroup_mutex);
> >  	for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
> >  		struct cgroup_subsys *ss = subsys[i];
> > +		if (ss->disabled)
> > +			continue;
> >  		seq_printf(m, "%s\t%lu\t%d\n",
> >  			   ss->name, ss->root->subsys_bits,
> >  			   ss->root->number_of_cgroups);
> > @@ -2991,3 +3003,16 @@ static void cgroup_release_agent(struct
> >  	spin_unlock(&release_list_lock);
> >  	mutex_unlock(&cgroup_mutex);
> >  }
> > +
> > +static int __init cgroup_disable(char *str)
> > +{
> > +	int i;
> > +	for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
> > +		struct cgroup_subsys *ss = subsys[i];
> > +		if (!strcmp(str, ss->name)) {
> > +			ss->disabled = 1;
> > +			break;
> > +		}
> > +	}
> > +}
> > +__setup("cgroup_disable=", cgroup_disable);
> > 
> > 
> >>
> >> Sure thing, if css has the flag, then it would be nice. Could you wrap it
> >> up to say something like css_disabled(&mem_cgroup_subsys)?
> >>
> >>
> > 
> > It's the subsys object rather than the css (cgroup_subsys_state).
> > 
> >  We could have something like:
> > 
> > #define cgroup_subsys_disabled(_ss) ((_ss)->disabled)
> > 
> > but I don't see that
> >  cgroup_subsys_disabled(&mem_cgroup_subsys)
> > is better than just putting
> > 
> >  mem_cgroup_subsys.disabled
> > 
> > Paul
> > 
> > 
> 
> 
--

Re: [dm-devel] [PATCH 0/2] dm-band: The I/O bandwidth controller: Overview

2008-01-24 Thread Hirokazu Takahashi
Hi,

> > > On Wed, Jan 23, 2008 at 09:53:50PM +0900, Ryo Tsuruta wrote:
> > > > Dm-band gives bandwidth to each job according to its weight, 
> > > > which each job can set its own value to.
> > > > At this time, a job is a group of processes with the same pid or pgrp 
> > > > or uid.
> > > 
> > > It seems to rely on 'current' to classify bios and doesn't do it until 
> > > the map
> > > function is called, possibly in a different process context, so it won't
> > > always identify the original source of the I/O correctly:
> > 
> > Yes, this should be mentioned in the document with the current 
> > implementation
> > as you pointed out.
> > 
> > By the way, I think once a memory controller of cgroup is introduced, it 
> > will
> > help to track down which cgroup is the original source.
> 
> do you mean to make this a part of the memory subsystem?

I just think that if the memory subsystem is already in front of us, we don't
need to reinvent the wheel.

But I don't have a concrete image yet of how the interface between dm-band and
the memory subsystem should be designed. I'd appreciate it if some of the
cgroup developers could give some ideas about it.

Thanks,
Hirokazu Takahashi.


> YAMAMOTO Takashi

--


Re: [PATCH 0/2] dm-band: The I/O bandwidth controller: Overview

2008-01-24 Thread Hirokazu Takahashi
Hi,

> Hi,
> 
> I believe this work is very important especially in the context of 
> virtual machines.  I think it would be more useful though implemented in 
> the context of the IO scheduler.  Since we already support a notion of 
> IO priority, it seems reasonable to add a notion of an IO cap.

I agree that what you proposed is the most straightforward approach.
Ryo and I have also investigated the CFQ scheduler, and it should be possible
to enhance it to support bandwidth control with quite a few modifications.
I think both approaches have pros and cons.

At this time, we have chosen the device-mapper approach because:
 - it can work with any I/O scheduler. Some people will want to use the NOOP
   scheduler with high-end storage.
 - only people who need I/O bandwidth control have to use it.
 - it is independent of the I/O schedulers, so it will be easy to maintain.
 - it can keep the CFQ implementation simple.

The current CFQ scheduler has some limitations if you want to control the
bandwidth. The scheduler has only seven priority levels, which also means it
has only seven classes. If you assign the same io-priority A to several VMs
--- virtual machines ---, these machines have to share the I/O bandwidth which
is assigned to the io-priority A class. If another VM has io-priority B, which
is lower than io-priority A, and there is no other VM in the io-priority B
class, the VM in the io-priority B class may be able to use more bandwidth
than the VMs in the io-priority A class.

I guess two-level scheduling would have to be introduced in the CFQ scheduler
if needed: one level to choose the best cgroup or job, and the other to choose
the highest io-priority class.

There is another limitation: io-priority is global, so it affects all the
disks a job accesses. A job cannot use different io-priorities to access
different disks. I think a "per-disk io-priority" will be required.

But the device-mapper approach also has its bad points. It is hard to get the
capabilities and configuration of the underlying devices, such as information
about partitions or LUNs, so some configuration tools will probably be
required.

Thank you,
Hirokazu Takahashi.

> Regards,
> 
> Anthony Liguori
> 
> Ryo Tsuruta wrote:
> > Hi everyone,
> > 
> > I'm happy to announce that I've implemented a Block I/O bandwidth 
> > controller.
> > The controller is designed to be of use in a cgroup or virtual machine
> > environment. The current approach is that the controller is implemented as
> > a device-mapper driver.
> > 
> > What's dm-band all about?
> > 
> > Dm-band is an I/O bandwidth controller implemented as a device-mapper 
> > driver.
> > Several jobs using the same physical device have to share the bandwidth of
> > the device. Dm-band gives bandwidth to each job according to its weight, 
> > which each job can set its own value to.
> > 
> > At this time, a job is a group of processes with the same pid or pgrp or 
> > uid.
> > There is also a plan to make it support cgroup. A job can also be a virtual
> > machine such as KVM or Xen.
> > 
> >   +--+ +--+ +--+   +--+ +--+ +--+ 
> >   |cgroup| |cgroup| | the  |   | pid  | | pid  | | the  |  jobs
> >   |  A   | |  B   | |others|   |  X   | |  Y   | |others| 
> >   +--|---+ +--|---+ +--|---+   +--|---+ +--|---+ +--|---+   
> >   +--V+---V---+V---+   +--V+---V---+V---+   
> >   | group | group | default|   | group | group | default|  band groups
> >   |   |   |  group |   |   |   |  group | 
> >   +---+---++   +---+---++
> >   | band1  |   | band2  |  band devices
> >   +---|+   +---|+
> >   +---V--+-V+
> >   |  |  |
> >   |  sdb1|   sdb2   |  physical devices
> >   +--+--+
> > 
> > 
> > How dm-band works.
> > 
> > Every band device has one band group, which by default is called the default
> > group.
> > 
> > Band devices can also have extra band groups in them. Each band group
> > has a job to support and a weight. Proportional to the weight, dm-band gives
> > tokens to the group.
> > 
> > A group passes on I/O requests that its job issues to the underlying
> > layer so long as it has tokens left, while requests are blocked
> > if there aren't any tokens left in the group. One token is consumed each
> > time the group passes o
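
The token scheme described above can be modeled roughly as follows; this is a
toy user-space model with made-up weights and request counts, not dm-band
code, but it shows why the passed-request ratio ends up tracking the weight
ratio:

#include <stdio.h>

struct band_group {
	const char *name;
	int weight;
	int tokens;
	int passed, blocked;
};

/* hand out tokens_per_period tokens, split in proportion to each weight */
static void refill(struct band_group *g, int n, int tokens_per_period)
{
	int total = 0, i;

	for (i = 0; i < n; i++)
		total += g[i].weight;
	for (i = 0; i < n; i++)
		g[i].tokens = tokens_per_period * g[i].weight / total;
}

/* pass the request downward while tokens remain, otherwise hold it back */
static void submit_io(struct band_group *g)
{
	if (g->tokens > 0) {
		g->tokens--;
		g->passed++;
	} else {
		g->blocked++;
	}
}

int main(void)
{
	struct band_group grp[2] = {
		{ "group-A", 40, 0, 0, 0 },
		{ "group-B", 10, 0, 0, 0 },
	};
	int period, i;

	for (period = 0; period < 3; period++) {
		refill(grp, 2, 100);
		for (i = 0; i < 200; i++)	/* each group issues 100 requests */
			submit_io(&grp[i % 2]);
	}
	for (i = 0; i < 2; i++)
		printf("%s: passed %d, blocked %d\n",
		       grp[i].name, grp[i].passed, grp[i].blocked);
	return 0;
}

With weights 40 and 10, group-A ends up passing four times as many requests
as group-B, matching the 4:1 weight ratio.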

Re: [dm-devel] [PATCH 0/2] dm-band: The I/O bandwidth controller: Overview

2008-01-23 Thread Hirokazu Takahashi
Hi,

> On Wed, Jan 23, 2008 at 09:53:50PM +0900, Ryo Tsuruta wrote:
> > Dm-band gives bandwidth to each job according to its weight, 
> > which each job can set its own value to.
> > At this time, a job is a group of processes with the same pid or pgrp or 
> > uid.
> 
> It seems to rely on 'current' to classify bios and doesn't do it until the map
> function is called, possibly in a different process context, so it won't
> always identify the original source of the I/O correctly:

Yes, this should be mentioned in the document with the current implementation
as you pointed out.

By the way, I think once a memory controller of cgroup is introduced, it will
help to track down which cgroup is the original source.

> people need to take
> this into account when designing their group configuration and so this should
> be mentioned in the documentation.
>
> I've uploaded it here while we consider ways we might refine the architecture 
> and
> interfaces etc.:
> 
>   
> http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/dm-add-band-target.patch
> 
> Alasdair
> -- 
> [EMAIL PROTECTED]

Thank you,
Hirokazu Takahashi.
--


Re: [PATCH 0/5] SUBCPUSETS: a resource control functionality using CPUSETS

2005-09-09 Thread Hirokazu Takahashi
Hi,

> magnus wrote:
> > Maybe it is possible to have an hierarchical model and keep the
> > framework simple and easy to understand while providing guarantees,
> 
> Dinakar's patches to use cpu_exclusive cpusets to define dynamic
> sched domains accomplish something like this.
> 
> What scheduler domains and resource control domains both need
> are non-overlapping subsets of the CPUs and/or Memory Nodes.
> 
> In the case of sched domains, you normally want the subsets
> to cover all the CPUs.  You want every CPU to have exactly
> one scheduler that is responsible for its scheduling.
> 
> In the case of resource control domains, you perhaps don't
> care if some CPUs or Memory Nodes have no particular resources
> constraints defined for them.  In that case, every CPU and
> every Memory Node maps to _either_ zero or one resource control
> domain.
> 
> Either way, a 'flat model' non-overlapping partitioning of the
> CPUs and/or Memory Nodes can be obtained from a hierarchical
> model (nested sets of subsets) by selecting some of the subsets
> that don't overlap ;).  In /dev/cpuset, this selection is normally
> made by specifying another boolean file (contains '0' or '1')
> that controls whether that cpuset is one of the selected subsets.

What do you think about making cpusets for sched domains able to
have siblings, which have the same attributes and share
their resources among themselves?

I guess it would be simple.

Thanks,
Hirokazu Takahashi.


Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-08-24 Thread Hirokazu Takahashi
Hi,

> The following patch does not use MMX registers, so we don't have
> to worry about saving/restoring the FPU/MMX state.
> 
> What do you think?

I think __copy_user_zeroing_intel_nocache() should be followed by an sfence
or mfence instruction to flush the data.
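
For illustration, a user-space sketch of that point (this is not the kernel
patch; it uses movnti from a general register rather than MMX, the function
name is made up, and len is assumed to be a multiple of 4):

#include <stddef.h>

static void copy_nocache_sketch(void *dst, const void *src, size_t len)
{
	size_t i;

	for (i = 0; i < len; i += 4) {
		int tmp = *(const int *)((const char *)src + i);

		/* non-temporal store: bypasses the cache, weakly ordered */
		asm volatile("movnti %1, %0"
			     : "=m" (*(int *)((char *)dst + i))
			     : "r" (tmp));
	}

	/* drain the write-combining buffers so the stores are globally
	 * visible before anyone reads the destination */
	asm volatile("sfence" ::: "memory");
}

Without the final sfence, the non-temporal stores may still be sitting in the
write-combining buffers when the caller goes on to use the data.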




Re: math_state_restore() question

2005-08-17 Thread Hirokazu Takahashi
Hi,

Just take a look at __switch_to(), where __unlazy_fpu() is called.
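
Roughly speaking (paraphrased from memory, so check the actual source), the
save side looks like this in the 2.6 i386 code: __switch_to() calls
__unlazy_fpu() on the outgoing task, which saves the state only if that task
actually touched the FPU:

/* paraphrased outline, not a verbatim quote of include/asm-i386/i387.h */
#define __unlazy_fpu(tsk) do {					\
	if ((tsk)->thread_info->status & TS_USEDFPU)		\
		save_init_fpu(tsk); /* fnsave/fxsave, then clear TS_USEDFPU */ \
} while (0)

So the previous task's state is written back lazily at context-switch time,
and math_state_restore() only has to reload the incoming task's state when it
next uses the FPU.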

> Hi,
> 
> I have a quick question.
> 
> The math_state_restore() restores the FPU/MMX/XMM states.
> However where do we save the previous task's states if it is necessary?
> 
> asmlinkage void math_state_restore(struct pt_regs regs)
> {
> struct thread_info *thread = current_thread_info();
> struct task_struct *tsk = thread->task;
> 
> clts(); /* Allow maths ops (or we recurse) */
> if (!tsk_used_math(tsk))
> init_fpu(tsk);
> restore_fpu(tsk);
> thread->status |= TS_USEDFPU;   /* So we fnsave on switch_to() */
> }
> 
> Thanks in advance,
>   Hiro
> -- 
> Hiro Yoshioka
> mailto:hyoshiok at miraclelinux.com
> -




Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-08-16 Thread Hirokazu Takahashi
Hi,

> > > > My code does nothing do it.
> > > > 
> > > > I need a volunteer to implement it.
> > > 
> > > it's actually not too hard; all you need is to use SSE and not MMX; and
> > > then just store sse register you're overwriting on the stack or so...
> > 
> > oh, really? Does the linux kernel take care of
> > SSE save/restore on a task switch?
> 
> not on kernel entry afaik.
> However just save the register on the stack and put it back at the
> end...

I think this has to be done in the page-fault handlers.


Thanks,
Hirokazu Takahashi.


Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-08-16 Thread Hirokazu Takahashi
Hi

> > > My code does nothing do it.
> > > 
> > > I need a volunteer to implement it.
> > 
> > it's actually not too hard; all you need is to use SSE and not MMX; and
> > then just store sse register you're overwriting on the stack or so...
> 
> oh, really? Does the linux kernel take care of
> SSE save/restore on a task switch?

noop!


Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

2005-08-15 Thread Hirokazu Takahashi
Hi,

BTW, what are you going to do about the page faults which may happen
during __copy_user_zeroing_nocache()? The current process may be blocked
in the handler for a while and have its FPU registers polluted.
kernel_fpu_begin() won't help in that case. This is another issue, though.

> > Thanks.
> > 
> > filemap_copy_from_user() calls __copy_from_user_inatomic() calls
> > __copy_from_user_ll().
> > 
> > I'll look at the code.
> 
> The following is a quick hack of a cache-aware implementation
> of __copy_from_user_ll() and __copy_from_user_inatomic():
> 
> __copy_from_user_ll_nocache() and __copy_from_user_inatomic_nocache()
> 
> filemap_copy_from_user() calls __copy_from_user_inatomic_nocache()
> instead of __copy_from_user_inatomic(), which reduces cache misses.
> 
> The first column is the cache references (memory accesses) and the
> third column is the 3rd-level cache misses.
> 
> The following example shows that L3 cache misses are reduced from 37410 to 107.
> 
> 2.6.12.4 nocache version
> Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) 
> with a unit mask of 0x3f (multiple flags) count 3000
> Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) 
> with a unit mask of 0x200 (read 3rd level cache miss) count 3000
> samples  %        samples  %        app name   symbol name
> 120442   6.4106   107      0.5620   vmlinux    __copy_user_zeroing_nocache
> 80049    4.2606   578      3.0357   vmlinux    journal_add_journal_head
> 69194    3.6829   154      0.8088   vmlinux    journal_dirty_metadata
> 67059    3.5692   78       0.4097   vmlinux    __find_get_block
> 64145    3.4141   32       0.1681   vmlinux    journal_put_journal_head
> pattern9-0-cpu4-0-08161154/summary.out
> 
> The 2.6.12.4 original version is
> Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) 
> with a unit mask of 0x3f (multiple flags) count 3000
> Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) 
> with a unit mask of 0x200 (read 3rd level cache miss) count 3000
> samples  %        samples  %        app name   symbol name
> 120646   7.4680   37410    62.3355  vmlinux    __copy_from_user_ll
> 79508    4.9215   903      1.5046   vmlinux    _spin_lock
> 65526    4.0561   873      1.4547   vmlinux    journal_add_journal_head
> 59296    3.6704   129      0.2149   vmlinux    __find_get_block
> 58647    3.6302   215      0.3582   vmlinux    journal_dirty_metadata
> 
> What do you think?
> 
> Hiro
> 
> diff -ur linux-2.6.12.4.orig/Makefile linux-2.6.12.4.nocache/Makefile
> --- linux-2.6.12.4.orig/Makefile  2005-08-12 14:37:59.0 +0900
> +++ linux-2.6.12.4.nocache/Makefile   2005-08-16 10:22:31.0 +0900
> @@ -1,7 +1,7 @@
>  VERSION = 2
>  PATCHLEVEL = 6
>  SUBLEVEL = 12
> -EXTRAVERSION = .4.orig
> +EXTRAVERSION = .4.nocache
>  NAME=Woozy Numbat
>  
>  # *DOCUMENTATION*
> diff -ur linux-2.6.12.4.orig/arch/i386/lib/usercopy.c 
> linux-2.6.12.4.nocache/arch/i386/lib/usercopy.c
> --- linux-2.6.12.4.orig/arch/i386/lib/usercopy.c  2005-08-05 
> 16:04:37.0 +0900
> +++ linux-2.6.12.4.nocache/arch/i386/lib/usercopy.c   2005-08-16 
> 10:49:59.0 +0900
> @@ -10,6 +10,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
> @@ -511,6 +512,110 @@
>   : "memory");\
>  } while (0)
>  
> +/* Non Temporal Hint version of mmx_memcpy */
> +/* It is cache aware   */
> +/* [EMAIL PROTECTED]   */
> +static unsigned long 
> +__copy_user_zeroing_nocache(void *to, const void *from, size_t len)
> +{
> +/* Note! gcc doesn't seem to align stack variables properly, so we
> + * need to make use of unaligned loads and stores.
> + */
> + void *p;
> + int i;
> +
> + if (unlikely(in_interrupt())){
> + __copy_user_zeroing(to, from, len);
> + return len;
> + }
> +
> + p = to;
> + i = len >> 6; /* len/64 */
> +
> +kernel_fpu_begin();
> +
> + __asm__ __volatile__ (
> + "1: prefetchnta (%0)\n" /* This set is 28 bytes */
> + "   prefetchnta 64(%0)\n"
> + "   prefetchnta 128(%0)\n"
> + "   prefetchnta 192(%0)\n"
> + "   prefetchnta 256(%0)\n"
> + "2:  \n"
> + ".section .fixup, \"ax\"\n"
> + "3: movw $0x1AEB, 1b\n" /* jmp on 26 bytes */
> + "   jmp 2b\n"
> + ".previous\n"
> + ".section __ex_table,\"a\"\n"
> + "   .align 4\n"
> + "   .long 1b, 3b\n"
> + ".previous"
> + : : "r" (from) );
> + 
> + for(; i>5; i--)
> + {
> + __asm__ __volatile__ (
> + "1:  prefetchnta 320(%0)\n"
> + "2:  movq (%0), %%mm0\n"
> + "  movq 8(%0), %%mm1\n"
> + "  movq 16(%0), %%mm2\n"
> + "  movq 24(%0), %%m

Re: 2.6.13-rc3-mm1 (ckrm)

2005-07-18 Thread Hirokazu Takahashi
Hi,

> > What, in your opinion, makes it "obviously unmergeable"?

Controlling resource assignment: I think that concept is good.
But the design is another matter; the current CKRM seems somewhat
overengineered.

> I suspect that the main problem is that this patch is not a mainstream
> kernel feature that will gain multiple uses, but rather provides
> support for a specific vendor middleware product used by that
> vendor and a few closely allied vendors.  If it were smaller or
> less intrusive, such as a driver, this would not be a big problem.
> That's not the case.

I believe this feature would also make desktop users happier -- controlling
X-server, mpeg player, video capturing and all that -- if the code
becomes much simpler and easier to use.

> A major restructuring of this patch set could be considered.  This
> might involve making the metric tools (that monitor memory, fork
> and network usage rates per task) separate patches useful for other
> purposes.  It might also make the rate limiters in fork, alloc and
> network i/o separately useful patches.  I mean here genuinely useful
> and understandable in their own right, independent of some abstract
> CKRM framework.

That makes sense.

> Though hints have been dropped, I have not seen any public effort to
> integrate CKRM with either cpusets or scheduler domains or process
> accounting.  By this I don't mean recoding cpusets using the CKRM
> infrastructure; that proposal received _extensive_ consideration
> earlier, and I am as certain as ever that it made no sense.  Rather I
> could imagine the CKRM folks extending cpusets to manage resources
> on a per-cpuset basis, not just on a per-task or task class basis.
> Similarly, it might make sense to use CKRM to manage resources on
> a per-sched domain basis, and to integrate the resource tracking
> of CKRM with the resource tracking needs of system accounting.

From the users' standpoint, CKRM and CPUSETS should be managed
seamlessly through the same interface, though I'm not sure yet whether
your idea is the best.


Thanks,
Hirokazu Takahashi.


Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

2005-02-04 Thread Hirokazu Takahashi
Hi,

> > Hi Eric,
> > 
> > > > Hi Vivek and Eric,
> > > > 
> > > > IMHO, why don't we swap not only the contents of the top 640K
> > > > but also kernel working memory for kdump kernel?
> > > > 
> > > > I guess this approach has some good points.
> > > > 
> > > >  1.Preallocating reserved area is not mandatory at boot time.
> > > >And the reserved area can be distributed in small pieces
> > > >like original kexec does.
> > > > 
> > > >  2.Special linking is not required for kdump kernel.
> > > >Each kdump kernel can be linked in the same way,
> > > >where the original kernel exists.
> > > > 
> > > > Am I missing something?
> > > 
> > > Preallocating the reserved area is largely to keep it from
> > > being the target of DMA accesses.  Since we are not able
> > > to shutdown any of the drivers in the primary kernel running
> > > in a normal swath of memory sounds like a good way to get
> > > yourself stomped at the worst possible time.
> > 
> > So what do you think my another idea?
> 
> I have proposed it.  I think ia64 already does that.
> It has been pointed out that the PowerPC kernel occasionally runs
> with the mmu turned off. So it is not a technique that is 100%
> portable.

I see you have.
And MIPS CPUs don't allow kernel pages to be remapped either.

> > I think we can always make a kdump kernel mapped to the same virtual
> > address. So we will be free from caring about the physical address
> > where the kdump kernel is loaded.
> > 
> > I believe the memsection functionality which LHMS project is working
> > on would help this.
> 
> You don't need anything fancy except to build the page tables
> during bootup.  However there are a few potential gotchas
> with respect to using large pages, that can give 4MiB or
> greater alignment restrictions on the kernel.  Code wise
> the gotcha is moving the kernel's .text section into what
> is essentially the vmalloc portion of the address space.
> For x86_64 the kernels virtual address is already decoupled from the
> physical addresses, so it is probably easier.

I know we can place the kernel at any address, though there
are some exceptions.

I know mapping kernel pages to the same virtual address only helps
to avoid worrying about physical addresses or vmalloc'ed addresses
when linking the kernel. I think it wouldn't be a bad idea on many
architectures. I prefer it to linking the kernel separately for each
system.

> Most of this just results in easier management between the pieces.
> Which is a good thing.  However at the moment I don't think it
> simplifies any of the core problems.  I still need to reserve
> a large hunk of physical address space early on before any
> DMA transactions are setup to hold the new kernel.

I agree that my idea is not essential at the moment.

> So while I am happy to see patches that improve this I don't
> actually care right now.

ok.

> Eric
> 

Thanks,
Hirokazu Takahashi.


Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

2005-02-03 Thread Hirokazu Takahashi
Hi Eric,

> > Hi Vivek and Eric,
> > 
> > IMHO, why don't we swap not only the contents of the top 640K
> > but also kernel working memory for kdump kernel?
> > 
> > I guess this approach has some good points.
> > 
> >  1.Preallocating reserved area is not mandatory at boot time.
> >And the reserved area can be distributed in small pieces
> >like original kexec does.
> > 
> >  2.Special linking is not required for kdump kernel.
> >Each kdump kernel can be linked in the same way,
> >where the original kernel exists.
> > 
> > Am I missing something?
> 
> Preallocating the reserved area is largely to keep it from
> being the target of DMA accesses.  Since we are not able
> to shutdown any of the drivers in the primary kernel running
> in a normal swath of memory sounds like a good way to get
> yourself stomped at the worst possible time.

So what do you think of my other idea?

I think we can always make a kdump kernel mapped to the same virtual
address. So we will be free from caring about the physical address
where the kdump kernel is loaded.

I believe the memsection functionality which LHMS project is working
on would help this.

+
|
|
(user space)
|
|
  physical  | virtual
  memory| space
 +  +
 |  |
 |  |
 |  |
 + .+
original |   .  | map kdump kernel here
kernel   | .|
 |   .  |
 | .   .+
 +   .   .  |
 | .   .|
 +   .  |
  kdump  | .|
  kernel |   .  |
 | .|
 +      |
 |  |
 |  |
 |  |



Thanks,
Hirokazu Takahashi.


Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

2005-02-03 Thread Hirokazu Takahashi
Hi Vivek, 

> > Hi Vivek and Eric,
> > 
> > IMHO, why don't we swap not only the contents of the top 640K
> > but also kernel working memory for kdump kernel?
> 
> 
> Initial patches of kdump had adopted the same approach but given the
> fact devices are not stopped during transition to new kernel after a
> panic, it carried inherent risk of some DMA going on and corrupting the
> new kernel/data structures. Hence the idea of running the kernel from a
> reserved location came up. This should be DMA safe as long as DMA is not
> misdirected.

I see, that makes sense.
But I'm not sure yet that it's safe to access the top 640KB.
I wonder how kmalloc(GFP_DMA) works in a kdump kernel.

Thanks,
Hirokazu Takahashi.


Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

2005-02-02 Thread Hirokazu Takahashi
Hi Vivek and Eric,

IMHO, why don't we swap not only the contents of the top 640K
but also kernel working memory for kdump kernel?

I guess this approach has some good points.

 1.Preallocating reserved area is not mandatory at boot time.
   And the reserved area can be distributed in small pieces
   like original kexec does.

 2.Special linking is not required for kdump kernel.
   Each kdump kernel can be linked in the same way,
   where the original kernel exists.

Am I missing something?


 physical memory
   +---+
   | 640K  +
   |...|   |
   |   | copy
   +---+   |
   |   |   |
   |original<-+|
   |kernel |  ||
   |   |  ||
   |...|  ||
   |   |  ||
   |   |  ||
   |   | swap  |
   |   |  ||
   +---+  ||
   |reserved<--+
   |area   |  |
   |   |  |
   |kdump  |<-+
   |kernel |
   +---+
   |   |
   |   |
   |   |
   +---+



> Hi Eric,
> 
> It looks like we are looking at things a little differently. I
> see a portion of the picture in your mind, but obviously not 
> entirely.
> 
> Perhaps, we need to step back and iron out in specific terms what 
> the interface between the two kernels should be in the crash dump
> case, and the distribution of responsibility between kernel, user space
> and the user. 
> 
> [BTW, the patch was intended as a step in development up for
> comment early enough to be able to get agreement on the interface
> and think issues through to more completeness before going 
> too far. Sorry, if that wasn't apparent.]
> 
> When you say "evil intermingling", I'm guessing you mean the
> "crashbackup=" boot parameter ? If so, then yes, I agree it'd
> be nice to find a way around it that doesn't push hardcoding
> elsewhere.
> 
> Let me explain the interface/approach I was looking at.
> 
> 1.First kernel reserves some area of memory for crash/capture kernel as
> specified by [EMAIL PROTECTED] boot time parameter.
> 
> 2.First kernel marks the top 640K of this area as backup area. (If
> architecture needs it.) This is sort of a hardcoding and probably this
> space reservation can be managed from user space as well as mentioned by
> you in this mail below.
> 
> 3. Location of backup region is exported through /proc/iomem which can
> be read by user space utility to pass this information to purgatory code
> to determine where to copy the first 640K.
> 
> Note that we do not make any additional reservation for the 
> backup region. We carve this out from the top of the already 
> reserved region and export it through /proc/iomem so that 
> the user space code and the capture kernel code need not 
> make any assumptions about where this region is located.
> 
> 4. Once the capture kernel boots, it needs to know the location of
> backup region for two purposes.
> 
> a. It should not overwrite the backup region.
> 
> b. There needs to be a way for the capture tool to access the original
>contents of the backed up region
> 
> Boot time parameter [EMAIL PROTECTED] has been provided to pass this
> information to capture kernel. This parameter is valid only for capture
> kernel and becomes effective only if CONFIG_CRASH_DUMP is enabled.
> 
> 
> > What is wrong with user space doing all of the extra space
> > reservation?
> 
> Just for clarity, are you suggesting kexec-tools creating an additional
> segment for the backup region and pass the information to kernel.
> 
> There is no problem in doing reservation from user space except
> one. How does the user and in-turn capture kernel come to know the
> location of backup region, assuming that the user is going to provide
> the exactmap for capture kernel to boot into.
> 
> Just a thought, is it  a good idea for kexec-tools to be creating and
> passing memmap parameters doing appropriate adjustment for backup
> region.
> 
> I had another question. How is the starting location of elf headers 
> communicated to capture tool? Is parameter segment a good idea? or 
> some hardcoding? 
> 
> Another approach can be that backup area information is encoded in elf
> headers and capture kernel is booted with modified memmap (User gets
> backup region information from /proc/iomem) and capture tool can
> extract backup area information from elf headers as stored by first
> kernel.
> 
> Could you please elaborate a little more on what aspect of your view
> differs from the above.
> 
> Thanks
> Vivek

Thanks,
Hirokazu Takahashi.



Re: Hugepages demand paging V2 [3/8]: simple numa compatible allocator

2005-02-02 Thread Hirokazu Takahashi
Hi Christoph,


> Changelog
>   * Simple NUMA compatible allocation of hugepages in the nearest node
> 
> Index: linux-2.6.9/mm/hugetlb.c
> ===
> --- linux-2.6.9.orig/mm/hugetlb.c 2004-10-22 13:28:27.0 -0700
> +++ linux-2.6.9/mm/hugetlb.c  2004-10-25 16:56:22.0 -0700
> @@ -32,14 +32,17 @@
>  {
>   int nid = numa_node_id();
>   struct page *page = NULL;
> -
> - if (list_empty(&hugepage_freelists[nid])) {
> - for (nid = 0; nid < MAX_NUMNODES; ++nid)
> - if (!list_empty(&hugepage_freelists[nid]))
> - break;
> + struct zonelist *zonelist = NODE_DATA(nid)->node_zonelists;


I think the previous line should be replaced with

struct zonelist *zonelist = NODE_DATA(nid)->node_zonelists + __GFP_HIGHMEM;

because NODE_DATA(nid)->node_zonelists by itself means the zonelist for
__GFP_DMA zones. __GFP_HIGHMEM would be more suitable for hugetlb pages.
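
For reference, this matches how the allocator itself picked a zonelist in
kernels of that era; paraphrased from memory rather than quoted exactly,
alloc_pages_node() selects roughly:

	/* the per-node zonelist array is indexed by the GFP zone bits, so a
	 * zero offset gives the DMA-first list and __GFP_HIGHMEM selects the
	 * highmem-first list */
	struct zonelist *zl = NODE_DATA(nid)->node_zonelists +
					(gfp_mask & GFP_ZONEMASK);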


> + struct zone **zones = zonelist->zones;
> + struct zone *z;
> + int i;
> +
> + for(i=0; (z = zones[i])!= NULL; i++) {
> + nid = z->zone_pgdat->node_id;
> + if (!list_empty(&hugepage_freelists[nid]))
> + break;
>   }
> - if (nid >= 0 && nid < MAX_NUMNODES &&
> - !list_empty(&hugepage_freelists[nid])) {
> + if (z) {
>   page = list_entry(hugepage_freelists[nid].next,
> struct page, lru);
>   list_del(&page->lru);
> 
> -

Thanks,
Hirokazu Takahashi.


Re: Hugepages demand paging V2 [1/8]: hugetlb fault handler

2005-01-18 Thread Hirokazu Takahashi
Hi,

> ChangeLog
>   * provide huge page fault handler and related things



> Index: linux-2.6.9/fs/hugetlbfs/inode.c
> ===
> --- linux-2.6.9.orig/fs/hugetlbfs/inode.c 2004-10-18 14:55:07.0 
> -0700
> +++ linux-2.6.9/fs/hugetlbfs/inode.c  2004-10-21 14:50:14.0 -0700
> @@ -79,10 +79,6 @@
>   if (!(vma->vm_flags & VM_WRITE) && len > inode->i_size)
>   goto out;
> 
> - ret = hugetlb_prefault(mapping, vma);
> - if (ret)
> - goto out;
> -
>   if (inode->i_size < len)
>   inode->i_size = len;
>  out:

hugetlbfs_file_mmap() may fail with a weird error, as it now returns the
uninitialized variable "ret".
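
With the hugetlb_prefault() call gone, nothing on the success path assigns
ret any more. A minimal fix (a sketch of one possible follow-up, not an
actual posted patch) would be to set it explicitly once the checks pass:

 	if (!(vma->vm_flags & VM_WRITE) && len > inode->i_size)
 		goto out;
 
+	ret = 0;	/* success path no longer sets ret via hugetlb_prefault() */
 	if (inode->i_size < len)
 		inode->i_size = len;
 out: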


Thanks.
Hirokazu Takahashi.