[Devel] Re: C/R and stdio redirection

2010-10-05 Thread Sukadev Bhattiprolu
Greg Kurz [gk...@fr.ibm.com] wrote:
| On Tue, 2010-09-07 at 13:03 -0700, Sukadev Bhattiprolu wrote:
| > Suppose we create a container and redirect its stdout/stderr as follows:
| > 
| > lxc-execute --name foo -- /path/to/app > /tmp/xyz.out 2>&1
| > 
| > If we attempt to checkpoint the container 'foo', we fail because one of
| > the fds in the application refers to /tmp/xyz.out, which is also in use
| > outside the container (specifically, sys_checkpoint() fails due to the
| > "alien mount ns" check in ckpt_fill_fname()).
| > 
| > It can be argued that 'foo' is not a strict container (since it shares the
| > fd with another container).  For this reason, we currently need the
| > CHECKPOINT_SUBTREE flag in lxc-checkpoint.
| > 
| > We initially thought that solving mount namespaces would solve this, but
| > realized that they are separate problems: mount-namespace C/R addresses
| > preserving the mounts within the container, while /tmp/xyz.out is outside
| > the container.
| > 
| > So if an application container needs to redirect stdio as above, we should
| > either 
| > a) disable/ignore the alien-mount-ns check or 
| > 
| > b) try to start the application with something like:
| > 
| > $ cat /tmp/wrapper
| > /path/to/app > /tmp/xyz.out 2>&1
| > 
| > $ lxc-execute --name foo --  /tmp/wrapper
| > 
| > with the difference being /tmp/xyz.out is now inside the container's /tmp
| > filesystem rather than in the parent container.
| > 
| > Maybe we can go with approach 'a' above only if CHECKPOINT_SUBTREE is also
| > set - we had discussed this before and considered it hacky.
| > 
| > Or are there other solutions to this stdio redirection issue ?
| > 
| 
| To be more accurate, this issue is about fds leaking from a parent
| container to its descendants. The fd numbers may be anything other than
| 0, 1 or 2, and the underlying files may be regular files, pipes,
| sockets... For example, in the HPC world, stdio are often sockets
| inherited from an rshd-like daemon.

I agree that fd substitution is the right way to go.

However, Matt Helsley and I were discussing this and wondered if we should
ignore the redirection and expect the user to specify it during restart.

i.e., if the container was created like this:

lxc-execute --name foo -- /path/to/app > /tmp/xyz.out 2>&1

and checkpointed, can we expect the user to restart it like this?

lxc-restart --name foo --statefile ckpt.img >> /tmp/xyz.out 

i.e., the user has to redo the redirection, or the output would go to stdout.

Doing this would seem consistent with a (bogus) container like:

lxc-execute --name foo -- /path/to/app | sort

If this container is checkpointed and restarted, we can't really redirect
the output of the app to 'sort', right?  So expecting the user to
redo the redirection on restart would treat both redirections ('>'
and '|') in a consistent way.

Sukadev
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 00/10] memcg: per cgroup dirty page accounting

2010-10-05 Thread Balbir Singh
* Greg Thelen  [2010-10-03 23:57:55]:

> This patch set provides the ability for each cgroup to have independent dirty
> page limits.
> 
> Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
> page cache used by a cgroup.  So, in case of multiple cgroup writers, they 
> will
> not be able to consume more than their designated share of dirty pages and 
> will
> be forced to perform write-out if they cross that limit.
> 
> These patches were developed and tested on mmotm 2010-09-28-16-13.  The 
> patches
> are based on a series proposed by Andrea Righi in Mar 2010.
> 
> Overview:
> - Add page_cgroup flags to record when pages are dirty, in writeback, or nfs
>   unstable.
> - Extend mem_cgroup to record the total number of pages in each of the 
>   interesting dirty states (dirty, writeback, unstable_nfs).  
> - Add dirty parameters similar to the system-wide  /proc/sys/vm/dirty_*
>   limits to mem_cgroup.  The mem_cgroup dirty parameters are accessible
>   via cgroupfs control files.
> - Consider both system and per-memcg dirty limits in page writeback when
>   deciding to queue background writeback or block for foreground writeback.
> 
> Known shortcomings:
> - When a cgroup dirty limit is exceeded, then bdi writeback is employed to
>   writeback dirty inodes.  Bdi writeback considers inodes from any cgroup, not
>   just inodes contributing dirty pages to the cgroup exceeding its limit.  
> 
> Performance measurements:
> - kernel builds are unaffected unless run with a small dirty limit.
> - all data collected with CONFIG_CGROUP_MEM_RES_CTLR=y.
> - dd has three data points (in secs) for three data sizes (100M, 200M, and 
> 1G).  
>   As expected, dd slows when it exceeds its cgroup dirty limit.
> 
>kernel_build  dd
> mmotm          2:37   0.18, 0.38, 1.65
>   root_memcg
> 
> mmotm          2:37   0.18, 0.35, 1.66
>   non-root_memcg
> 
> mmotm+patches  2:37   0.18, 0.35, 1.68
>   root_memcg
> 
> mmotm+patches  2:37   0.19, 0.35, 1.69
>   non-root_memcg
> 
> mmotm+patches  2:37   0.19, 2.34, 22.82
>   non-root_memcg
>   150 MiB memcg dirty limit
> 
> mmotm+patches  3:58   1.71, 3.38, 17.33
>   non-root_memcg
>   1 MiB memcg dirty limit
>

Greg, could you please try the parallel page fault test?  Commits
0c3e73e84fe3f64cf1c2e8bb4e91e8901cbcdc38 and
569b846df54ffb2827b83ce3244c5f032394cba4 have examples.

-- 
Three Cheers,
Balbir


[Devel] Re: [PATCH 02/10] memcg: document cgroup dirty memory interfaces

2010-10-05 Thread Daisuke Nishimura
On Sun,  3 Oct 2010 23:57:57 -0700
Greg Thelen  wrote:

> Document cgroup dirty memory interfaces and statistics.
> 
> Signed-off-by: Andrea Righi 
> Signed-off-by: Greg Thelen 

I think you will change "nfs" to "nfs_unstable", but anyway,

Acked-by: Daisuke Nishimura 

Thanks
Daisuke Nishimura.

> ---
>  Documentation/cgroups/memory.txt |   37 +
>  1 files changed, 37 insertions(+), 0 deletions(-)
> 
> diff --git a/Documentation/cgroups/memory.txt 
> b/Documentation/cgroups/memory.txt
> index 7781857..eab65e2 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -385,6 +385,10 @@ mapped_file  - # of bytes of mapped file (includes 
> tmpfs/shmem)
>  pgpgin   - # of pages paged in (equivalent to # of charging 
> events).
>  pgpgout  - # of pages paged out (equivalent to # of uncharging 
> events).
>  swap - # of bytes of swap usage
> +dirty- # of bytes that are waiting to get written back to 
> the disk.
> +writeback- # of bytes that are actively being written back to the disk.
> +nfs  - # of bytes sent to the NFS server, but not yet committed to
> + the actual storage.
>  inactive_anon- # of bytes of anonymous memory and swap cache memory 
> on
>   LRU list.
>  active_anon  - # of bytes of anonymous and swap cache memory on active
> @@ -453,6 +457,39 @@ memory under it will be reclaimed.
>  You can reset failcnt by writing 0 to failcnt file.
>  # echo 0 > .../memory.failcnt
>  
> +5.5 dirty memory
> +
> +Control the maximum amount of dirty pages a cgroup can have at any given 
> time.
> +
> +Limiting dirty memory is like fixing the max amount of dirty (hard to 
> reclaim)
> +page cache used by a cgroup.  So, in case of multiple cgroup writers, they 
> will
> +not be able to consume more than their designated share of dirty pages and 
> will
> +be forced to perform write-out if they cross that limit.
> +
> +The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*.  
> It
> +is possible to configure a limit to trigger both a direct writeback or a
> +background writeback performed by per-bdi flusher threads.  The root cgroup
> +memory.dirty_* control files are read-only and match the contents of
> +the /proc/sys/vm/dirty_* files.
> +
> +Per-cgroup dirty limits can be set using the following files in the cgroupfs:
> +
> +- memory.dirty_ratio: the amount of dirty memory (expressed as a percentage 
> of
> +  cgroup memory) at which a process generating dirty pages will itself start
> +  writing out dirty data.
> +
> +- memory.dirty_bytes: the amount of dirty memory (expressed in bytes) in the
> +  cgroup at which a process generating dirty pages will start itself writing 
> out
> +  dirty data.
> +
> +- memory.dirty_background_ratio: the amount of dirty memory of the cgroup
> +  (expressed as a percentage of cgroup memory) at which background writeback
> +  kernel threads will start writing out dirty data.
> +
> +- memory.dirty_background_bytes: the amount of dirty memory (expressed in 
> bytes)
> +  in the cgroup at which background writeback kernel threads will start 
> writing
> +  out dirty data.
> +
>  6. Hierarchy support
>  
>  The memory controller supports a deep hierarchy and hierarchical accounting.
> -- 
> 1.7.1
> 
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 03/10] memcg: create extensible page stat update routines

2010-10-05 Thread Greg Thelen
Minchan Kim  writes:

> On Wed, Oct 6, 2010 at 4:59 AM, Greg Thelen  wrote:
>> Minchan Kim  writes:
>>
>>> On Sun, Oct 03, 2010 at 11:57:58PM -0700, Greg Thelen wrote:
 Replace usage of the mem_cgroup_update_file_mapped() memcg
 statistic update routine with two new routines:
 * mem_cgroup_inc_page_stat()
 * mem_cgroup_dec_page_stat()

 As before, only the file_mapped statistic is managed.  However,
 these more general interfaces allow for new statistics to be
 more easily added.  New statistics are added with memcg dirty
 page accounting.

 Signed-off-by: Greg Thelen 
 Signed-off-by: Andrea Righi 
 ---
  include/linux/memcontrol.h |   31 ---
  mm/memcontrol.c            |   17 -
  mm/rmap.c                  |    4 ++--
  3 files changed, 38 insertions(+), 14 deletions(-)

 diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
 index 159a076..7c7bec4 100644
 --- a/include/linux/memcontrol.h
 +++ b/include/linux/memcontrol.h
 @@ -25,6 +25,11 @@ struct page_cgroup;
  struct page;
  struct mm_struct;

 +/* Stats that can be updated by kernel. */
 +enum mem_cgroup_write_page_stat_item {
 +    MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
 +};
 +
>>>
>>> mem_cgroup_"write"_page_stat_item?
>>> Does "write" make sense to abstract page_state generally?
>>
>> First I will summarize the portion of the design relevant to this
>> comment:
>>
>> This patch series introduces two sets of memcg statistics.
>> a) the writable set of statistics the kernel updates when pages change
>>   state (example: when a page becomes dirty) using:
>>     mem_cgroup_inc_page_stat(struct page *page,
>>                                enum mem_cgroup_write_page_stat_item idx)
>>     mem_cgroup_dec_page_stat(struct page *page,
>>                                enum mem_cgroup_write_page_stat_item idx)
>>
>> b) the read-only set of statistics the kernel queries to measure the
>>   amount of dirty memory used by the current cgroup using:
>>     s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item)
>>
>>   This read-only set of statistics is a set of higher-level conceptual
>>   counters.  For example, MEMCG_NR_DIRTYABLE_PAGES is the sum of the
>>   counts of pages in various states (active + inactive).  mem_cgroup
>>   exports this value as a higher level counter rather than individual
>>   counters (active & inactive) to minimize the number of calls into
>>   mem_cgroup_page_stat().  This avoids extra calls to cgroup tree
>>   iteration with for_each_mem_cgroup_tree().
>>
>> Notice that each of the two sets of statistics is addressed by a
>> different type, mem_cgroup_{read vs write}_page_stat_item.
>>
>> This particular patch (memcg: create extensible page stat update
>> routines) introduces part of this design.  A later patch I emailed
>> (memcg: add dirty limits to mem_cgroup) added
>> mem_cgroup_read_page_stat_item.
>>
>>
>> I think the code would read better if I renamed
>> enum mem_cgroup_write_page_stat_item to
>> enum mem_cgroup_update_page_stat_item.
>>
>> Would this address your concern?
>
> Thanks for the kind explanation.
> I understand your concept.
>
> I think you treat update and query as completely different levels of
> abstraction, but you use similar terms for both.
> The terms (write vs. read) make it more confusing.
>
> How about renaming following as?
>
> 1. mem_cgroup_write_page_stat_item -> mem_cgroup_page_stat_item
> 2. mem_cgroup_read_page_stat_item -> mem_cgroup_nr_pages_item
>
> At least that makes the code easier for me to understand.
> But it's just my preference. If others think your naming is more
> desirable, I am not strongly against it.

I think your suggestion is good.  I will include it in the next revision
of the patch series.

> Thanks, Greg.
>
>>
>> --
>> Greg
>>


[Devel] Re: [PATCH 01/10] memcg: add page_cgroup flags for dirty page tracking

2010-10-05 Thread Daisuke Nishimura
On Sun,  3 Oct 2010 23:57:56 -0700
Greg Thelen  wrote:

> Add additional flags to page_cgroup to track dirty pages
> within a mem_cgroup.
> 
> Signed-off-by: KAMEZAWA Hiroyuki 
> Signed-off-by: Andrea Righi 
> Signed-off-by: Greg Thelen 

Acked-by: Daisuke Nishimura 

Thanks,
Daisuke Nishimura.

> ---
>  include/linux/page_cgroup.h |   23 +++
>  1 files changed, 23 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index 5bb13b3..b59c298 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -40,6 +40,9 @@ enum {
>   PCG_USED, /* this object is in use. */
>   PCG_ACCT_LRU, /* page has been accounted for */
>   PCG_FILE_MAPPED, /* page is accounted as "mapped" */
> + PCG_FILE_DIRTY, /* page is dirty */
> + PCG_FILE_WRITEBACK, /* page is under writeback */
> + PCG_FILE_UNSTABLE_NFS, /* page is NFS unstable */
>   PCG_MIGRATION, /* under page migration */
>  };
>  
> @@ -59,6 +62,10 @@ static inline void ClearPageCgroup##uname(struct 
> page_cgroup *pc)  \
>  static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
>   { return test_and_clear_bit(PCG_##lname, &pc->flags);  }
>  
> +#define TESTSETPCGFLAG(uname, lname) \
> +static inline int TestSetPageCgroup##uname(struct page_cgroup *pc)   \
> + { return test_and_set_bit(PCG_##lname, &pc->flags);  }
> +
>  TESTPCGFLAG(Locked, LOCK)
>  
>  /* Cache flag is set only once (at allocation) */
> @@ -80,6 +87,22 @@ SETPCGFLAG(FileMapped, FILE_MAPPED)
>  CLEARPCGFLAG(FileMapped, FILE_MAPPED)
>  TESTPCGFLAG(FileMapped, FILE_MAPPED)
>  
> +SETPCGFLAG(FileDirty, FILE_DIRTY)
> +CLEARPCGFLAG(FileDirty, FILE_DIRTY)
> +TESTPCGFLAG(FileDirty, FILE_DIRTY)
> +TESTCLEARPCGFLAG(FileDirty, FILE_DIRTY)
> +TESTSETPCGFLAG(FileDirty, FILE_DIRTY)
> +
> +SETPCGFLAG(FileWriteback, FILE_WRITEBACK)
> +CLEARPCGFLAG(FileWriteback, FILE_WRITEBACK)
> +TESTPCGFLAG(FileWriteback, FILE_WRITEBACK)
> +
> +SETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
> +CLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
> +TESTPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
> +TESTCLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
> +TESTSETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
> +
>  SETPCGFLAG(Migration, MIGRATION)
>  CLEARPCGFLAG(Migration, MIGRATION)
>  TESTPCGFLAG(Migration, MIGRATION)
> -- 
> 1.7.1
> 


[Devel] Re: [PATCH 10/10] memcg: check memcg dirty limits in page writeback

2010-10-05 Thread Minchan Kim
On Mon, Oct 4, 2010 at 3:58 PM, Greg Thelen  wrote:
> If the current process is in a non-root memcg, then
> global_dirty_limits() will consider the memcg dirty limit.
> This allows different cgroups to have distinct dirty limits
> which trigger direct and background writeback at different
> levels.
>
> Signed-off-by: Andrea Righi 
> Signed-off-by: Greg Thelen 
> ---
>  mm/page-writeback.c |   87 
> ++-
>  1 files changed, 72 insertions(+), 15 deletions(-)
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index a0bb3e2..c1db336 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -180,7 +180,7 @@ static unsigned long highmem_dirtyable_memory(unsigned 
> long total)
>  * Returns the number of pages that can currently be freed and used
>  * by the kernel for direct mappings.
>  */
> -static unsigned long determine_dirtyable_memory(void)
> +static unsigned long get_global_dirtyable_memory(void)
>  {
>        unsigned long x;
>
> @@ -192,6 +192,58 @@ static unsigned long determine_dirtyable_memory(void)
>        return x + 1;   /* Ensure that we never return 0 */
>  }
>

Just a nitpick.
You seem to like the get_ prefix, but I think it's redundant and just
makes the function name longer without any benefit.
Many places in the kernel don't use get_xxx naming.

-- 
Kind regards,
Minchan Kim


[Devel] Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()

2010-10-05 Thread Minchan Kim
On Wed, Oct 6, 2010 at 8:26 AM, Greg Thelen  wrote:
> Minchan Kim  writes:
>
>> On Sun, Oct 03, 2010 at 11:57:59PM -0700, Greg Thelen wrote:
>>> If pages are being migrated from a memcg, then updates to that
>>> memcg's page statistics are protected by grabbing a bit spin lock
>>> using lock_page_cgroup().  In an upcoming commit memcg dirty page
>>> accounting will be updating memcg page accounting (specifically:
>>> num writeback pages) from softirq.  Avoid a deadlocking nested
>>> spin lock attempt by disabling interrupts on the local processor
>>> when grabbing the page_cgroup bit_spin_lock in lock_page_cgroup().
>>> This avoids the following deadlock:
>>> statistic
>>>       CPU 0             CPU 1
>>>                     inc_file_mapped
>>>                     rcu_read_lock
>>>   start move
>>>   synchronize_rcu
>>>                     lock_page_cgroup
>>>                       softirq
>>>                       test_clear_page_writeback
>>>                       mem_cgroup_dec_page_stat(NR_WRITEBACK)
>>>                       rcu_read_lock
>>>                       lock_page_cgroup   /* deadlock */
>>>                       unlock_page_cgroup
>>>                       rcu_read_unlock
>>>                     unlock_page_cgroup
>>>                     rcu_read_unlock
>>>
>>> By disabling interrupts in lock_page_cgroup, nested calls
>>> are avoided.  The softirq would be delayed until after inc_file_mapped
>>> enables interrupts when calling unlock_page_cgroup().
>>>
>>> The normal, fast path, of memcg page stat updates typically
>>> does not need to call lock_page_cgroup(), so this change does
>>> not affect the performance of the common case page accounting.
>>>
>>> Signed-off-by: Andrea Righi 
>>> Signed-off-by: Greg Thelen 
>>> ---
>>>  include/linux/page_cgroup.h |    8 +-
>>>  mm/memcontrol.c             |   51 
>>> +-
>>>  2 files changed, 36 insertions(+), 23 deletions(-)
>>>
>>> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
>>> index b59c298..872f6b1 100644
>>> --- a/include/linux/page_cgroup.h
>>> +++ b/include/linux/page_cgroup.h
>>> @@ -117,14 +117,18 @@ static inline enum zone_type page_cgroup_zid(struct 
>>> page_cgroup *pc)
>>>      return page_zonenum(pc->page);
>>>  }
>>>
>>> -static inline void lock_page_cgroup(struct page_cgroup *pc)
>>> +static inline void lock_page_cgroup(struct page_cgroup *pc,
>>> +                                unsigned long *flags)
>>>  {
>>> +    local_irq_save(*flags);
>>>      bit_spin_lock(PCG_LOCK, &pc->flags);
>>>  }
>>
>> Hmm. Let me ask questions.
>>
>> 1. Why do you add a new irq-disable region in a general function?
>> I think __do_fault is one of the fast paths.
>
> This is true.  I used pft to measure the cost of this extra locking
> code.  This pft workload exercises this memcg call stack:
>        lock_page_cgroup+0x39/0x5b
>        __mem_cgroup_commit_charge+0x2c/0x98
>        mem_cgroup_charge_common+0x66/0x76
>        mem_cgroup_newpage_charge+0x40/0x4f
>        handle_mm_fault+0x2e3/0x869
>        do_page_fault+0x286/0x29b
>        page_fault+0x1f/0x30
>
> I ran 100 iterations of "pft -m 8g -t 16 -a" and focused on the
> flt/cpu/s.
>
> First I established a performance baseline using upstream mmotm locking
> (not disabling interrupts).
>        100 samples: mean 51930.16383  stddev 2.032% (1055.40818272)
>
> Then I introduced this patch, which disabled interrupts in
> lock_page_cgroup():
>        100 samples: mean 52174.17434  stddev 1.306% (681.14442646)
>
> Then I replaced this patch's usage of local_irq_save/restore() with
> local_bh_disable/enable().
>        100 samples: mean 51810.58591  stddev 1.892% (980.340335322)
>
> The proposed patch (#2) actually improves allocation performance by
> 0.47% when compared to the baseline (#1).  However, I believe that this
> is in the statistical noise.  This particular workload does not seem to
> be affected by this patch.

Yes. But disabling irqs brings an interrupt latency problem on top of
the instruction overhead, and interrupt latency is my concern.
In my experience, too much irq disabling makes irq handler latency
very poor on embedded systems.
For example, irq handler latency is an important factor for ARM perf
when capturing the program counter, because ARM perf doesn't use an
NMI handler.

>
>> Could you disable softirq using _local_bh_disable_ not in general function
>> but in your context?
>
> lock_page_cgroup() is only used by mem_cgroup in memcontrol.c.
>
> local_bh_disable() should also work instead of my proposed patch, which
> used local_irq_save/restore().  local_bh_disable() will not disable all
> interrupts so it should have less impact.  But I think that usage of
> local_bh_disable() is still something that has to happen in the general
> lock_page_cgroup() function.  The softirq can occur at an arbitrary time
> and processor with the possibility of interrupting anyone who does not
> have interrupts or softirq disabled.  Th

[Devel] Re: [PATCH 03/10] memcg: create extensible page stat update routines

2010-10-05 Thread Minchan Kim
On Wed, Oct 6, 2010 at 4:59 AM, Greg Thelen  wrote:
> Minchan Kim  writes:
>
>> On Sun, Oct 03, 2010 at 11:57:58PM -0700, Greg Thelen wrote:
>>> Replace usage of the mem_cgroup_update_file_mapped() memcg
>>> statistic update routine with two new routines:
>>> * mem_cgroup_inc_page_stat()
>>> * mem_cgroup_dec_page_stat()
>>>
>>> As before, only the file_mapped statistic is managed.  However,
>>> these more general interfaces allow for new statistics to be
>>> more easily added.  New statistics are added with memcg dirty
>>> page accounting.
>>>
>>> Signed-off-by: Greg Thelen 
>>> Signed-off-by: Andrea Righi 
>>> ---
>>>  include/linux/memcontrol.h |   31 ---
>>>  mm/memcontrol.c            |   17 -
>>>  mm/rmap.c                  |    4 ++--
>>>  3 files changed, 38 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>> index 159a076..7c7bec4 100644
>>> --- a/include/linux/memcontrol.h
>>> +++ b/include/linux/memcontrol.h
>>> @@ -25,6 +25,11 @@ struct page_cgroup;
>>>  struct page;
>>>  struct mm_struct;
>>>
>>> +/* Stats that can be updated by kernel. */
>>> +enum mem_cgroup_write_page_stat_item {
>>> +    MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
>>> +};
>>> +
>>
>> mem_cgroup_"write"_page_stat_item?
>> Does "write" make sense to abstract page_state generally?
>
> First I will summarize the portion of the design relevant to this
> comment:
>
> This patch series introduces two sets of memcg statistics.
> a) the writable set of statistics the kernel updates when pages change
>   state (example: when a page becomes dirty) using:
>     mem_cgroup_inc_page_stat(struct page *page,
>                                enum mem_cgroup_write_page_stat_item idx)
>     mem_cgroup_dec_page_stat(struct page *page,
>                                enum mem_cgroup_write_page_stat_item idx)
>
> b) the read-only set of statistics the kernel queries to measure the
>   amount of dirty memory used by the current cgroup using:
>     s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item)
>
>   This read-only set of statistics is a set of higher-level conceptual
>   counters.  For example, MEMCG_NR_DIRTYABLE_PAGES is the sum of the
>   counts of pages in various states (active + inactive).  mem_cgroup
>   exports this value as a higher level counter rather than individual
>   counters (active & inactive) to minimize the number of calls into
>   mem_cgroup_page_stat().  This avoids extra calls to cgroup tree
>   iteration with for_each_mem_cgroup_tree().
>
> Notice that each of the two sets of statistics is addressed by a
> different type, mem_cgroup_{read vs write}_page_stat_item.
>
> This particular patch (memcg: create extensible page stat update
> routines) introduces part of this design.  A later patch I emailed
> (memcg: add dirty limits to mem_cgroup) added
> mem_cgroup_read_page_stat_item.
>
>
> I think the code would read better if I renamed
> enum mem_cgroup_write_page_stat_item to
> enum mem_cgroup_update_page_stat_item.
>
> Would this address your concern?

Thanks for the kind explanation.
I understand your concept.

I think you treat update and query as completely different levels of
abstraction, but you use similar terms for both.
The terms (write vs. read) make it more confusing.

How about renaming following as?

1. mem_cgroup_write_page_stat_item -> mem_cgroup_page_stat_item
2. mem_cgroup_read_page_stat_item -> mem_cgroup_nr_pages_item

At least that makes the code easier for me to understand.
But it's just my preference. If others think your naming is more
desirable, I am not strongly against it.

Thanks, Greg.

>
> --
> Greg
>



-- 
Kind regards,
Minchan Kim


[Devel] Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()

2010-10-05 Thread Greg Thelen
Minchan Kim  writes:

> On Sun, Oct 03, 2010 at 11:57:59PM -0700, Greg Thelen wrote:
>> If pages are being migrated from a memcg, then updates to that
>> memcg's page statistics are protected by grabbing a bit spin lock
>> using lock_page_cgroup().  In an upcoming commit memcg dirty page
>> accounting will be updating memcg page accounting (specifically:
>> num writeback pages) from softirq.  Avoid a deadlocking nested
>> spin lock attempt by disabling interrupts on the local processor
>> when grabbing the page_cgroup bit_spin_lock in lock_page_cgroup().
>> This avoids the following deadlock:
>> statistic
>>       CPU 0             CPU 1
>>                     inc_file_mapped
>>                     rcu_read_lock
>>   start move
>>   synchronize_rcu
>>                     lock_page_cgroup
>>                       softirq
>>                       test_clear_page_writeback
>>                       mem_cgroup_dec_page_stat(NR_WRITEBACK)
>>                       rcu_read_lock
>>                       lock_page_cgroup   /* deadlock */
>>                       unlock_page_cgroup
>>                       rcu_read_unlock
>>                     unlock_page_cgroup
>>                     rcu_read_unlock
>> 
>> By disabling interrupts in lock_page_cgroup, nested calls
>> are avoided.  The softirq would be delayed until after inc_file_mapped
>> enables interrupts when calling unlock_page_cgroup().
>> 
>> The normal, fast path, of memcg page stat updates typically
>> does not need to call lock_page_cgroup(), so this change does
>> not affect the performance of the common case page accounting.
>> 
>> Signed-off-by: Andrea Righi 
>> Signed-off-by: Greg Thelen 
>> ---
>>  include/linux/page_cgroup.h |8 +-
>>  mm/memcontrol.c |   51 
>> +-
>>  2 files changed, 36 insertions(+), 23 deletions(-)
>> 
>> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
>> index b59c298..872f6b1 100644
>> --- a/include/linux/page_cgroup.h
>> +++ b/include/linux/page_cgroup.h
>> @@ -117,14 +117,18 @@ static inline enum zone_type page_cgroup_zid(struct 
>> page_cgroup *pc)
>>  return page_zonenum(pc->page);
>>  }
>>  
>> -static inline void lock_page_cgroup(struct page_cgroup *pc)
>> +static inline void lock_page_cgroup(struct page_cgroup *pc,
>> +unsigned long *flags)
>>  {
>> +local_irq_save(*flags);
>>  bit_spin_lock(PCG_LOCK, &pc->flags);
>>  }
>
> Hmm. Let me ask questions. 
>
> 1. Why do you add a new irq-disable region in a general function?
> I think __do_fault is one of the fast paths.

This is true.  I used pft to measure the cost of this extra locking
code.  This pft workload exercises this memcg call stack:
lock_page_cgroup+0x39/0x5b
__mem_cgroup_commit_charge+0x2c/0x98
mem_cgroup_charge_common+0x66/0x76
mem_cgroup_newpage_charge+0x40/0x4f
handle_mm_fault+0x2e3/0x869
do_page_fault+0x286/0x29b
page_fault+0x1f/0x30

I ran 100 iterations of "pft -m 8g -t 16 -a" and focused on the
flt/cpu/s.

First I established a performance baseline using upstream mmotm locking
(not disabling interrupts).
100 samples: mean 51930.16383  stddev 2.032% (1055.40818272)

Then I introduced this patch, which disabled interrupts in
lock_page_cgroup():
100 samples: mean 52174.17434  stddev 1.306% (681.14442646)

Then I replaced this patch's usage of local_irq_save/restore() with
local_bh_disable/enable().
100 samples: mean 51810.58591  stddev 1.892% (980.340335322)

The proposed patch (#2) actually improves allocation performance by
0.47% when compared to the baseline (#1).  However, I believe that this
is in the statistical noise.  This particular workload does not seem to
be affected by this patch.

> Could you disable softirq using _local_bh_disable_ not in general function 
> but in your context?

lock_page_cgroup() is only used by mem_cgroup in memcontrol.c.

local_bh_disable() should also work instead of my proposed patch, which
used local_irq_save/restore().  local_bh_disable() will not disable all
interrupts so it should have less impact.  But I think that usage of
local_bh_disable() is still something that has to happen in the general
lock_page_cgroup() function.  The softirq can occur at an arbitrary time
and processor with the possibility of interrupting anyone who does not
have interrupts or softirq disabled.  Therefore the softirq could
interrupt code that has used lock_page_cgroup(), unless
lock_page_cgroup() explicitly (as proposed) disables interrupts (or
softirq).  If (as you suggest) some calls to lock_page_cgroup() did not
disable softirq, then a deadlock is possible because the softirq may
interrupt the holder of the page cgroup spinlock and the softirq routine
that also wants the spinlock would spin forever.

Is there a preference between local_bh_disable() and local_irq_save()?
Currently the patch uses local_irq_save().  However, I think it could work

[Devel] Re: [PATCH 00/10] memcg: per cgroup dirty page accounting

2010-10-05 Thread Andrea Righi
On Sun, Oct 03, 2010 at 11:57:55PM -0700, Greg Thelen wrote:
> This patch set provides the ability for each cgroup to have independent dirty
> page limits.
> 
> Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
> page cache used by a cgroup.  So, in case of multiple cgroup writers, they
> will not be able to consume more than their designated share of dirty pages
> and will be forced to perform write-out if they cross that limit.
> 
> These patches were developed and tested on mmotm 2010-09-28-16-13.  The
> patches are based on a series proposed by Andrea Righi in Mar 2010.
> 
> Overview:
> - Add page_cgroup flags to record when pages are dirty, in writeback, or nfs
>   unstable.
> - Extend mem_cgroup to record the total number of pages in each of the 
>   interesting dirty states (dirty, writeback, unstable_nfs).  
> - Add dirty parameters similar to the system-wide  /proc/sys/vm/dirty_*
>   limits to mem_cgroup.  The mem_cgroup dirty parameters are accessible
>   via cgroupfs control files.
> - Consider both system and per-memcg dirty limits in page writeback when
>   deciding to queue background writeback or block for foreground writeback.
> 
> Known shortcomings:
> - When a cgroup dirty limit is exceeded, then bdi writeback is employed to
>   writeback dirty inodes.  Bdi writeback considers inodes from any cgroup, not
>   just inodes contributing dirty pages to the cgroup exceeding its limit.  
> 
> Performance measurements:
> - kernel builds are unaffected unless run with a small dirty limit.
> - all data collected with CONFIG_CGROUP_MEM_RES_CTLR=y.
> - dd has three data points (in secs) for three data sizes (100M, 200M, and
>   1G).  As expected, dd slows when it exceeds its cgroup dirty limit.
> 
>                kernel_build  dd
> mmotm          2:37          0.18, 0.38, 1.65
>   root_memcg
> 
> mmotm          2:37          0.18, 0.35, 1.66
>   non-root_memcg
> 
> mmotm+patches  2:37          0.18, 0.35, 1.68
>   root_memcg
> 
> mmotm+patches  2:37          0.19, 0.35, 1.69
>   non-root_memcg
> 
> mmotm+patches  2:37          0.19, 2.34, 22.82
>   non-root_memcg
>   150 MiB memcg dirty limit
> 
> mmotm+patches  3:58          1.71, 3.38, 17.33
>   non-root_memcg
>   1 MiB memcg dirty limit

Hi Greg,

the patchset seems to work fine on my box.

I also ran a pretty simple test to directly verify the effectiveness of
the dirty memory limit, using a dd running on a non-root memcg:

  dd if=/dev/zero of=tmpfile bs=1M count=512

and monitoring the max of the "dirty" value in cgroup/memory.stat:

Here are the results:
  dd in non-root memcg (  4 MiB memcg dirty limit): dirty max=4227072
  dd in non-root memcg (  8 MiB memcg dirty limit): dirty max=8454144
  dd in non-root memcg ( 16 MiB memcg dirty limit): dirty max=15179776
  dd in non-root memcg ( 32 MiB memcg dirty limit): dirty max=32235520
  dd in non-root memcg ( 64 MiB memcg dirty limit): dirty max=64245760
  dd in non-root memcg (128 MiB memcg dirty limit): dirty max=121028608
  dd in non-root memcg (256 MiB memcg dirty limit): dirty max=232865792
  dd in non-root memcg (512 MiB memcg dirty limit): dirty max=445194240

-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 05/10] memcg: add dirty page accounting infrastructure

2010-10-05 Thread Greg Thelen
Minchan Kim  writes:

> On Sun, Oct 03, 2010 at 11:58:00PM -0700, Greg Thelen wrote:
>> Add memcg routines to track dirty, writeback, and unstable_NFS pages.
>> These routines are not yet used by the kernel to count such pages.
>> A later change adds kernel calls to these new routines.
>> 
>> Signed-off-by: Greg Thelen 
>> Signed-off-by: Andrea Righi 
>> ---
>>  include/linux/memcontrol.h |3 +
>>  mm/memcontrol.c|   89 
>> 
>>  2 files changed, 84 insertions(+), 8 deletions(-)
>> 
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 7c7bec4..6303da1 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -28,6 +28,9 @@ struct mm_struct;
>>  /* Stats that can be updated by kernel. */
>>  enum mem_cgroup_write_page_stat_item {
>>  MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
>> +MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
>> +MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
>> +MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
>>  };
>>  
>>  extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 267d774..f40839f 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -85,10 +85,13 @@ enum mem_cgroup_stat_index {
>>   */
>>  MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
>>  MEM_CGROUP_STAT_RSS,   /* # of pages charged as anon rss */
>> -MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>>  MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
>>  MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
>>  MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
>> +MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>> +MEM_CGROUP_STAT_FILE_DIRTY, /* # of dirty pages in page cache */
>> +MEM_CGROUP_STAT_FILE_WRITEBACK, /* # of pages under writeback */
>> +MEM_CGROUP_STAT_FILE_UNSTABLE_NFS,  /* # of NFS unstable pages */
>>  MEM_CGROUP_STAT_DATA, /* end of data requires synchronization */
>>  /* incremented at every  pagein/pageout */
>>  MEM_CGROUP_EVENTS = MEM_CGROUP_STAT_DATA,
>> @@ -1626,6 +1629,48 @@ void mem_cgroup_update_page_stat(struct page *page,
>>  ClearPageCgroupFileMapped(pc);
>>  idx = MEM_CGROUP_STAT_FILE_MAPPED;
>>  break;
>> +
>> +case MEMCG_NR_FILE_DIRTY:
>> +/* Use Test{Set,Clear} to only un/charge the memcg once. */
>> +if (val > 0) {
>> +if (TestSetPageCgroupFileDirty(pc))
>> +/* already set */
>
> Nitpick. 
> The comment doesn't give any useful information.
> It looks redundant.

I agree.  I removed the four redundant comments you referred to.  Thanks
for the feedback.

>> +val = 0;
>> +} else {
>> +if (!TestClearPageCgroupFileDirty(pc))
>> +/* already cleared */
>
> Ditto
>
>> +val = 0;
>> +}
>> +idx = MEM_CGROUP_STAT_FILE_DIRTY;
>> +break;
>> +
>> +case MEMCG_NR_FILE_WRITEBACK:
>> +/*
>> + * This counter is adjusted while holding the mapping's
>> + * tree_lock.  Therefore there is no race between settings and
>> + * clearing of this flag.
>> + */
>> +if (val > 0)
>> +SetPageCgroupFileWriteback(pc);
>> +else
>> +ClearPageCgroupFileWriteback(pc);
>> +idx = MEM_CGROUP_STAT_FILE_WRITEBACK;
>> +break;
>> +
>> +case MEMCG_NR_FILE_UNSTABLE_NFS:
>> +/* Use Test{Set,Clear} to only un/charge the memcg once. */
>> +if (val > 0) {
>> +if (TestSetPageCgroupFileUnstableNFS(pc))
>> +/* already set */
>
> Ditto 
>
>> +val = 0;
>> +} else {
>> +if (!TestClearPageCgroupFileUnstableNFS(pc))
>> +/* already cleared */
>
> Ditto 
>
>> +val = 0;
>> +}
>> +idx = MEM_CGROUP_STAT_FILE_UNSTABLE_NFS;
>> +break;
>> +
>>  default:
>>  BUG();
>>  }
>> @@ -2133,6 +2178,16 @@ static void __mem_cgroup_commit_charge(struct 
>> mem_cgroup *mem,
>>  memcg_check_events(mem, pc->page);
>>  }
>>  
>> +static void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
>> +  struct mem_cgroup *to,
>> +  enum mem_cgroup_stat_index idx)
>> +{
>> +preempt_disable();
>> +__this_cpu_dec(from->stat->count[idx]);
>> +__this_cpu_inc(to->stat->count[idx]);
>> +preempt_enable();
>> +}
>>

[Devel] Re: [PATCH 03/10] memcg: create extensible page stat update routines

2010-10-05 Thread Greg Thelen
Minchan Kim  writes:

> On Sun, Oct 03, 2010 at 11:57:58PM -0700, Greg Thelen wrote:
>> Replace usage of the mem_cgroup_update_file_mapped() memcg
>> statistic update routine with two new routines:
>> * mem_cgroup_inc_page_stat()
>> * mem_cgroup_dec_page_stat()
>> 
>> As before, only the file_mapped statistic is managed.  However,
>> these more general interfaces allow for new statistics to be
>> more easily added.  New statistics are added with memcg dirty
>> page accounting.
>> 
>> Signed-off-by: Greg Thelen 
>> Signed-off-by: Andrea Righi 
>> ---
>>  include/linux/memcontrol.h |   31 ---
>>  mm/memcontrol.c|   17 -
>>  mm/rmap.c  |4 ++--
>>  3 files changed, 38 insertions(+), 14 deletions(-)
>> 
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 159a076..7c7bec4 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -25,6 +25,11 @@ struct page_cgroup;
>>  struct page;
>>  struct mm_struct;
>>  
>> +/* Stats that can be updated by kernel. */
>> +enum mem_cgroup_write_page_stat_item {
>> +MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
>> +};
>> +
>
> mem_cgroup_"write"_page_stat_item?
> Does "write" make sense to abstract page_state generally?

First I will summarize the portion of the design relevant to this
comment:

This patch series introduces two sets of memcg statistics.
a) the writable set of statistics the kernel updates when pages change
   state (example: when a page becomes dirty) using:
 mem_cgroup_inc_page_stat(struct page *page,
enum mem_cgroup_write_page_stat_item idx)
 mem_cgroup_dec_page_stat(struct page *page,
enum mem_cgroup_write_page_stat_item idx)

b) the read-only set of statistics the kernel queries to measure the
   amount of dirty memory used by the current cgroup using:
 s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item)

   This read-only set of statistics is a set of higher-level conceptual
   counters.  For example, MEMCG_NR_DIRTYABLE_PAGES is the sum of the
   counts of pages in various states (active + inactive).  mem_cgroup
   exports this value as a higher-level counter rather than as individual
   counters (active & inactive) to minimize the number of calls into
   mem_cgroup_page_stat().  This avoids extra cgroup tree iterations via
   for_each_mem_cgroup_tree().

Notice that each of the two sets of statistics are addressed by a
different type, mem_cgroup_{read vs write}_page_stat_item.

This particular patch (memcg: create extensible page stat update
routines) introduces part of this design.  A later patch I emailed
(memcg: add dirty limits to mem_cgroup) added
mem_cgroup_read_page_stat_item.


I think the code would read better if I renamed 
enum mem_cgroup_write_page_stat_item to 
enum mem_cgroup_update_page_stat_item.

Would this address your concern?

--
Greg


[Devel] Re: [PATCH 1/1] cgroups: strcpy destination string overflow

2010-10-05 Thread Paul Menage
On Tue, Oct 5, 2010 at 12:48 PM, Andrew Morton
 wrote:
> On Tue,  5 Oct 2010 12:38:05 +0400
> Evgeny Kuznetsov  wrote:
>
>> From: Evgeny Kuznetsov 
>>
>> Function "strcpy" is used without a check for the maximum allowed source
>> string length and could cause destination string overflow.
>> A check for the string length is added before using "strcpy".
>> The function now returns an error if the source string length exceeds
>> the maximum.
>>
>> Signed-off-by: Evgeny Kuznetsov 
>> ---
>>  kernel/cgroup.c |    2 ++
>>  1 files changed, 2 insertions(+), 0 deletions(-)
>>
>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>> index c9483d8..82bbede 100644
>> --- a/kernel/cgroup.c
>> +++ b/kernel/cgroup.c
>> @@ -1883,6 +1883,8 @@ static int cgroup_release_agent_write(struct cgroup 
>> *cgrp, struct cftype *cft,
>>                                     const char *buffer)
>>  {
>>       BUILD_BUG_ON(sizeof(cgrp->root->release_agent_path) < PATH_MAX);
>> +     if (strlen(buffer) >= PATH_MAX)
>> +             return -EINVAL;
>>       if (!cgroup_lock_live_group(cgrp))
>>               return -ENODEV;
>>       strcpy(cgrp->root->release_agent_path, buffer);
>
> I don't think this can happen, because cftype.max_write_len is
> PATH_MAX.

Yes, it shouldn't be possible.

>
> But it's pretty unobvious whether this is actually true, and the code is
> fragile against future changes.

Fair enough - adding the check doesn't hurt anything.

Acked-by: Paul Menage 


[Devel] Re: [PATCH 1/1] cgroups: strcpy destination string overflow

2010-10-05 Thread Andrew Morton
On Tue,  5 Oct 2010 12:38:05 +0400
Evgeny Kuznetsov  wrote:

> From: Evgeny Kuznetsov 
> 
> Function "strcpy" is used without a check for the maximum allowed source
> string length and could cause destination string overflow.
> A check for the string length is added before using "strcpy".
> The function now returns an error if the source string length exceeds
> the maximum.
> 
> Signed-off-by: Evgeny Kuznetsov 
> ---
>  kernel/cgroup.c |2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index c9483d8..82bbede 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1883,6 +1883,8 @@ static int cgroup_release_agent_write(struct cgroup 
> *cgrp, struct cftype *cft,
> const char *buffer)
>  {
>   BUILD_BUG_ON(sizeof(cgrp->root->release_agent_path) < PATH_MAX);
> + if (strlen(buffer) >= PATH_MAX)
> + return -EINVAL;
>   if (!cgroup_lock_live_group(cgrp))
>   return -ENODEV;
>   strcpy(cgrp->root->release_agent_path, buffer);

I don't think this can happen, because cftype.max_write_len is
PATH_MAX.

But it's pretty unobvious whether this is actually true, and the code is
fragile against future changes.



[Devel] Re: [PATCH 07/10] memcg: add dirty limits to mem_cgroup

2010-10-05 Thread Greg Thelen
Andrea Righi  writes:

> On Sun, Oct 03, 2010 at 11:58:02PM -0700, Greg Thelen wrote:
>> Extend mem_cgroup to contain dirty page limits.  Also add routines
>> allowing the kernel to query the dirty usage of a memcg.
>> 
>> These interfaces are not used by the kernel yet.  A subsequent commit
>> will add kernel calls to utilize these new routines.
>
> A small note below.
>
>> 
>> Signed-off-by: Greg Thelen 
>> Signed-off-by: Andrea Righi 
>> ---
>>  include/linux/memcontrol.h |   44 +++
>>  mm/memcontrol.c|  180 
>> +++-
>>  2 files changed, 223 insertions(+), 1 deletions(-)
>> 
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 6303da1..dc8952d 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -19,6 +19,7 @@
>>  
>>  #ifndef _LINUX_MEMCONTROL_H
>>  #define _LINUX_MEMCONTROL_H
>> +#include 
>>  #include 
>>  struct mem_cgroup;
>>  struct page_cgroup;
>> @@ -33,6 +34,30 @@ enum mem_cgroup_write_page_stat_item {
>>  MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
>>  };
>>  
>> +/* Cgroup memory statistics items exported to the kernel */
>> +enum mem_cgroup_read_page_stat_item {
>> +MEMCG_NR_DIRTYABLE_PAGES,
>> +MEMCG_NR_RECLAIM_PAGES,
>> +MEMCG_NR_WRITEBACK,
>> +MEMCG_NR_DIRTY_WRITEBACK_PAGES,
>> +};
>> +
>> +/* Dirty memory parameters */
>> +struct vm_dirty_param {
>> +int dirty_ratio;
>> +int dirty_background_ratio;
>> +unsigned long dirty_bytes;
>> +unsigned long dirty_background_bytes;
>> +};
>> +
>> +static inline void get_global_vm_dirty_param(struct vm_dirty_param *param)
>> +{
>> +param->dirty_ratio = vm_dirty_ratio;
>> +param->dirty_bytes = vm_dirty_bytes;
>> +param->dirty_background_ratio = dirty_background_ratio;
>> +param->dirty_background_bytes = dirty_background_bytes;
>> +}
>> +
>>  extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>>  struct list_head *dst,
>>  unsigned long *scanned, int order,
>> @@ -145,6 +170,10 @@ static inline void mem_cgroup_dec_page_stat(struct page 
>> *page,
>>  mem_cgroup_update_page_stat(page, idx, -1);
>>  }
>>  
>> +bool mem_cgroup_has_dirty_limit(void);
>> +void get_vm_dirty_param(struct vm_dirty_param *param);
>> +s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item);
>> +
>>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>>  gfp_t gfp_mask);
>>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
>> @@ -326,6 +355,21 @@ static inline void mem_cgroup_dec_page_stat(struct page 
>> *page,
>>  {
>>  }
>>  
>> +static inline bool mem_cgroup_has_dirty_limit(void)
>> +{
>> +return false;
>> +}
>> +
>> +static inline void get_vm_dirty_param(struct vm_dirty_param *param)
>> +{
>> +get_global_vm_dirty_param(param);
>> +}
>> +
>> +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item 
>> item)
>> +{
>> +return -ENOSYS;
>> +}
>> +
>>  static inline
>>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>>  gfp_t gfp_mask)
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index f40839f..6ec2625 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -233,6 +233,10 @@ struct mem_cgroup {
>>  atomic_trefcnt;
>>  
>>  unsigned intswappiness;
>> +
>> +/* control memory cgroup dirty pages */
>> +struct vm_dirty_param dirty_param;
>> +
>>  /* OOM-Killer disable */
>>  int oom_kill_disable;
>>  
>> @@ -1132,6 +1136,172 @@ static unsigned int get_swappiness(struct mem_cgroup 
>> *memcg)
>>  return swappiness;
>>  }
>>  
>> +/*
>> + * Returns a snapshot of the current dirty limits which is not synchronized 
>> with
>> + * the routines that change the dirty limits.  If this routine races with an
>> + * update to the dirty bytes/ratio value, then the caller must handle the 
>> case
>> + * where both dirty_[background_]_ratio and _bytes are set.
>> + */
>> +static void __mem_cgroup_get_dirty_param(struct vm_dirty_param *param,
>> + struct mem_cgroup *mem)
>> +{
>> +if (mem && !mem_cgroup_is_root(mem)) {
>> +param->dirty_ratio = mem->dirty_param.dirty_ratio;
>> +param->dirty_bytes = mem->dirty_param.dirty_bytes;
>> +param->dirty_background_ratio =
>> +mem->dirty_param.dirty_background_ratio;
>> +param->dirty_background_bytes =
>> +mem->dirty_param.dirty_background_bytes;
>> +} else {
>> +get_global_vm_dirty_param(param);
>> +}
>> +}
>> +
>> +/*
>> + * Get dirty memory parameters of the current memcg or global values (if 
>> memory
>> + * cgroups are disabled or querying the root cgroup).
>> + */
>> +void get_vm_dirty_param(stru

[Devel] Re: [PATCH 08/10] memcg: add cgroupfs interface to memcg dirty limits

2010-10-05 Thread David Rientjes
On Tue, 5 Oct 2010, Andrea Righi wrote:

> mmh... looking at the code it seems the same behaviour, but in
> Documentation/sysctl/vm.txt we say a different thing (i.e., for
> dirty_bytes):
> 
> "If dirty_bytes is written, dirty_ratio becomes a function of its value
> (dirty_bytes / the amount of dirtyable system memory)."
> 
> However, in dirty_bytes_handler()/dirty_ratio_handler() we actually set
> the counterpart value to 0.
> 
> I think we should clarify the documentation.
> 
> Signed-off-by: Andrea Righi 

Acked-by: David Rientjes 

Thanks for cc'ing me on this, Andrea.


[Devel] [RFC V1] Replace pid_t in autofs4 with struct pid reference.

2010-10-05 Thread Daniel Lezcano
I resurrected and refreshed this old patch from
https://lists.linux-foundation.org/pipermail/containers/2007-February/003726.html

This patch makes automount work within a container.

Make autofs4 container-friendly by caching a struct pid reference rather
than a pid_t and using pid_nr() to retrieve a task's pid_t.

ChangeLog:

V1:
- fixed pgrp option in parse_options
- used get_task_pid(current, PIDTYPE_PGID) instead of task_pgrp
- fixed how the 'pgrp' argument is passed to autofs4_fill_super
- fixed bad pid conversion, was pid_vnr not pid_nr in autofs4_wait
V0:
- Refreshed against linux-next (added dev-ioctl.c)
- Fix Eric Biederman's comments - Use find_get_pid() to hold a
  reference to oz_pgrp and release while unmounting; separate out
  changes to autofs and autofs4.
- Also rollback my earlier change to autofs_wait_queue (pid and tgid
  in the wait queue are just used to write to a userspace daemon's
  pipe).
- Fix Cedric's comments: retain old prototype of parse_options()
  and move necessary change to its caller.

Signed-off-by: Sukadev Bhattiprolu 
Signed-off-by: Daniel Lezcano 
Cc: Ian Kent 
Cc: Cedric Le Goater 
Cc: Dave Hansen 
Cc: Serge E. Hallyn 
Cc: Eric Biederman 
Cc: Helmut Lichtenberg 
---
 fs/autofs4/autofs_i.h  |   31 +--
 fs/autofs4/dev-ioctl.c |   32 +++-
 fs/autofs4/inode.c |   35 +--
 fs/autofs4/root.c  |3 ++-
 fs/autofs4/waitq.c |4 ++--
 5 files changed, 65 insertions(+), 40 deletions(-)

diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
index 3d283ab..bf2f9b1 100644
--- a/fs/autofs4/autofs_i.h
+++ b/fs/autofs4/autofs_i.h
@@ -39,25 +39,25 @@
 /* #define DEBUG */
 
 #ifdef DEBUG
-#define DPRINTK(fmt, args...)  \
-do {   \
-   printk(KERN_DEBUG "pid %d: %s: " fmt "\n",  \
-   current->pid, __func__, ##args);\
+#define DPRINTK(fmt, args...)   \
+do {\
+   printk(KERN_DEBUG "pid %d: %s: " fmt "\n",   \
+  pid_nr(task_pid(current)), __func__, ##args); \
 } while (0)
 #else
 #define DPRINTK(fmt, args...) do {} while (0)
 #endif
 
-#define AUTOFS_WARN(fmt, args...)  \
-do {   \
-   printk(KERN_WARNING "pid %d: %s: " fmt "\n",\
-   current->pid, __func__, ##args);\
+#define AUTOFS_WARN(fmt, args...)   \
+do {\
+   printk(KERN_WARNING "pid %d: %s: " fmt "\n", \
+  pid_nr(task_pid(current)), __func__, ##args); \
 } while (0)
 
-#define AUTOFS_ERROR(fmt, args...) \
-do {   \
-   printk(KERN_ERR "pid %d: %s: " fmt "\n",\
-   current->pid, __func__, ##args);\
+#define AUTOFS_ERROR(fmt, args...)  \
+do {\
+   printk(KERN_ERR "pid %d: %s: " fmt "\n", \
+  pid_nr(task_pid(current)), __func__, ##args); \
 } while (0)
 
 /* Unified info structure.  This is pointed to by both the dentry and
@@ -122,7 +122,7 @@ struct autofs_sb_info {
u32 magic;
int pipefd;
struct file *pipe;
-   pid_t oz_pgrp;
+   struct pid *oz_pgrp;
int catatonic;
int version;
int sub_version;
@@ -156,7 +156,10 @@ static inline struct autofs_info 
*autofs4_dentry_ino(struct dentry *dentry)
filesystem without "magic".) */
 
 static inline int autofs4_oz_mode(struct autofs_sb_info *sbi) {
-   return sbi->catatonic || task_pgrp_nr(current) == sbi->oz_pgrp;
+   struct pid *pgrp = get_task_pid(current, PIDTYPE_PGID);
+   bool oz_mode = sbi->catatonic || sbi->oz_pgrp == pgrp;
+   put_pid(pgrp);
+   return oz_mode;
 }
 
 /* Does a dentry have some pending activity? */
diff --git a/fs/autofs4/dev-ioctl.c b/fs/autofs4/dev-ioctl.c
index eff9a41..7db3b73 100644
--- a/fs/autofs4/dev-ioctl.c
+++ b/fs/autofs4/dev-ioctl.c
@@ -360,6 +360,7 @@ static int autofs_dev_ioctl_setpipefd(struct file *fp,
 {
int pipefd;
int err = 0;
+   struct file *pipe;
 
if (param->setpipefd.pipefd == -1)
return -EINVAL;
@@ -368,22 +369,27 @@ static int autofs_dev_ioctl_setpipefd(struct file *fp,
 
mutex_lock(&sbi->wq_mutex);
if (!sbi->catatonic) {
-   mutex_unlock(&sbi->wq_mutex);
-   return -EBUSY;
-   } else {
-   struct file *pipe = fget(pipefd);
-   if (!pipe->f_op || !pipe->f_op->write) {
-   err = -EPIPE;
-   fput(pipe);
-

[Devel] Re: [PATCH 05/10] memcg: add dirty page accounting infrastructure

2010-10-05 Thread Minchan Kim
On Sun, Oct 03, 2010 at 11:58:00PM -0700, Greg Thelen wrote:
> Add memcg routines to track dirty, writeback, and unstable_NFS pages.
> These routines are not yet used by the kernel to count such pages.
> A later change adds kernel calls to these new routines.
> 
> Signed-off-by: Greg Thelen 
> Signed-off-by: Andrea Righi 
> ---
>  include/linux/memcontrol.h |3 +
>  mm/memcontrol.c|   89 
> 
>  2 files changed, 84 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 7c7bec4..6303da1 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -28,6 +28,9 @@ struct mm_struct;
>  /* Stats that can be updated by kernel. */
>  enum mem_cgroup_write_page_stat_item {
>   MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
> + MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
> + MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
> + MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
>  };
>  
>  extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 267d774..f40839f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -85,10 +85,13 @@ enum mem_cgroup_stat_index {
>*/
>   MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
>   MEM_CGROUP_STAT_RSS,   /* # of pages charged as anon rss */
> - MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>   MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
>   MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
>   MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> + MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> + MEM_CGROUP_STAT_FILE_DIRTY, /* # of dirty pages in page cache */
> + MEM_CGROUP_STAT_FILE_WRITEBACK, /* # of pages under writeback */
> + MEM_CGROUP_STAT_FILE_UNSTABLE_NFS,  /* # of NFS unstable pages */
>   MEM_CGROUP_STAT_DATA, /* end of data requires synchronization */
>   /* incremented at every  pagein/pageout */
>   MEM_CGROUP_EVENTS = MEM_CGROUP_STAT_DATA,
> @@ -1626,6 +1629,48 @@ void mem_cgroup_update_page_stat(struct page *page,
>   ClearPageCgroupFileMapped(pc);
>   idx = MEM_CGROUP_STAT_FILE_MAPPED;
>   break;
> +
> + case MEMCG_NR_FILE_DIRTY:
> + /* Use Test{Set,Clear} to only un/charge the memcg once. */
> + if (val > 0) {
> + if (TestSetPageCgroupFileDirty(pc))
> + /* already set */

Nitpick. 
The comment doesn't give any useful information.
It looks like redundant. 

> + val = 0;
> + } else {
> + if (!TestClearPageCgroupFileDirty(pc))
> + /* already cleared */

Ditto

> + val = 0;
> + }
> + idx = MEM_CGROUP_STAT_FILE_DIRTY;
> + break;
> +
> + case MEMCG_NR_FILE_WRITEBACK:
> + /*
> +  * This counter is adjusted while holding the mapping's
> +  * tree_lock.  Therefore there is no race between settings and
> +  * clearing of this flag.
> +  */
> + if (val > 0)
> + SetPageCgroupFileWriteback(pc);
> + else
> + ClearPageCgroupFileWriteback(pc);
> + idx = MEM_CGROUP_STAT_FILE_WRITEBACK;
> + break;
> +
> + case MEMCG_NR_FILE_UNSTABLE_NFS:
> + /* Use Test{Set,Clear} to only un/charge the memcg once. */
> + if (val > 0) {
> + if (TestSetPageCgroupFileUnstableNFS(pc))
> + /* already set */

Ditto 

> + val = 0;
> + } else {
> + if (!TestClearPageCgroupFileUnstableNFS(pc))
> + /* already cleared */

Ditto 

> + val = 0;
> + }
> + idx = MEM_CGROUP_STAT_FILE_UNSTABLE_NFS;
> + break;
> +
>   default:
>   BUG();
>   }
> @@ -2133,6 +2178,16 @@ static void __mem_cgroup_commit_charge(struct 
> mem_cgroup *mem,
>   memcg_check_events(mem, pc->page);
>  }
>  
> +static void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
> +   struct mem_cgroup *to,
> +   enum mem_cgroup_stat_index idx)
> +{
> + preempt_disable();
> + __this_cpu_dec(from->stat->count[idx]);
> + __this_cpu_inc(to->stat->count[idx]);
> + preempt_enable();
> +}
> +
>  /**
>   * __mem_cgroup_move_account - move account of the page
>   * @pc:  page_cgroup of the page.
> @@ -2159,13 +2214,18 @@ static void __mem_cgroup_move_account(s

[Devel] Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()

2010-10-05 Thread Minchan Kim
On Sun, Oct 03, 2010 at 11:57:59PM -0700, Greg Thelen wrote:
> If pages are being migrated from a memcg, then updates to that
> memcg's page statistics are protected by grabbing a bit spin lock
> using lock_page_cgroup().  In an upcoming commit memcg dirty page
> accounting will be updating memcg page accounting (specifically:
> num writeback pages) from softirq.  Avoid a deadlocking nested
> spin lock attempt by disabling interrupts on the local processor
> when grabbing the page_cgroup bit_spin_lock in lock_page_cgroup().
> This avoids the following deadlock:
> statistic
>   CPU 0 CPU 1
> inc_file_mapped
> rcu_read_lock
>   start move
>   synchronize_rcu
> lock_page_cgroup
>   softirq
>   test_clear_page_writeback
>   mem_cgroup_dec_page_stat(NR_WRITEBACK)
>   rcu_read_lock
>   lock_page_cgroup   /* deadlock */
>   unlock_page_cgroup
>   rcu_read_unlock
> unlock_page_cgroup
> rcu_read_unlock
> 
> By disabling interrupts in lock_page_cgroup, nested calls
> are avoided.  The softirq would be delayed until after inc_file_mapped
> enables interrupts when calling unlock_page_cgroup().
> 
> The normal, fast path, of memcg page stat updates typically
> does not need to call lock_page_cgroup(), so this change does
> not affect the performance of the common case page accounting.
> 
> Signed-off-by: Andrea Righi 
> Signed-off-by: Greg Thelen 
> ---
>  include/linux/page_cgroup.h |8 +-
>  mm/memcontrol.c |   51 +-
>  2 files changed, 36 insertions(+), 23 deletions(-)
> 
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index b59c298..872f6b1 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -117,14 +117,18 @@ static inline enum zone_type page_cgroup_zid(struct 
> page_cgroup *pc)
>   return page_zonenum(pc->page);
>  }
>  
> -static inline void lock_page_cgroup(struct page_cgroup *pc)
> +static inline void lock_page_cgroup(struct page_cgroup *pc,
> + unsigned long *flags)
>  {
> + local_irq_save(*flags);
>   bit_spin_lock(PCG_LOCK, &pc->flags);
>  }

Hmm. Let me ask some questions.

1. Why do you add a new irq disable region in a general function?
I think __do_fault is one of the fast paths.

Could you disable softirq using _local_bh_disable_ in your context rather
than in the general function?
How many users do you expect will need the irq lock to update page state,
and how many won't need to disable irq at all?

We could pass an argument indicating whether the irq lock is needed,
but that seems to make the code very ugly.

2. Could you solve the problem in your design?
I mean, could you update the page state outside of softirq?
(I didn't look at all of your patches. Sorry if I am missing something.)

3. Normally, we have updated page state without disabling irq.
Why does memcg need it?

I hope we don't add irq disable regions, as far as possible.

-- 
Kind regards,
Minchan Kim


[Devel] Re: [PATCH 03/10] memcg: create extensible page stat update routines

2010-10-05 Thread Minchan Kim
On Sun, Oct 03, 2010 at 11:57:58PM -0700, Greg Thelen wrote:
> Replace usage of the mem_cgroup_update_file_mapped() memcg
> statistic update routine with two new routines:
> * mem_cgroup_inc_page_stat()
> * mem_cgroup_dec_page_stat()
> 
> As before, only the file_mapped statistic is managed.  However,
> these more general interfaces allow for new statistics to be
> more easily added.  New statistics are added with memcg dirty
> page accounting.
> 
> Signed-off-by: Greg Thelen 
> Signed-off-by: Andrea Righi 
> ---
>  include/linux/memcontrol.h |   31 ---
>  mm/memcontrol.c|   17 -
>  mm/rmap.c  |4 ++--
>  3 files changed, 38 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 159a076..7c7bec4 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -25,6 +25,11 @@ struct page_cgroup;
>  struct page;
>  struct mm_struct;
>  
> +/* Stats that can be updated by kernel. */
> +enum mem_cgroup_write_page_stat_item {
> + MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
> +};
> +

mem_cgroup_"write"_page_stat_item?
Does "write" make sense to abstract page_state generally?

-- 
Kind regards,
Minchan Kim


[Devel] Re: [PATCH 07/10] memcg: add dirty limits to mem_cgroup

2010-10-05 Thread Andrea Righi
On Sun, Oct 03, 2010 at 11:58:02PM -0700, Greg Thelen wrote:
> Extend mem_cgroup to contain dirty page limits.  Also add routines
> allowing the kernel to query the dirty usage of a memcg.
> 
> These interfaces are not used by the kernel yet.  A subsequent commit
> will add kernel calls to utilize these new routines.

A small note below.

> 
> Signed-off-by: Greg Thelen 
> Signed-off-by: Andrea Righi 
> ---
>  include/linux/memcontrol.h |   44 +++
>  mm/memcontrol.c|  180 
> +++-
>  2 files changed, 223 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 6303da1..dc8952d 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -19,6 +19,7 @@
>  
>  #ifndef _LINUX_MEMCONTROL_H
>  #define _LINUX_MEMCONTROL_H
> +#include 
>  #include 
>  struct mem_cgroup;
>  struct page_cgroup;
> @@ -33,6 +34,30 @@ enum mem_cgroup_write_page_stat_item {
>   MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
>  };
>  
> +/* Cgroup memory statistics items exported to the kernel */
> +enum mem_cgroup_read_page_stat_item {
> + MEMCG_NR_DIRTYABLE_PAGES,
> + MEMCG_NR_RECLAIM_PAGES,
> + MEMCG_NR_WRITEBACK,
> + MEMCG_NR_DIRTY_WRITEBACK_PAGES,
> +};
> +
> +/* Dirty memory parameters */
> +struct vm_dirty_param {
> + int dirty_ratio;
> + int dirty_background_ratio;
> + unsigned long dirty_bytes;
> + unsigned long dirty_background_bytes;
> +};
> +
> +static inline void get_global_vm_dirty_param(struct vm_dirty_param *param)
> +{
> + param->dirty_ratio = vm_dirty_ratio;
> + param->dirty_bytes = vm_dirty_bytes;
> + param->dirty_background_ratio = dirty_background_ratio;
> + param->dirty_background_bytes = dirty_background_bytes;
> +}
> +
>  extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>   struct list_head *dst,
>   unsigned long *scanned, int order,
> @@ -145,6 +170,10 @@ static inline void mem_cgroup_dec_page_stat(struct page 
> *page,
>   mem_cgroup_update_page_stat(page, idx, -1);
>  }
>  
> +bool mem_cgroup_has_dirty_limit(void);
> +void get_vm_dirty_param(struct vm_dirty_param *param);
> +s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item);
> +
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>   gfp_t gfp_mask);
>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
> @@ -326,6 +355,21 @@ static inline void mem_cgroup_dec_page_stat(struct page 
> *page,
>  {
>  }
>  
> +static inline bool mem_cgroup_has_dirty_limit(void)
> +{
> + return false;
> +}
> +
> +static inline void get_vm_dirty_param(struct vm_dirty_param *param)
> +{
> + get_global_vm_dirty_param(param);
> +}
> +
> +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item 
> item)
> +{
> + return -ENOSYS;
> +}
> +
>  static inline
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>   gfp_t gfp_mask)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f40839f..6ec2625 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -233,6 +233,10 @@ struct mem_cgroup {
>   atomic_trefcnt;
>  
>   unsigned intswappiness;
> +
> + /* control memory cgroup dirty pages */
> + struct vm_dirty_param dirty_param;
> +
>   /* OOM-Killer disable */
>   int oom_kill_disable;
>  
> @@ -1132,6 +1136,172 @@ static unsigned int get_swappiness(struct mem_cgroup 
> *memcg)
>   return swappiness;
>  }
>  
> +/*
> + * Returns a snapshot of the current dirty limits which is not synchronized 
> with
> + * the routines that change the dirty limits.  If this routine races with an
> + * update to the dirty bytes/ratio value, then the caller must handle the 
> case
> + * where both dirty_[background_]_ratio and _bytes are set.
> + */
> +static void __mem_cgroup_get_dirty_param(struct vm_dirty_param *param,
> +  struct mem_cgroup *mem)
> +{
> + if (mem && !mem_cgroup_is_root(mem)) {
> + param->dirty_ratio = mem->dirty_param.dirty_ratio;
> + param->dirty_bytes = mem->dirty_param.dirty_bytes;
> + param->dirty_background_ratio =
> + mem->dirty_param.dirty_background_ratio;
> + param->dirty_background_bytes =
> + mem->dirty_param.dirty_background_bytes;
> + } else {
> + get_global_vm_dirty_param(param);
> + }
> +}
> +
> +/*
> + * Get dirty memory parameters of the current memcg or global values (if 
> memory
> + * cgroups are disabled or querying the root cgroup).
> + */
> +void get_vm_dirty_param(struct vm_dirty_param *param)
> +{
> + struct mem_cgroup *memcg;
> +
> + if (mem_cgroup_disabled()) {
> + get_global

[Devel] Re: [PATCH 08/10] memcg: add cgroupfs interface to memcg dirty limits

2010-10-05 Thread Andrea Righi
On Tue, Oct 05, 2010 at 12:33:15AM -0700, Greg Thelen wrote:
> KAMEZAWA Hiroyuki  writes:
> 
> > On Sun,  3 Oct 2010 23:58:03 -0700
> > Greg Thelen  wrote:
> >
> >> Add cgroupfs interface to memcg dirty page limits:
> >>   Direct write-out is controlled with:
> >>   - memory.dirty_ratio
> >>   - memory.dirty_bytes
> >> 
> >>   Background write-out is controlled with:
> >>   - memory.dirty_background_ratio
> >>   - memory.dirty_background_bytes
> >> 
> >> Signed-off-by: Andrea Righi 
> >> Signed-off-by: Greg Thelen 
> >
> > Acked-by: KAMEZAWA Hiroyuki 
> >
> > a question below.
> >
> >
> >> ---
> >>  mm/memcontrol.c |   89 
> >> +++
> >>  1 files changed, 89 insertions(+), 0 deletions(-)
> >> 
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index 6ec2625..2d45a0a 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -100,6 +100,13 @@ enum mem_cgroup_stat_index {
> >>MEM_CGROUP_STAT_NSTATS,
> >>  };
> >>  
> >> +enum {
> >> +  MEM_CGROUP_DIRTY_RATIO,
> >> +  MEM_CGROUP_DIRTY_BYTES,
> >> +  MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> >> +  MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> >> +};
> >> +
> >>  struct mem_cgroup_stat_cpu {
> >>s64 count[MEM_CGROUP_STAT_NSTATS];
> >>  };
> >> @@ -4292,6 +4299,64 @@ static int mem_cgroup_oom_control_write(struct 
> >> cgroup *cgrp,
> >>return 0;
> >>  }
> >>  
> >> +static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
> >> +{
> >> +  struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> >> +  bool root;
> >> +
> >> +  root = mem_cgroup_is_root(mem);
> >> +
> >> +  switch (cft->private) {
> >> +  case MEM_CGROUP_DIRTY_RATIO:
> >> +  return root ? vm_dirty_ratio : mem->dirty_param.dirty_ratio;
> >> +  case MEM_CGROUP_DIRTY_BYTES:
> >> +  return root ? vm_dirty_bytes : mem->dirty_param.dirty_bytes;
> >> +  case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> >> +  return root ? dirty_background_ratio :
> >> +  mem->dirty_param.dirty_background_ratio;
> >> +  case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> >> +  return root ? dirty_background_bytes :
> >> +  mem->dirty_param.dirty_background_bytes;
> >> +  default:
> >> +  BUG();
> >> +  }
> >> +}
> >> +
> >> +static int
> >> +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
> >> +{
> >> +  struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> >> +  int type = cft->private;
> >> +
> >> +  if (cgrp->parent == NULL)
> >> +  return -EINVAL;
> >> +  if ((type == MEM_CGROUP_DIRTY_RATIO ||
> >> +   type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
> >> +  return -EINVAL;
> >> +  switch (type) {
> >> +  case MEM_CGROUP_DIRTY_RATIO:
> >> +  memcg->dirty_param.dirty_ratio = val;
> >> +  memcg->dirty_param.dirty_bytes = 0;
> >> +  break;
> >> +  case MEM_CGROUP_DIRTY_BYTES:
> >> +  memcg->dirty_param.dirty_bytes = val;
> >> +  memcg->dirty_param.dirty_ratio  = 0;
> >> +  break;
> >> +  case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> >> +  memcg->dirty_param.dirty_background_ratio = val;
> >> +  memcg->dirty_param.dirty_background_bytes = 0;
> >> +  break;
> >> +  case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> >> +  memcg->dirty_param.dirty_background_bytes = val;
> >> +  memcg->dirty_param.dirty_background_ratio = 0;
> >> +  break;
> >
> >
> > Curious... is this the same behavior as vm_dirty_ratio ?
> 
> I think this is the same behavior as vm_dirty_ratio.  When vm_dirty_ratio is
> changed then dirty_ratio_handler() will set vm_dirty_bytes=0.  When
> vm_dirty_bytes is written dirty_bytes_handler() will set
> vm_dirty_ratio=0.  So I think that the per-memcg dirty memory parameters
> mimic the behavior of vm_dirty_ratio, vm_dirty_bytes and the other
> global dirty parameters.
> 
> Am I missing your question?

mmh... looking at the code it does seem to be the same behaviour, but in
Documentation/sysctl/vm.txt we say something different (e.g., for
dirty_bytes):

"If dirty_bytes is written, dirty_ratio becomes a function of its value
(dirty_bytes / the amount of dirtyable system memory)."

However, in dirty_bytes_handler()/dirty_ratio_handler() we actually set
the counterpart value as 0.

I think we should clarify the documentation.

Signed-off-by: Andrea Righi 
---
 Documentation/sysctl/vm.txt |   12 
 1 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index b606c2c..30289fa 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -80,8 +80,10 @@ dirty_background_bytes
 Contains the amount of dirty memory at which the pdflush background writeback
 daemon will start writeback.
 
-If dirty_background_bytes is written, dirty_background_ratio becomes a function
-of its value (dirty_background_bytes / the amount of dirtyable system memory).
+Note: dirty_background_bytes is the

[Devel] Re: [PATCH 08/10] memcg: add cgroupfs interface to memcg dirty limits

2010-10-05 Thread KAMEZAWA Hiroyuki
On Tue, 05 Oct 2010 00:33:15 -0700
Greg Thelen  wrote:

> KAMEZAWA Hiroyuki  writes:
> 
> > On Sun,  3 Oct 2010 23:58:03 -0700
> > Greg Thelen  wrote:
> >
> >> Add cgroupfs interface to memcg dirty page limits:
> >>   Direct write-out is controlled with:
> >>   - memory.dirty_ratio
> >>   - memory.dirty_bytes
> >> 
> >>   Background write-out is controlled with:
> >>   - memory.dirty_background_ratio
> >>   - memory.dirty_background_bytes
> >> 
> >> Signed-off-by: Andrea Righi 
> >> Signed-off-by: Greg Thelen 
> >
> > Acked-by: KAMEZAWA Hiroyuki 
> >
> > a question below.
> >
> >
> >> ---
> >>  mm/memcontrol.c |   89 
> >> +++
> >>  1 files changed, 89 insertions(+), 0 deletions(-)
> >> 
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index 6ec2625..2d45a0a 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -100,6 +100,13 @@ enum mem_cgroup_stat_index {
> >>MEM_CGROUP_STAT_NSTATS,
> >>  };
> >>  
> >> +enum {
> >> +  MEM_CGROUP_DIRTY_RATIO,
> >> +  MEM_CGROUP_DIRTY_BYTES,
> >> +  MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> >> +  MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> >> +};
> >> +
> >>  struct mem_cgroup_stat_cpu {
> >>s64 count[MEM_CGROUP_STAT_NSTATS];
> >>  };
> >> @@ -4292,6 +4299,64 @@ static int mem_cgroup_oom_control_write(struct 
> >> cgroup *cgrp,
> >>return 0;
> >>  }
> >>  
> >> +static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
> >> +{
> >> +  struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> >> +  bool root;
> >> +
> >> +  root = mem_cgroup_is_root(mem);
> >> +
> >> +  switch (cft->private) {
> >> +  case MEM_CGROUP_DIRTY_RATIO:
> >> +  return root ? vm_dirty_ratio : mem->dirty_param.dirty_ratio;
> >> +  case MEM_CGROUP_DIRTY_BYTES:
> >> +  return root ? vm_dirty_bytes : mem->dirty_param.dirty_bytes;
> >> +  case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> >> +  return root ? dirty_background_ratio :
> >> +  mem->dirty_param.dirty_background_ratio;
> >> +  case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> >> +  return root ? dirty_background_bytes :
> >> +  mem->dirty_param.dirty_background_bytes;
> >> +  default:
> >> +  BUG();
> >> +  }
> >> +}
> >> +
> >> +static int
> >> +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
> >> +{
> >> +  struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> >> +  int type = cft->private;
> >> +
> >> +  if (cgrp->parent == NULL)
> >> +  return -EINVAL;
> >> +  if ((type == MEM_CGROUP_DIRTY_RATIO ||
> >> +   type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
> >> +  return -EINVAL;
> >> +  switch (type) {
> >> +  case MEM_CGROUP_DIRTY_RATIO:
> >> +  memcg->dirty_param.dirty_ratio = val;
> >> +  memcg->dirty_param.dirty_bytes = 0;
> >> +  break;
> >> +  case MEM_CGROUP_DIRTY_BYTES:
> >> +  memcg->dirty_param.dirty_bytes = val;
> >> +  memcg->dirty_param.dirty_ratio  = 0;
> >> +  break;
> >> +  case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> >> +  memcg->dirty_param.dirty_background_ratio = val;
> >> +  memcg->dirty_param.dirty_background_bytes = 0;
> >> +  break;
> >> +  case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> >> +  memcg->dirty_param.dirty_background_bytes = val;
> >> +  memcg->dirty_param.dirty_background_ratio = 0;
> >> +  break;
> >
> >
> > Curious... is this the same behavior as vm_dirty_ratio ?
> 
> I think this is the same behavior as vm_dirty_ratio.  When vm_dirty_ratio is
> changed then dirty_ratio_handler() will set vm_dirty_bytes=0.  When
> vm_dirty_bytes is written dirty_bytes_handler() will set
> vm_dirty_ratio=0.  So I think that the per-memcg dirty memory parameters
> mimic the behavior of vm_dirty_ratio, vm_dirty_bytes and the other
> global dirty parameters.
> 
Okay.

> Am I missing your question?
> 
No. Thank you for clarification.

-Kame



[Devel] Re: [PATCH 05/10] memcg: add dirty page accounting infrastructure

2010-10-05 Thread Greg Thelen
KAMEZAWA Hiroyuki  writes:

> On Sun,  3 Oct 2010 23:58:00 -0700
> Greg Thelen  wrote:
>
>> Add memcg routines to track dirty, writeback, and unstable_NFS pages.
>> These routines are not yet used by the kernel to count such pages.
>> A later change adds kernel calls to these new routines.
>> 
>> Signed-off-by: Greg Thelen 
>> Signed-off-by: Andrea Righi 
>
> a small request. see below.
>
>> ---
>>  include/linux/memcontrol.h |3 +
>>  mm/memcontrol.c|   89 
>> 
>>  2 files changed, 84 insertions(+), 8 deletions(-)
>> 
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 7c7bec4..6303da1 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -28,6 +28,9 @@ struct mm_struct;
>>  /* Stats that can be updated by kernel. */
>>  enum mem_cgroup_write_page_stat_item {
>>  MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
>> +MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
>> +MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
>> +MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
>>  };
>>  
>>  extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 267d774..f40839f 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -85,10 +85,13 @@ enum mem_cgroup_stat_index {
>>   */
>>  MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
>>  MEM_CGROUP_STAT_RSS,   /* # of pages charged as anon rss */
>> -MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>>  MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
>>  MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
>>  MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
>> +MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>> +MEM_CGROUP_STAT_FILE_DIRTY, /* # of dirty pages in page cache */
>> +MEM_CGROUP_STAT_FILE_WRITEBACK, /* # of pages under writeback */
>> +MEM_CGROUP_STAT_FILE_UNSTABLE_NFS,  /* # of NFS unstable pages */
>>  MEM_CGROUP_STAT_DATA, /* end of data requires synchronization */
>>  /* incremented at every  pagein/pageout */
>>  MEM_CGROUP_EVENTS = MEM_CGROUP_STAT_DATA,
>> @@ -1626,6 +1629,48 @@ void mem_cgroup_update_page_stat(struct page *page,
>>  ClearPageCgroupFileMapped(pc);
>>  idx = MEM_CGROUP_STAT_FILE_MAPPED;
>>  break;
>> +
>> +case MEMCG_NR_FILE_DIRTY:
>> +/* Use Test{Set,Clear} to only un/charge the memcg once. */
>> +if (val > 0) {
>> +if (TestSetPageCgroupFileDirty(pc))
>> +/* already set */
>> +val = 0;
>> +} else {
>> +if (!TestClearPageCgroupFileDirty(pc))
>> +/* already cleared */
>> +val = 0;
>> +}
>> +idx = MEM_CGROUP_STAT_FILE_DIRTY;
>> +break;
>> +
>> +case MEMCG_NR_FILE_WRITEBACK:
>> +/*
>> + * This counter is adjusted while holding the mapping's
>> + * tree_lock.  Therefore there is no race between setting and
>> + * clearing of this flag.
>> + */
>
> nice description.
>
>> +if (val > 0)
>> +SetPageCgroupFileWriteback(pc);
>> +else
>> +ClearPageCgroupFileWriteback(pc);
>> +idx = MEM_CGROUP_STAT_FILE_WRITEBACK;
>> +break;
>> +
>> +case MEMCG_NR_FILE_UNSTABLE_NFS:
>> +/* Use Test{Set,Clear} to only un/charge the memcg once. */
>> +if (val > 0) {
>> +if (TestSetPageCgroupFileUnstableNFS(pc))
>> +/* already set */
>> +val = 0;
>> +} else {
>> +if (!TestClearPageCgroupFileUnstableNFS(pc))
>> +/* already cleared */
>> +val = 0;
>> +}
>> +idx = MEM_CGROUP_STAT_FILE_UNSTABLE_NFS;
>> +break;
>> +
>>  default:
>>  BUG();
>>  }
>> @@ -2133,6 +2178,16 @@ static void __mem_cgroup_commit_charge(struct 
>> mem_cgroup *mem,
>>  memcg_check_events(mem, pc->page);
>>  }
>>  
>> +static void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
>> +  struct mem_cgroup *to,
>> +  enum mem_cgroup_stat_index idx)
>> +{
>> +preempt_disable();
>> +__this_cpu_dec(from->stat->count[idx]);
>> +__this_cpu_inc(to->stat->count[idx]);
>> +preempt_enable();
>> +}
>> +
>>  /**
>>   * __mem_cgroup_move_account - move account of the page
>>   * @pc: page_cgroup of the page.
>> @@ -2159,13 +2214,18 @@ static void __mem_cgrou

[Devel] Re: [PATCH 10/10] memcg: check memcg dirty limits in page writeback

2010-10-05 Thread KAMEZAWA Hiroyuki
On Sun,  3 Oct 2010 23:58:05 -0700
Greg Thelen  wrote:

> If the current process is in a non-root memcg, then
> global_dirty_limits() will consider the memcg dirty limit.
> This allows different cgroups to have distinct dirty limits
> which trigger direct and background writeback at different
> levels.
> 
> Signed-off-by: Andrea Righi 
> Signed-off-by: Greg Thelen 

This patch seems good because of its straightforward implementation.
I think it's worth testing in the -mm tree.

Thank you very much.

Acked-by: KAMEZAWA Hiroyuki 



[Devel] Re: [PATCH 08/10] memcg: add cgroupfs interface to memcg dirty limits

2010-10-05 Thread Greg Thelen
KAMEZAWA Hiroyuki  writes:

> On Sun,  3 Oct 2010 23:58:03 -0700
> Greg Thelen  wrote:
>
>> Add cgroupfs interface to memcg dirty page limits:
>>   Direct write-out is controlled with:
>>   - memory.dirty_ratio
>>   - memory.dirty_bytes
>> 
>>   Background write-out is controlled with:
>>   - memory.dirty_background_ratio
>>   - memory.dirty_background_bytes
>> 
>> Signed-off-by: Andrea Righi 
>> Signed-off-by: Greg Thelen 
>
> Acked-by: KAMEZAWA Hiroyuki 
>
> a question below.
>
>
>> ---
>>  mm/memcontrol.c |   89 
>> +++
>>  1 files changed, 89 insertions(+), 0 deletions(-)
>> 
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 6ec2625..2d45a0a 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -100,6 +100,13 @@ enum mem_cgroup_stat_index {
>>  MEM_CGROUP_STAT_NSTATS,
>>  };
>>  
>> +enum {
>> +MEM_CGROUP_DIRTY_RATIO,
>> +MEM_CGROUP_DIRTY_BYTES,
>> +MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
>> +MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
>> +};
>> +
>>  struct mem_cgroup_stat_cpu {
>>  s64 count[MEM_CGROUP_STAT_NSTATS];
>>  };
>> @@ -4292,6 +4299,64 @@ static int mem_cgroup_oom_control_write(struct cgroup 
>> *cgrp,
>>  return 0;
>>  }
>>  
>> +static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
>> +{
>> +struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
>> +bool root;
>> +
>> +root = mem_cgroup_is_root(mem);
>> +
>> +switch (cft->private) {
>> +case MEM_CGROUP_DIRTY_RATIO:
>> +return root ? vm_dirty_ratio : mem->dirty_param.dirty_ratio;
>> +case MEM_CGROUP_DIRTY_BYTES:
>> +return root ? vm_dirty_bytes : mem->dirty_param.dirty_bytes;
>> +case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
>> +return root ? dirty_background_ratio :
>> +mem->dirty_param.dirty_background_ratio;
>> +case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
>> +return root ? dirty_background_bytes :
>> +mem->dirty_param.dirty_background_bytes;
>> +default:
>> +BUG();
>> +}
>> +}
>> +
>> +static int
>> +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
>> +{
>> +struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
>> +int type = cft->private;
>> +
>> +if (cgrp->parent == NULL)
>> +return -EINVAL;
>> +if ((type == MEM_CGROUP_DIRTY_RATIO ||
>> + type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
>> +return -EINVAL;
>> +switch (type) {
>> +case MEM_CGROUP_DIRTY_RATIO:
>> +memcg->dirty_param.dirty_ratio = val;
>> +memcg->dirty_param.dirty_bytes = 0;
>> +break;
>> +case MEM_CGROUP_DIRTY_BYTES:
>> +memcg->dirty_param.dirty_bytes = val;
>> +memcg->dirty_param.dirty_ratio  = 0;
>> +break;
>> +case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
>> +memcg->dirty_param.dirty_background_ratio = val;
>> +memcg->dirty_param.dirty_background_bytes = 0;
>> +break;
>> +case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
>> +memcg->dirty_param.dirty_background_bytes = val;
>> +memcg->dirty_param.dirty_background_ratio = 0;
>> +break;
>
>
> Curious... is this the same behavior as vm_dirty_ratio ?

I think this is the same behavior as vm_dirty_ratio.  When vm_dirty_ratio is
changed then dirty_ratio_handler() will set vm_dirty_bytes=0.  When
vm_dirty_bytes is written dirty_bytes_handler() will set
vm_dirty_ratio=0.  So I think that the per-memcg dirty memory parameters
mimic the behavior of vm_dirty_ratio, vm_dirty_bytes and the other
global dirty parameters.

Am I missing your question?

> Thanks,
> -Kame


[Devel] Re: [PATCH 05/10] memcg: add dirty page accounting infrastructure

2010-10-05 Thread KAMEZAWA Hiroyuki
On Sun,  3 Oct 2010 23:58:00 -0700
Greg Thelen  wrote:

> Add memcg routines to track dirty, writeback, and unstable_NFS pages.
> These routines are not yet used by the kernel to count such pages.
> A later change adds kernel calls to these new routines.
> 
> Signed-off-by: Greg Thelen 
> Signed-off-by: Andrea Righi 

a small request. see below.

> ---
>  include/linux/memcontrol.h |3 +
>  mm/memcontrol.c|   89 
> 
>  2 files changed, 84 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 7c7bec4..6303da1 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -28,6 +28,9 @@ struct mm_struct;
>  /* Stats that can be updated by kernel. */
>  enum mem_cgroup_write_page_stat_item {
>   MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
> + MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
> + MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
> + MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
>  };
>  
>  extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 267d774..f40839f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -85,10 +85,13 @@ enum mem_cgroup_stat_index {
>*/
>   MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
>   MEM_CGROUP_STAT_RSS,   /* # of pages charged as anon rss */
> - MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>   MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
>   MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
>   MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> + MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> + MEM_CGROUP_STAT_FILE_DIRTY, /* # of dirty pages in page cache */
> + MEM_CGROUP_STAT_FILE_WRITEBACK, /* # of pages under writeback */
> + MEM_CGROUP_STAT_FILE_UNSTABLE_NFS,  /* # of NFS unstable pages */
>   MEM_CGROUP_STAT_DATA, /* end of data requires synchronization */
>   /* incremented at every  pagein/pageout */
>   MEM_CGROUP_EVENTS = MEM_CGROUP_STAT_DATA,
> @@ -1626,6 +1629,48 @@ void mem_cgroup_update_page_stat(struct page *page,
>   ClearPageCgroupFileMapped(pc);
>   idx = MEM_CGROUP_STAT_FILE_MAPPED;
>   break;
> +
> + case MEMCG_NR_FILE_DIRTY:
> + /* Use Test{Set,Clear} to only un/charge the memcg once. */
> + if (val > 0) {
> + if (TestSetPageCgroupFileDirty(pc))
> + /* already set */
> + val = 0;
> + } else {
> + if (!TestClearPageCgroupFileDirty(pc))
> + /* already cleared */
> + val = 0;
> + }
> + idx = MEM_CGROUP_STAT_FILE_DIRTY;
> + break;
> +
> + case MEMCG_NR_FILE_WRITEBACK:
> + /*
> +  * This counter is adjusted while holding the mapping's
> +	 * tree_lock.  Therefore there is no race between setting and
> +  * clearing of this flag.
> +  */

nice description.

> + if (val > 0)
> + SetPageCgroupFileWriteback(pc);
> + else
> + ClearPageCgroupFileWriteback(pc);
> + idx = MEM_CGROUP_STAT_FILE_WRITEBACK;
> + break;
> +
> + case MEMCG_NR_FILE_UNSTABLE_NFS:
> + /* Use Test{Set,Clear} to only un/charge the memcg once. */
> + if (val > 0) {
> + if (TestSetPageCgroupFileUnstableNFS(pc))
> + /* already set */
> + val = 0;
> + } else {
> + if (!TestClearPageCgroupFileUnstableNFS(pc))
> + /* already cleared */
> + val = 0;
> + }
> + idx = MEM_CGROUP_STAT_FILE_UNSTABLE_NFS;
> + break;
> +
>   default:
>   BUG();
>   }
> @@ -2133,6 +2178,16 @@ static void __mem_cgroup_commit_charge(struct 
> mem_cgroup *mem,
>   memcg_check_events(mem, pc->page);
>  }
>  
> +static void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
> +   struct mem_cgroup *to,
> +   enum mem_cgroup_stat_index idx)
> +{
> + preempt_disable();
> + __this_cpu_dec(from->stat->count[idx]);
> + __this_cpu_inc(to->stat->count[idx]);
> + preempt_enable();
> +}
> +
>  /**
>   * __mem_cgroup_move_account - move account of the page
>   * @pc:  page_cgroup of the page.
> @@ -2159,13 +2214,18 @@ static void __mem_cgroup_move_account(struct 
> page_cgroup *pc,
>   VM_BUG_ON(!PageCgroupUsed(pc));
> 

[Devel] Re: [PATCH 09/10] writeback: make determine_dirtyable_memory() static.

2010-10-05 Thread KAMEZAWA Hiroyuki
On Sun,  3 Oct 2010 23:58:04 -0700
Greg Thelen  wrote:

> The determine_dirtyable_memory() function is not used outside of
> page writeback.  Make the routine static.  No functional change.
> Just a cleanup in preparation for a change that adds memcg dirty
> limits consideration into global_dirty_limits().
> 
> Signed-off-by: Andrea Righi 
> Signed-off-by: Greg Thelen 

Hmm.
Reviewed-by: KAMEZAWA Hiroyuki 



[Devel] Re: [PATCH 08/10] memcg: add cgroupfs interface to memcg dirty limits

2010-10-05 Thread KAMEZAWA Hiroyuki
On Sun,  3 Oct 2010 23:58:03 -0700
Greg Thelen  wrote:

> Add cgroupfs interface to memcg dirty page limits:
>   Direct write-out is controlled with:
>   - memory.dirty_ratio
>   - memory.dirty_bytes
> 
>   Background write-out is controlled with:
>   - memory.dirty_background_ratio
>   - memory.dirty_background_bytes
> 
> Signed-off-by: Andrea Righi 
> Signed-off-by: Greg Thelen 

Acked-by: KAMEZAWA Hiroyuki 

a question below.


> ---
>  mm/memcontrol.c |   89 
> +++
>  1 files changed, 89 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6ec2625..2d45a0a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -100,6 +100,13 @@ enum mem_cgroup_stat_index {
>   MEM_CGROUP_STAT_NSTATS,
>  };
>  
> +enum {
> + MEM_CGROUP_DIRTY_RATIO,
> + MEM_CGROUP_DIRTY_BYTES,
> + MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> + MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +};
> +
>  struct mem_cgroup_stat_cpu {
>   s64 count[MEM_CGROUP_STAT_NSTATS];
>  };
> @@ -4292,6 +4299,64 @@ static int mem_cgroup_oom_control_write(struct cgroup 
> *cgrp,
>   return 0;
>  }
>  
> +static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
> +{
> + struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> + bool root;
> +
> + root = mem_cgroup_is_root(mem);
> +
> + switch (cft->private) {
> + case MEM_CGROUP_DIRTY_RATIO:
> + return root ? vm_dirty_ratio : mem->dirty_param.dirty_ratio;
> + case MEM_CGROUP_DIRTY_BYTES:
> + return root ? vm_dirty_bytes : mem->dirty_param.dirty_bytes;
> + case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> + return root ? dirty_background_ratio :
> + mem->dirty_param.dirty_background_ratio;
> + case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> + return root ? dirty_background_bytes :
> + mem->dirty_param.dirty_background_bytes;
> + default:
> + BUG();
> + }
> +}
> +
> +static int
> +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> + int type = cft->private;
> +
> + if (cgrp->parent == NULL)
> + return -EINVAL;
> + if ((type == MEM_CGROUP_DIRTY_RATIO ||
> +  type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
> + return -EINVAL;
> + switch (type) {
> + case MEM_CGROUP_DIRTY_RATIO:
> + memcg->dirty_param.dirty_ratio = val;
> + memcg->dirty_param.dirty_bytes = 0;
> + break;
> + case MEM_CGROUP_DIRTY_BYTES:
> + memcg->dirty_param.dirty_bytes = val;
> + memcg->dirty_param.dirty_ratio  = 0;
> + break;
> + case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> + memcg->dirty_param.dirty_background_ratio = val;
> + memcg->dirty_param.dirty_background_bytes = 0;
> + break;
> + case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> + memcg->dirty_param.dirty_background_bytes = val;
> + memcg->dirty_param.dirty_background_ratio = 0;
> + break;


Curious... is this the same behavior as vm_dirty_ratio ?


Thanks,
-Kame



[Devel] Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()

2010-10-05 Thread Greg Thelen
KAMEZAWA Hiroyuki  writes:

> On Sun,  3 Oct 2010 23:57:59 -0700
> Greg Thelen  wrote:
>
>> If pages are being migrated from a memcg, then updates to that
>> memcg's page statistics are protected by grabbing a bit spin lock
>> using lock_page_cgroup().  In an upcoming commit memcg dirty page
>> accounting will be updating memcg page accounting (specifically:
>> num writeback pages) from softirq.  Avoid a deadlocking nested
>> spin lock attempt by disabling interrupts on the local processor
>> when grabbing the page_cgroup bit_spin_lock in lock_page_cgroup().
>> This avoids the following deadlock:
>>       CPU 0                      CPU 1
>>                                  inc_file_mapped
>>                                  rcu_read_lock
>>   start move
>>   synchronize_rcu
>>                                  lock_page_cgroup
>>                                    softirq
>>                                    test_clear_page_writeback
>>                                    mem_cgroup_dec_page_stat(NR_WRITEBACK)
>>                                    rcu_read_lock
>>                                    lock_page_cgroup   /* deadlock */
>>                                    unlock_page_cgroup
>>                                    rcu_read_unlock
>>                                  unlock_page_cgroup
>>                                  rcu_read_unlock
>> 
>> By disabling interrupts in lock_page_cgroup, nested calls
>> are avoided.  The softirq would be delayed until after inc_file_mapped
>> enables interrupts when calling unlock_page_cgroup().
>> 
>> The normal, fast path, of memcg page stat updates typically
>> does not need to call lock_page_cgroup(), so this change does
>> not affect the performance of the common case page accounting.
>> 
>> Signed-off-by: Andrea Righi 
>> Signed-off-by: Greg Thelen 
>
> Nice Catch!
>
> But... hmm, this wasn't necessary for FILE_MAPPED but is necessary for the new
> statistics, right ? (This affects the order of patches.)

This patch (disabling interrupts) is not needed until later patches (in
this series) update memcg statistics from softirq.  If we only had
FILE_MAPPED, then this patch would not be needed.  I placed this patch
before the following dependent patches that need it.  The opposite order
seemed wrong because it would introduce the possibility of the deadlock
until this patch was applied.  By having this patch come first there
should be no way to apply the series in order and see the mentioned
deadlock.

> Anyway
>
> Acked-By: KAMEZAWA Hiroyuki 
>
>
>> ---
>>  include/linux/page_cgroup.h |8 +-
>>  mm/memcontrol.c |   51 +-
>>  2 files changed, 36 insertions(+), 23 deletions(-)
>> 
>> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
>> index b59c298..872f6b1 100644
>> --- a/include/linux/page_cgroup.h
>> +++ b/include/linux/page_cgroup.h
>> @@ -117,14 +117,18 @@ static inline enum zone_type page_cgroup_zid(struct 
>> page_cgroup *pc)
>>  return page_zonenum(pc->page);
>>  }
>>  
>> -static inline void lock_page_cgroup(struct page_cgroup *pc)
>> +static inline void lock_page_cgroup(struct page_cgroup *pc,
>> +unsigned long *flags)
>>  {
>> +local_irq_save(*flags);
>>  bit_spin_lock(PCG_LOCK, &pc->flags);
>>  }
>>  
>> -static inline void unlock_page_cgroup(struct page_cgroup *pc)
>> +static inline void unlock_page_cgroup(struct page_cgroup *pc,
>> +  unsigned long flags)
>>  {
>>  bit_spin_unlock(PCG_LOCK, &pc->flags);
>> +local_irq_restore(flags);
>>  }
>>  
>>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index f4259f4..267d774 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -1599,6 +1599,7 @@ void mem_cgroup_update_page_stat(struct page *page,
>>  struct mem_cgroup *mem;
>>  struct page_cgroup *pc = lookup_page_cgroup(page);
>>  bool need_unlock = false;
>> +unsigned long flags;
>>  
>>  if (unlikely(!pc))
>>  return;
>> @@ -1610,7 +1611,7 @@ void mem_cgroup_update_page_stat(struct page *page,
>>  /* pc->mem_cgroup is unstable ? */
>>  if (unlikely(mem_cgroup_stealed(mem))) {
>>  /* take a lock against to access pc->mem_cgroup */
>> -lock_page_cgroup(pc);
>> +lock_page_cgroup(pc, &flags);
>>  need_unlock = true;
>>  mem = pc->mem_cgroup;
>>  if (!mem || !PageCgroupUsed(pc))
>> @@ -1633,7 +1634,7 @@ void mem_cgroup_update_page_stat(struct page *page,
>>  
>>  out:
>>  if (unlikely(need_unlock))
>> -unlock_page_cgroup(pc);
>> +unlock_page_cgroup(pc, flags);
>>  rcu_read_unlock();
>>  return;
>>  }
>> @@ -2053,11 +2054,12 @@ struct mem_cgroup 
>> *try_get_mem_cgroup_from_page(struct page *page)
>>  struct page_cgroup *pc;
>>  unsigned short id;
>>  swp_entry_t ent;
>> +unsigned long flags;
>>  
>>  VM_BUG_ON(!PageLocked(page));
>>  
>>  pc = lookup_page_cgroup(page);
>> -lock_page_cgroup(pc);
>> +lock_page_cgroup(pc, &flags);

[Devel] Re: [PATCH 07/10] memcg: add dirty limits to mem_cgroup

2010-10-05 Thread KAMEZAWA Hiroyuki
On Sun,  3 Oct 2010 23:58:02 -0700
Greg Thelen  wrote:

> Extend mem_cgroup to contain dirty page limits.  Also add routines
> allowing the kernel to query the dirty usage of a memcg.
> 
> These interfaces are not used by the kernel yet.  A subsequent commit
> will add kernel calls to utilize these new routines.
> 
> Signed-off-by: Greg Thelen 
> Signed-off-by: Andrea Righi 

Seems nice.
Acked-by: KAMEZAWA Hiroyuki 

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 03/10] memcg: create extensible page stat update routines

2010-10-05 Thread Greg Thelen
KAMEZAWA Hiroyuki  writes:

> On Sun,  3 Oct 2010 23:57:58 -0700
> Greg Thelen  wrote:
>
>> Replace usage of the mem_cgroup_update_file_mapped() memcg
>> statistic update routine with two new routines:
>> * mem_cgroup_inc_page_stat()
>> * mem_cgroup_dec_page_stat()
>> 
>> As before, only the file_mapped statistic is managed.  However,
>> these more general interfaces allow for new statistics to be
>> more easily added.  New statistics are added with memcg dirty
>> page accounting.
>> 
>> Signed-off-by: Greg Thelen 
>> Signed-off-by: Andrea Righi 
>
> Acked-by: KAMEZAWA Hiroyuki 
>
> a nitpick. see below.
>
>> ---
>>  include/linux/memcontrol.h |   31 ---
>>  mm/memcontrol.c|   17 -
>>  mm/rmap.c  |4 ++--
>>  3 files changed, 38 insertions(+), 14 deletions(-)
>> 
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 159a076..7c7bec4 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -25,6 +25,11 @@ struct page_cgroup;
>>  struct page;
>>  struct mm_struct;
>>  
>> +/* Stats that can be updated by kernel. */
>> +enum mem_cgroup_write_page_stat_item {
>> +MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
>> +};
>> +
>>  extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>>  struct list_head *dst,
>>  unsigned long *scanned, int order,
>> @@ -121,7 +126,22 @@ static inline bool mem_cgroup_disabled(void)
>>  return false;
>>  }
>>  
>> -void mem_cgroup_update_file_mapped(struct page *page, int val);
>> +void mem_cgroup_update_page_stat(struct page *page,
>> + enum mem_cgroup_write_page_stat_item idx,
>> + int val);
>> +
>> +static inline void mem_cgroup_inc_page_stat(struct page *page,
>> +enum mem_cgroup_write_page_stat_item idx)
>> +{
>> +mem_cgroup_update_page_stat(page, idx, 1);
>> +}
>> +
>> +static inline void mem_cgroup_dec_page_stat(struct page *page,
>> +enum mem_cgroup_write_page_stat_item idx)
>> +{
>> +mem_cgroup_update_page_stat(page, idx, -1);
>> +}
>> +
>>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>>  gfp_t gfp_mask);
>>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
>> @@ -293,8 +313,13 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, 
>> struct task_struct *p)
>>  {
>>  }
>>  
>> -static inline void mem_cgroup_update_file_mapped(struct page *page,
>> -int val)
>> +static inline void mem_cgroup_inc_page_stat(struct page *page,
>> +enum mem_cgroup_write_page_stat_item idx)
>> +{
>> +}
>> +
>> +static inline void mem_cgroup_dec_page_stat(struct page *page,
>> +enum mem_cgroup_write_page_stat_item idx)
>>  {
>>  }
>>  
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 512cb12..f4259f4 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -1592,7 +1592,9 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, 
>> gfp_t mask)
>>   * possibility of race condition. If there is, we take a lock.
>>   */
>>  
>> -static void mem_cgroup_update_file_stat(struct page *page, int idx, int val)
>> +void mem_cgroup_update_page_stat(struct page *page,
>> + enum mem_cgroup_write_page_stat_item idx,
>> + int val)
>>  {
>>  struct mem_cgroup *mem;
>>  struct page_cgroup *pc = lookup_page_cgroup(page);
>> @@ -1615,30 +1617,27 @@ static void mem_cgroup_update_file_stat(struct page 
>> *page, int idx, int val)
>>  goto out;
>>  }
>>  
>> -this_cpu_add(mem->stat->count[idx], val);
>> -
>>  switch (idx) {
>> -case MEM_CGROUP_STAT_FILE_MAPPED:
>> +case MEMCG_NR_FILE_MAPPED:
>>  if (val > 0)
>>  SetPageCgroupFileMapped(pc);
>>  else if (!page_mapped(page))
>>  ClearPageCgroupFileMapped(pc);
>> +idx = MEM_CGROUP_STAT_FILE_MAPPED;
>>  break;
>>  default:
>>  BUG();
>>  }
>>  
>> +this_cpu_add(mem->stat->count[idx], val);
>> +
>
> Why did you move the this_cpu_add() placement ?
> (This placement is ok but I just wonder..)
>
> Thanks,
> -Kame

this_cpu_add() is moved after the switch because the switch is needed to
convert the input parameter from enum mem_cgroup_write_page_stat_item
(example: MEMCG_NR_FILE_MAPPED) to enum mem_cgroup_stat_index (example:
MEM_CGROUP_STAT_FILE_MAPPED) before indexing into the count array.

Also in subsequent patches (in this series) "val" is updated depending
on page_cgroup flags before usage by this_cpu_add().

[Devel] Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()

2010-10-05 Thread KAMEZAWA Hiroyuki
On Sun,  3 Oct 2010 23:57:59 -0700
Greg Thelen  wrote:

> If pages are being migrated from a memcg, then updates to that
> memcg's page statistics are protected by grabbing a bit spin lock
> using lock_page_cgroup().  In an upcoming commit memcg dirty page
> accounting will be updating memcg page accounting (specifically:
> num writeback pages) from softirq.  Avoid a deadlocking nested
> spin lock attempt by disabling interrupts on the local processor
> when grabbing the page_cgroup bit_spin_lock in lock_page_cgroup().
> This avoids the following deadlock:
>       CPU 0                            CPU 1
> inc_file_mapped
> rcu_read_lock
>                                start move
>                                synchronize_rcu
> lock_page_cgroup
>   softirq
>   test_clear_page_writeback
>   mem_cgroup_dec_page_stat(NR_WRITEBACK)
>   rcu_read_lock
>   lock_page_cgroup   /* deadlock */
>   unlock_page_cgroup
>   rcu_read_unlock
> unlock_page_cgroup
> rcu_read_unlock
> 
> By disabling interrupts in lock_page_cgroup, nested calls
> are avoided.  The softirq would be delayed until after inc_file_mapped
> enables interrupts when calling unlock_page_cgroup().
> 
> The normal, fast path, of memcg page stat updates typically
> does not need to call lock_page_cgroup(), so this change does
> not affect the performance of the common case page accounting.
> 
> Signed-off-by: Andrea Righi 
> Signed-off-by: Greg Thelen 

Nice Catch!

But..hmm this wasn't necessary for FILE_MAPPED but necessary for the new
statistics, right ? (This affects the order of patches.)

Anyway

Acked-By: KAMEZAWA Hiroyuki 


> ---
>  include/linux/page_cgroup.h |8 +-
>  mm/memcontrol.c |   51 +-
>  2 files changed, 36 insertions(+), 23 deletions(-)
> 
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index b59c298..872f6b1 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -117,14 +117,18 @@ static inline enum zone_type page_cgroup_zid(struct 
> page_cgroup *pc)
>   return page_zonenum(pc->page);
>  }
>  
> -static inline void lock_page_cgroup(struct page_cgroup *pc)
> +static inline void lock_page_cgroup(struct page_cgroup *pc,
> + unsigned long *flags)
>  {
> + local_irq_save(*flags);
>   bit_spin_lock(PCG_LOCK, &pc->flags);
>  }
>  
> -static inline void unlock_page_cgroup(struct page_cgroup *pc)
> +static inline void unlock_page_cgroup(struct page_cgroup *pc,
> +   unsigned long flags)
>  {
>   bit_spin_unlock(PCG_LOCK, &pc->flags);
> + local_irq_restore(flags);
>  }
>  
>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f4259f4..267d774 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1599,6 +1599,7 @@ void mem_cgroup_update_page_stat(struct page *page,
>   struct mem_cgroup *mem;
>   struct page_cgroup *pc = lookup_page_cgroup(page);
>   bool need_unlock = false;
> + unsigned long flags;
>  
>   if (unlikely(!pc))
>   return;
> @@ -1610,7 +1611,7 @@ void mem_cgroup_update_page_stat(struct page *page,
>   /* pc->mem_cgroup is unstable ? */
>   if (unlikely(mem_cgroup_stealed(mem))) {
>   /* take a lock against to access pc->mem_cgroup */
> - lock_page_cgroup(pc);
> + lock_page_cgroup(pc, &flags);
>   need_unlock = true;
>   mem = pc->mem_cgroup;
>   if (!mem || !PageCgroupUsed(pc))
> @@ -1633,7 +1634,7 @@ void mem_cgroup_update_page_stat(struct page *page,
>  
>  out:
>   if (unlikely(need_unlock))
> - unlock_page_cgroup(pc);
> + unlock_page_cgroup(pc, flags);
>   rcu_read_unlock();
>   return;
>  }
> @@ -2053,11 +2054,12 @@ struct mem_cgroup 
> *try_get_mem_cgroup_from_page(struct page *page)
>   struct page_cgroup *pc;
>   unsigned short id;
>   swp_entry_t ent;
> + unsigned long flags;
>  
>   VM_BUG_ON(!PageLocked(page));
>  
>   pc = lookup_page_cgroup(page);
> - lock_page_cgroup(pc);
> + lock_page_cgroup(pc, &flags);
>   if (PageCgroupUsed(pc)) {
>   mem = pc->mem_cgroup;
>   if (mem && !css_tryget(&mem->css))
> @@ -2071,7 +2073,7 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct 
> page *page)
>   mem = NULL;
>   rcu_read_unlock();
>   }
> - unlock_page_cgroup(pc);
> + unlock_page_cgroup(pc, flags);
>   return mem;
>  }
>  
> @@ -2084,13 +2086,15 @@ static void __mem_cgroup_commit_charge(struct 
> mem_cgroup *mem,
>struct page_cgroup *pc,
>enum charge_type ctype)
>  {
> + 

[Devel] Re: [PATCH 06/10] memcg: add kernel calls for memcg dirty page stats

2010-10-05 Thread KAMEZAWA Hiroyuki
On Sun,  3 Oct 2010 23:58:01 -0700
Greg Thelen  wrote:

> Add calls into memcg dirty page accounting.  Notify memcg when pages
> transition between clean, file dirty, writeback, and unstable nfs.
> This allows the memory controller to maintain an accurate view of
> the amount of its memory that is dirty.
> 
> Signed-off-by: Greg Thelen 
> Signed-off-by: Andrea Righi 

Acked-by: KAMEZAWA Hiroyuki 



[Devel] Re: [PATCH 03/10] memcg: create extensible page stat update routines

2010-10-05 Thread KAMEZAWA Hiroyuki
On Sun,  3 Oct 2010 23:57:58 -0700
Greg Thelen  wrote:

> Replace usage of the mem_cgroup_update_file_mapped() memcg
> statistic update routine with two new routines:
> * mem_cgroup_inc_page_stat()
> * mem_cgroup_dec_page_stat()
> 
> As before, only the file_mapped statistic is managed.  However,
> these more general interfaces allow for new statistics to be
> more easily added.  New statistics are added with memcg dirty
> page accounting.
> 
> Signed-off-by: Greg Thelen 
> Signed-off-by: Andrea Righi 

Acked-by: KAMEZAWA Hiroyuki 

a nitpick. see below.

> ---
>  include/linux/memcontrol.h |   31 ---
>  mm/memcontrol.c|   17 -
>  mm/rmap.c  |4 ++--
>  3 files changed, 38 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 159a076..7c7bec4 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -25,6 +25,11 @@ struct page_cgroup;
>  struct page;
>  struct mm_struct;
>  
> +/* Stats that can be updated by kernel. */
> +enum mem_cgroup_write_page_stat_item {
> + MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
> +};
> +
>  extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>   struct list_head *dst,
>   unsigned long *scanned, int order,
> @@ -121,7 +126,22 @@ static inline bool mem_cgroup_disabled(void)
>   return false;
>  }
>  
> -void mem_cgroup_update_file_mapped(struct page *page, int val);
> +void mem_cgroup_update_page_stat(struct page *page,
> +  enum mem_cgroup_write_page_stat_item idx,
> +  int val);
> +
> +static inline void mem_cgroup_inc_page_stat(struct page *page,
> + enum mem_cgroup_write_page_stat_item idx)
> +{
> + mem_cgroup_update_page_stat(page, idx, 1);
> +}
> +
> +static inline void mem_cgroup_dec_page_stat(struct page *page,
> + enum mem_cgroup_write_page_stat_item idx)
> +{
> + mem_cgroup_update_page_stat(page, idx, -1);
> +}
> +
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>   gfp_t gfp_mask);
>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
> @@ -293,8 +313,13 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, 
> struct task_struct *p)
>  {
>  }
>  
> -static inline void mem_cgroup_update_file_mapped(struct page *page,
> - int val)
> +static inline void mem_cgroup_inc_page_stat(struct page *page,
> + enum mem_cgroup_write_page_stat_item idx)
> +{
> +}
> +
> +static inline void mem_cgroup_dec_page_stat(struct page *page,
> + enum mem_cgroup_write_page_stat_item idx)
>  {
>  }
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 512cb12..f4259f4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1592,7 +1592,9 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, 
> gfp_t mask)
>   * possibility of race condition. If there is, we take a lock.
>   */
>  
> -static void mem_cgroup_update_file_stat(struct page *page, int idx, int val)
> +void mem_cgroup_update_page_stat(struct page *page,
> +  enum mem_cgroup_write_page_stat_item idx,
> +  int val)
>  {
>   struct mem_cgroup *mem;
>   struct page_cgroup *pc = lookup_page_cgroup(page);
> @@ -1615,30 +1617,27 @@ static void mem_cgroup_update_file_stat(struct page 
> *page, int idx, int val)
>   goto out;
>   }
>  
> - this_cpu_add(mem->stat->count[idx], val);
> -
>   switch (idx) {
> - case MEM_CGROUP_STAT_FILE_MAPPED:
> + case MEMCG_NR_FILE_MAPPED:
>   if (val > 0)
>   SetPageCgroupFileMapped(pc);
>   else if (!page_mapped(page))
>   ClearPageCgroupFileMapped(pc);
> + idx = MEM_CGROUP_STAT_FILE_MAPPED;
>   break;
>   default:
>   BUG();
>   }
>  
> + this_cpu_add(mem->stat->count[idx], val);
> +

Why did you move the this_cpu_add() placement ?
(This placement is ok but I just wonder..)

Thanks,
-Kame
