[PATCH] fix multiple mark_page_accessed calls when sequentially writing a file with a blocksize less than PAGE_SIZE, which might pollute the LRU.

2013-01-08 Thread Qiang Gao
A sequential write to a file with a block size smaller than PAGE_SIZE calls
mark_page_accessed multiple times on the same page.

    if (!pagevec_space(pvec))
        __pagevec_lru_add(pvec, lru);

This trick seems to fix the problem, but not thoroughly: if another page is
added to the pvec before mark_page_accessed is called on the 14th page a
second time, the 14th page still ends up active.
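
To make the effect concrete, here is a small user-space sketch of the two-touch rule as I understand it. It is not kernel code: struct page_model, model_mark_page_accessed and page_activated_by_writes are names made up for this illustration, the page size is assumed to be 4096, and the unevictable case is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u   /* assumed page size for this model */

/* Hypothetical stand-in for the few page flags that matter here. */
struct page_model {
    bool referenced;
    bool active;
    bool on_lru;    /* false while the page still sits in a pagevec */
};

/* Simplified two-touch rule: the first call sets "referenced", a second
 * call on a page that is already on the LRU activates it. */
void model_mark_page_accessed(struct page_model *p)
{
    assert(p != NULL);
    if (!p->active && p->referenced && p->on_lru) {
        p->active = true;
        p->referenced = false;
    } else if (!p->referenced) {
        p->referenced = true;
    }
}

/* Fill one page with sequential writes of block_size bytes, calling the
 * model once per write, and report whether the page ended up active. */
bool page_activated_by_writes(size_t block_size, bool on_lru)
{
    struct page_model p = { false, false, on_lru };

    assert(block_size > 0);
    for (size_t off = 0; off < MODEL_PAGE_SIZE; off += block_size)
        model_mark_page_accessed(&p);
    return p.active;
}
```

In this model, page_activated_by_writes(1024, true) reports an activated page after a single sequential pass, while a full-page write (4096) or a page that is still in the pagevec (on_lru == false) does not.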



diff --git a/fs/open.c b/fs/open.c
index 9b33c0c..a418419 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -674,6 +674,13 @@ int open_check_o_direct(struct file *f)
return 0;
 }

+
+static void
+file_w_state_init(struct file_w_state *wstat)
+{
+   wstat->prev_w_pos=-1;
+}
+
 static int do_dentry_open(struct file *f,
  int (*open)(struct inode *, struct file *),
  const struct cred *cred)
@@ -730,6 +737,7 @@ static int do_dentry_open(struct file *f,
f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);

file_ra_state_init(&f->f_ra, f->f_mapping->host->i_mapping);
+   file_w_state_init(&f->f_wstat);

return 0;

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7617ee0..b90d3ff 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -746,6 +746,10 @@ struct file_ra_state {
loff_t prev_pos;    /* Cache last read() position */
 };

+struct file_w_state {
+   loff_t prev_w_pos;
+};
+
 /*
  * Check if @index falls in the readahead windows.
  */
@@ -787,6 +791,7 @@ struct file {
struct fown_struct  f_owner;
const struct cred   *f_cred;
struct file_ra_state  f_ra;
+   struct file_w_state f_wstat;

u64 f_version;
 #ifdef CONFIG_SECURITY
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..ea144a9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2287,6 +2287,9 @@ static ssize_t generic_perform_write(struct file *file,
long status = 0;
ssize_t written = 0;
unsigned int flags = 0;
+   unsigned long prev_pos=file->f_wstat.prev_w_pos;
+   unsigned long prev_off=(prev_pos & (PAGE_CACHE_SIZE -1));
+   pgoff_t prev_index=(prev_pos >> PAGE_CACHE_SHIFT);

/*
 * Copies from kernel address space cannot fail (NFSD is a big user).
@@ -2296,12 +2299,14 @@ static ssize_t generic_perform_write(struct file *file,

do {
struct page *page;
+   pgoff_t index;
unsigned long offset;   /* Offset into pagecache page */
unsigned long bytes;    /* Bytes to write to page */
size_t copied;  /* Bytes copied from user */
void *fsdata;

offset = (pos & (PAGE_CACHE_SIZE - 1));
+   index = pos >> PAGE_CACHE_SHIFT;
bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
iov_iter_count(i));

@@ -2334,7 +2339,8 @@ again:
pagefault_enable();
flush_dcache_page(page);

-   mark_page_accessed(page);
+   if (index != prev_index || offset != prev_off)
+   mark_page_accessed(page);
status = a_ops->write_end(file, mapping, pos, bytes, copied,
page, fsdata);
if (unlikely(status < 0))
@@ -2359,6 +2365,8 @@ again:
}
pos += copied;
written += copied;
+   prev_index = index;
+   prev_off = offset;

balance_dirty_pages_ratelimited(mapping);
if (fatal_signal_pending(current)) {
@@ -2367,6 +2375,7 @@ again:
}
} while (iov_iter_count(i));

+   file->f_wstat.prev_w_pos=pos;
return written ? written : status;
 }
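
The comparison added in the mm/filemap.c hunk can be exercised outside the kernel with a rough model. The sketch below only mirrors the index/offset check and the prev_w_pos bookkeeping from the diff; w_state_model, model_write and count_marks_sequential are hypothetical names, the page size is assumed to be 4096, and nothing here touches a real page cache.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12                      /* assumed 4096-byte pages */
#define MODEL_PAGE_SIZE  (1UL << MODEL_PAGE_SHIFT)

/* Hypothetical stand-in for struct file_w_state from the patch. */
struct w_state_model {
    long long prev_w_pos;
};

/* Model one write() of len bytes at *pos. Returns how many times the
 * patched loop would have called mark_page_accessed. */
unsigned model_write(struct w_state_model *ws, unsigned long long *pos,
                     size_t len)
{
    assert(ws != NULL && pos != NULL);

    unsigned long prev_pos = (unsigned long)ws->prev_w_pos;
    unsigned long prev_off = prev_pos & (MODEL_PAGE_SIZE - 1);
    unsigned long prev_index = prev_pos >> MODEL_PAGE_SHIFT;
    unsigned marks = 0;

    while (len > 0) {
        unsigned long offset = (unsigned long)(*pos & (MODEL_PAGE_SIZE - 1));
        unsigned long index = (unsigned long)(*pos >> MODEL_PAGE_SHIFT);
        size_t bytes = MODEL_PAGE_SIZE - offset;

        if (bytes > len)
            bytes = len;

        /* Same condition as in the patch. */
        if (index != prev_index || offset != prev_off)
            marks++;

        *pos += bytes;
        len -= bytes;
        prev_index = index;
        prev_off = offset;
    }
    ws->prev_w_pos = (long long)*pos;
    return marks;
}

/* Issue nwrites back-to-back writes of block_size bytes from offset 0. */
unsigned count_marks_sequential(size_t block_size, unsigned nwrites)
{
    struct w_state_model ws = { -1 };   /* as set by file_w_state_init() */
    unsigned long long pos = 0;
    unsigned total = 0;

    for (unsigned i = 0; i < nwrites; i++)
        total += model_write(&ws, &pos, block_size);
    return total;
}
```

In this model, four sequential 1024-byte writes starting at offset 0 produce a single mark, and one 8192-byte write produces two (one per page).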
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: process hangs on do_exit when oom happens

2012-10-25 Thread Qiang Gao
On Thu, Oct 25, 2012 at 5:57 PM, Michal Hocko  wrote:
> On Wed 24-10-12 11:44:17, Qiang Gao wrote:
>> On Wed, Oct 24, 2012 at 1:43 AM, Balbir Singh  wrote:
>> > On Tue, Oct 23, 2012 at 3:45 PM, Michal Hocko  wrote:
>> >> On Tue 23-10-12 18:10:33, Qiang Gao wrote:
>> >>> On Tue, Oct 23, 2012 at 5:50 PM, Michal Hocko  wrote:
>> >>> > On Tue 23-10-12 15:18:48, Qiang Gao wrote:
>> >>> >> This process was moved to RT-priority queue when global oom-killer
>> >>> >> happened to boost the recovery of the system..
>> >>> >
>> >>> > Who did that? oom killer doesn't boost the priority (scheduling class)
>> >>> > AFAIK.
>> >>> >
>> >>> >> but it wasn't get properily dealt with. I still have no idea why where
>> >>> >> the problem is ..
>> >>> >
>> >>> > Well your configuration says that there is no runtime reserved for the
>> >>> > group.
>> >>> > Please refer to Documentation/scheduler/sched-rt-group.txt for more
>> >>> > information.
>> >>> >
>> >> [...]
>> >>> maybe this is not a upstream-kernel bug. the centos/redhat kernel
>> >>> would boost the process to RT prio when the process was selected
>> >>> by oom-killer.
>> >>
>> >> This still looks like your cpu controller is misconfigured. Even if the
>> >> task is promoted to be realtime.
>> >
>> >
>> > Precisely! You need to have rt bandwidth enabled for RT tasks to run,
>> > as a workaround please give the groups some RT bandwidth and then work
>> > out the migration to RT and what should be the defaults on the distro.
>> >
>> > Balbir
>>
>>
>> see https://patchwork.kernel.org/patch/719411/
>
> The patch surely "fixes" your problem but the primary fault here is the
> mis-configured cpu cgroup. If the value for the bandwidth is zero by
> default then all realtime processes in the group a screwed. The value
> should be set to something more reasonable.
> I am not familiar with the cpu controller but it seems that
> alloc_rt_sched_group needs some treat. Care to look into it and send a
> patch to the cpu controller and cgroup maintainers, please?
>
> --
> Michal Hocko
> SUSE Labs

I'm trying to fix the problem, but have made no substantive progress yet.
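
For anyone who wants to check for the misconfiguration discussed above, here is a minimal sketch that reads a group's cpu.rt_runtime_us and tests it. It assumes a cgroup v1 cpu controller; read_long_from_file and rt_tasks_can_run are hypothetical helper names, and the path in the comment is only an example built from the /cgroup/cpu/worker layout mentioned earlier in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Read a single integer from a file such as
 * /cgroup/cpu/worker/cpu.rt_runtime_us (example path, adjust as needed).
 * Returns 0 on success, -1 on failure. */
int read_long_from_file(const char *path, long *out)
{
    assert(path != NULL && out != NULL);

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    int rc = (fscanf(f, "%ld", out) == 1) ? 0 : -1;
    fclose(f);
    return rc;
}

/* A runtime of 0 leaves realtime tasks in the group with no CPU time.
 * A negative value means "no limit", so only 0 is treated as fatal. */
bool rt_tasks_can_run(long rt_runtime_us)
{
    return rt_runtime_us != 0;
}
```

If rt_tasks_can_run() returns false for the group the worker is in, a task promoted to a realtime class there will be throttled indefinitely, which matches the rt_throttled : 1 and rt_runtime : 0 values quoted from information.txt.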


Re: process hangs on do_exit when oom happens

2012-10-23 Thread Qiang Gao
On Wed, Oct 24, 2012 at 1:43 AM, Balbir Singh  wrote:
> On Tue, Oct 23, 2012 at 3:45 PM, Michal Hocko  wrote:
>> On Tue 23-10-12 18:10:33, Qiang Gao wrote:
>>> On Tue, Oct 23, 2012 at 5:50 PM, Michal Hocko  wrote:
>>> > On Tue 23-10-12 15:18:48, Qiang Gao wrote:
>>> >> This process was moved to RT-priority queue when global oom-killer
>>> >> happened to boost the recovery of the system..
>>> >
>>> > Who did that? oom killer doesn't boost the priority (scheduling class)
>>> > AFAIK.
>>> >
>>> >> but it wasn't get properily dealt with. I still have no idea why where
>>> >> the problem is ..
>>> >
>>> > Well your configuration says that there is no runtime reserved for the
>>> > group.
>>> > Please refer to Documentation/scheduler/sched-rt-group.txt for more
>>> > information.
>>> >
>> [...]
>>> maybe this is not a upstream-kernel bug. the centos/redhat kernel
>>> would boost the process to RT prio when the process was selected
>>> by oom-killer.
>>
>> This still looks like your cpu controller is misconfigured. Even if the
>> task is promoted to be realtime.
>
>
> Precisely! You need to have rt bandwidth enabled for RT tasks to run,
> as a workaround please give the groups some RT bandwidth and then work
> out the migration to RT and what should be the defaults on the distro.
>
> Balbir


see https://patchwork.kernel.org/patch/719411/


Re: process hangs on do_exit when oom happens

2012-10-23 Thread Qiang Gao
On Tue, Oct 23, 2012 at 5:50 PM, Michal Hocko  wrote:
> On Tue 23-10-12 15:18:48, Qiang Gao wrote:
>> This process was moved to RT-priority queue when global oom-killer
>> happened to boost the recovery of the system..
>
> Who did that? oom killer doesn't boost the priority (scheduling class)
> AFAIK.
>
>> but it wasn't get properily dealt with. I still have no idea why where
>> the problem is ..
>
> Well your configuration says that there is no runtime reserved for the
> group.
> Please refer to Documentation/scheduler/sched-rt-group.txt for more
> information.
>
>> On Tue, Oct 23, 2012 at 12:40 PM, Balbir Singh  wrote:
>> > On Tue, Oct 23, 2012 at 9:05 AM, Qiang Gao  wrote:
>> >> information about the system is in the attach file "information.txt"
>> >>
>> >> I can not reproduce it in the upstream 3.6.0 kernel..
>> >>
>> >> On Sat, Oct 20, 2012 at 12:04 AM, Michal Hocko  wrote:
>> >>> On Wed 17-10-12 18:23:34, gaoqiang wrote:
>> >>>> I looked up nothing useful with google,so I'm here for help..
>> >>>>
>> >>>> when this happens:  I use memcg to limit the memory use of a
>> >>>> process,and when the memcg cgroup was out of memory,
>> >>>> the process was oom-killed   however,it cannot really complete the
>> >>>> exiting. here is the some information
>> >>>
>> >>> How many tasks are in the group and what kind of memory do they use?
>> >>> Is it possible that you were hit by the same issue as described in
>> >>> 79dfdacc memcg: make oom_lock 0 and 1 based rather than counter.
>> >>>
>> >>>> OS version:  centos6.2  2.6.32.220.7.1
>> >>>
>> >>> Your kernel is quite old and you should be probably asking your
>> >>> distribution to help you out. There were many fixes since 2.6.32.
>> >>> Are you able to reproduce the same issue with the current vanila kernel?
>> >>>
>> >>>> /proc/pid/stack
>> >>>> ---
>> >>>>
>> >>>> [] __cond_resched+0x2a/0x40
>> >>>> [] unmap_vmas+0xb49/0xb70
>> >>>> [] exit_mmap+0x7e/0x140
>> >>>> [] mmput+0x58/0x110
>> >>>> [] exit_mm+0x11d/0x160
>> >>>> [] do_exit+0x1ad/0x860
>> >>>> [] do_group_exit+0x41/0xb0
>> >>>> [] get_signal_to_deliver+0x1e8/0x430
>> >>>> [] do_notify_resume+0xf4/0x8b0
>> >>>> [] int_signal+0x12/0x17
>> >>>> [] 0x
>> >>>
>> >>> This looks strange because this is just an exit part which shouldn't
>> >>> deadlock or anything. Is this stack stable? Have you tried to take check
>> >>> it more times?
>> >
>> > Looking at information.txt, I found something interesting
>> >
>> > rt_rq[0]:/1314
>> >   .rt_nr_running : 1
>> >   .rt_throttled  : 1
>> >   .rt_time   : 0.856656
>> >   .rt_runtime: 0.00
>> >
>> >
>> > cfs_rq[0]:/1314
>> >   .exec_clock: 8738.133429
>> >   .MIN_vruntime  : 0.01
>> >   .min_vruntime  : 8739.371271
>> >   .max_vruntime  : 0.01
>> >   .spread: 0.00
>> >   .spread0   : -9792.24
>> >   .nr_spread_over: 1
>> >   .nr_running: 0
>> >   .load  : 0
>> >   .load_avg  : 7376.722880
>> >   .load_period   : 7.203830
>> >   .load_contrib  : 1023
>> >   .load_tg   : 1023
>> >   .se->exec_start: 282004.715064
>> >   .se->vruntime  : 18435.664560
>> >   .se->sum_exec_runtime  : 8738.133429
>> >   .se->wait_start: 0.00
>> >   .se->sleep_start   : 0.00
>> >   .se->block_start   : 0.00
>> >   .se->sleep_max : 0.00
>> >   .se->block_max : 0.00
>> >   .se->exec_max  : 77.977054
>> >   .se->slice_max : 0.00

Re: process hangs on do_exit when oom happens

2012-10-23 Thread Qiang Gao
A global OOM is the right thing to happen here, but an OOM-killed process
hanging in do_exit is not normal behavior.

On Tue, Oct 23, 2012 at 5:01 PM, Sha Zhengju  wrote:
> On 10/23/2012 11:35 AM, Qiang Gao wrote:
>>
>> information about the system is in the attach file "information.txt"
>>
>> I can not reproduce it in the upstream 3.6.0 kernel..
>>
>> On Sat, Oct 20, 2012 at 12:04 AM, Michal Hocko  wrote:
>>>
>>> On Wed 17-10-12 18:23:34, gaoqiang wrote:
>>>>
>>>> I looked up nothing useful with google,so I'm here for help..
>>>>
>>>> when this happens:  I use memcg to limit the memory use of a
>>>> process,and when the memcg cgroup was out of memory,
>>>> the process was oom-killed   however,it cannot really complete the
>>>> exiting. here is the some information
>>>
>>> How many tasks are in the group and what kind of memory do they use?
>>> Is it possible that you were hit by the same issue as described in
>>> 79dfdacc memcg: make oom_lock 0 and 1 based rather than counter.
>>>
>>>> OS version:  centos6.2  2.6.32.220.7.1
>>>
>>> Your kernel is quite old and you should be probably asking your
>>> distribution to help you out. There were many fixes since 2.6.32.
>>> Are you able to reproduce the same issue with the current vanila kernel?
>>>
>>>> /proc/pid/stack
>>>> ---
>>>>
>>>> [] __cond_resched+0x2a/0x40
>>>> [] unmap_vmas+0xb49/0xb70
>>>> [] exit_mmap+0x7e/0x140
>>>> [] mmput+0x58/0x110
>>>> [] exit_mm+0x11d/0x160
>>>> [] do_exit+0x1ad/0x860
>>>> [] do_group_exit+0x41/0xb0
>>>> [] get_signal_to_deliver+0x1e8/0x430
>>>> [] do_notify_resume+0xf4/0x8b0
>>>> [] int_signal+0x12/0x17
>>>> [] 0x
>>>
>>> This looks strange because this is just an exit part which shouldn't
>>> deadlock or anything. Is this stack stable? Have you tried to take check
>>> it more times?
>>>
>
> Does the machine only have about 700M memory? I also find something
> in the log file:
>
> Node 0 DMA free:2772kB min:72kB low:88kB high:108kB present:15312kB..
> lowmem_reserve[]: 0 674 674 674
> Node 0 DMA32 free:*3172kB* min:3284kB low:4104kB high:4924kB
> present:690712kB ..
> lowmem_reserve[]: 0 0 0 0
> 0 pages in swap cache
> Swap cache stats: add 0, delete 0, find 0/0
> Free swap  = 0kB
> Total swap = 0kB
> 179184 pages RAM  ==>  179184 * 4 / 1024 = *700M*
> 6773 pages reserved
>
>
> Note that the free memory of DMA32(3172KB) is lower than min watermark,
> which means the global is under pressure now. What's more the swap is off,
> so the global oom is normal behavior.
>
>
> Thanks,
> Sha


Re: process hangs on do_exit when oom happens

2012-10-23 Thread Qiang Gao
This is just an example to show how to reproduce the problem. Actually, the
first time I saw this situation was on a machine with 288G of RAM running many
tasks, each limited to 30G. In the end, no task exceeded its limit, yet the
system as a whole went OOM.


On Tue, Oct 23, 2012 at 4:35 PM, Michal Hocko  wrote:
> On Tue 23-10-12 11:35:52, Qiang Gao wrote:
>> I'm sure this is a global-oom,not cgroup-oom. [the dmesg output in the end]
>
> Yes this is the global oom killer because:
>> cglimit -M 700M ./tt
>> then after global-oom,the process hangs..
>
>> 179184 pages RAM
>
> So you have ~700M of RAM so the memcg limit is basically pointless as it
> cannot be reached...
> --
> Michal Hocko
> SUSE Labs


Re: process hangs on do_exit when oom happens

2012-10-23 Thread Qiang Gao
This process was moved to the RT-priority queue when the global oom-killer
ran, to speed up the recovery of the system, but it was not handled properly.
I still have no idea where the problem is.

On Tue, Oct 23, 2012 at 12:40 PM, Balbir Singh  wrote:
> On Tue, Oct 23, 2012 at 9:05 AM, Qiang Gao  wrote:
>> information about the system is in the attach file "information.txt"
>>
>> I can not reproduce it in the upstream 3.6.0 kernel..
>>
>> On Sat, Oct 20, 2012 at 12:04 AM, Michal Hocko  wrote:
>>> On Wed 17-10-12 18:23:34, gaoqiang wrote:
>>>> I looked up nothing useful with google,so I'm here for help..
>>>>
>>>> when this happens:  I use memcg to limit the memory use of a
>>>> process,and when the memcg cgroup was out of memory,
>>>> the process was oom-killed   however,it cannot really complete the
>>>> exiting. here is the some information
>>>
>>> How many tasks are in the group and what kind of memory do they use?
>>> Is it possible that you were hit by the same issue as described in
>>> 79dfdacc memcg: make oom_lock 0 and 1 based rather than counter.
>>>
>>>> OS version:  centos6.2  2.6.32.220.7.1
>>>
>>> Your kernel is quite old and you should be probably asking your
>>> distribution to help you out. There were many fixes since 2.6.32.
>>> Are you able to reproduce the same issue with the current vanila kernel?
>>>
>>>> /proc/pid/stack
>>>> ---
>>>>
>>>> [] __cond_resched+0x2a/0x40
>>>> [] unmap_vmas+0xb49/0xb70
>>>> [] exit_mmap+0x7e/0x140
>>>> [] mmput+0x58/0x110
>>>> [] exit_mm+0x11d/0x160
>>>> [] do_exit+0x1ad/0x860
>>>> [] do_group_exit+0x41/0xb0
>>>> [] get_signal_to_deliver+0x1e8/0x430
>>>> [] do_notify_resume+0xf4/0x8b0
>>>> [] int_signal+0x12/0x17
>>>> [] 0x
>>>
>>> This looks strange because this is just an exit part which shouldn't
>>> deadlock or anything. Is this stack stable? Have you tried to take check
>>> it more times?
>
> Looking at information.txt, I found something interesting
>
> rt_rq[0]:/1314
>   .rt_nr_running : 1
>   .rt_throttled  : 1
>   .rt_time   : 0.856656
>   .rt_runtime: 0.00
>
>
> cfs_rq[0]:/1314
>   .exec_clock: 8738.133429
>   .MIN_vruntime  : 0.01
>   .min_vruntime  : 8739.371271
>   .max_vruntime  : 0.01
>   .spread: 0.00
>   .spread0   : -9792.24
>   .nr_spread_over: 1
>   .nr_running: 0
>   .load  : 0
>   .load_avg  : 7376.722880
>   .load_period   : 7.203830
>   .load_contrib  : 1023
>   .load_tg   : 1023
>   .se->exec_start: 282004.715064
>   .se->vruntime  : 18435.664560
>   .se->sum_exec_runtime  : 8738.133429
>   .se->wait_start: 0.00
>   .se->sleep_start   : 0.00
>   .se->block_start   : 0.00
>   .se->sleep_max : 0.00
>   .se->block_max : 0.00
>   .se->exec_max  : 77.977054
>   .se->slice_max : 0.00
>   .se->wait_max  : 2.664779
>   .se->wait_sum  : 29.970575
>   .se->wait_count: 102
>   .se->load.weight   : 2
>
> So 1314 is a real time process and
>
> cpu.rt_period_us:
> 100
> --
> cpu.rt_runtime_us:
> 0
>
> When did tt move to being a Real Time process (hint: see nr_running
> and nr_throttled)?
>
> Balbir


Re: process hangs on do_exit when oom happens

2012-10-21 Thread Qiang Gao
I don't know whether the process will eventually exit, but this stack
persists for hours, which is obviously abnormal.
The situation: we use a command called "cglimit" to fork-and-exec the
worker process, and "cglimit" sets some limits on the worker with cgroups.
For now we limit the memory; we also use the cpu cgroup, but with no limit.
So when the worker is running, the cgroup directories look like the
following:

/cgroup/memory/worker : this directory limits the memory
/cgroup/cpu/worker : no limit, but the worker process is in it.

For some reason (some other process we didn't consider), the worker
process invokes the global oom-killer, not the cgroup oom-killer. Then the
worker process hangs there.

Actually, if we don't put the worker process into the cpu cgroup,
this never happens.
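
For reference, the fork-and-exec setup described above can be sketched roughly as follows. This is not the real "cglimit" source: write_string_to_file, attach_pid_to_cgroup and launch_worker are hypothetical names, the code assumes a cgroup v1 hierarchy where a task joins a group by writing its pid to the group's "tasks" file, and the memory limit itself is assumed to have been configured beforehand.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Write a string to a file. Returns 0 on success, -1 on failure. */
int write_string_to_file(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;

    int rc = (fputs(value, f) >= 0) ? 0 : -1;
    if (fclose(f) != 0)     /* write errors may only show up on close */
        rc = -1;
    return rc;
}

/* Put pid into the cgroup rooted at cgroup_dir by writing to its
 * "tasks" file (cgroup v1 convention, assumed here). */
int attach_pid_to_cgroup(const char *cgroup_dir, pid_t pid)
{
    char path[4096];
    char buf[32];

    assert(cgroup_dir != NULL);
    if (snprintf(path, sizeof(path), "%s/tasks", cgroup_dir)
            >= (int)sizeof(path))
        return -1;
    snprintf(buf, sizeof(buf), "%ld\n", (long)pid);
    return write_string_to_file(path, buf);
}

/* Fork, place the child into both cgroups, then exec the worker.
 * Returns the child's pid in the parent, or -1 if fork() failed. */
pid_t launch_worker(char *const argv[])
{
    pid_t pid = fork();

    if (pid == 0) {
        if (attach_pid_to_cgroup("/cgroup/memory/worker", getpid()) != 0 ||
            attach_pid_to_cgroup("/cgroup/cpu/worker", getpid()) != 0)
            _exit(126);
        execvp(argv[0], argv);
        _exit(127);         /* only reached if exec failed */
    }
    return pid;
}
```

Dropping the second attach_pid_to_cgroup() call corresponds to the case where the worker is not placed in the cpu cgroup.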

On Sat, Oct 20, 2012 at 12:04 AM, Michal Hocko  wrote:
>
> On Wed 17-10-12 18:23:34, gaoqiang wrote:
> > I looked up nothing useful with google,so I'm here for help..
> >
> > when this happens:  I use memcg to limit the memory use of a
> > process,and when the memcg cgroup was out of memory,
> > the process was oom-killed   however,it cannot really complete the
> > exiting. here is the some information
>
> How many tasks are in the group and what kind of memory do they use?
> Is it possible that you were hit by the same issue as described in
> 79dfdacc memcg: make oom_lock 0 and 1 based rather than counter.
>
> > OS version:  centos6.2  2.6.32.220.7.1
>
> Your kernel is quite old and you should be probably asking your
> distribution to help you out. There were many fixes since 2.6.32.
> Are you able to reproduce the same issue with the current vanila kernel?
>
> > /proc/pid/stack
> > ---
> >
> > [] __cond_resched+0x2a/0x40
> > [] unmap_vmas+0xb49/0xb70
> > [] exit_mmap+0x7e/0x140
> > [] mmput+0x58/0x110
> > [] exit_mm+0x11d/0x160
> > [] do_exit+0x1ad/0x860
> > [] do_group_exit+0x41/0xb0
> > [] get_signal_to_deliver+0x1e8/0x430
> > [] do_notify_resume+0xf4/0x8b0
> > [] int_signal+0x12/0x17
> > [] 0x
>
> This looks strange because this is just an exit part which shouldn't
> deadlock or anything. Is this stack stable? Have you tried to take check
> it more times?
>
> --
> Michal Hocko
> SUSE Labs