Re: [PATCH] mm: don't warn about allocations which stall for too long

2017-11-09 Thread Tetsuo Handa
Michal Hocko wrote:
> On Thu 09-11-17 10:34:46, peter enderborg wrote:
> > On 11/09/2017 09:52 AM, Michal Hocko wrote:
> > > I am not sure. I would rather see a tracepoint to mark the allocator
> > > entry. This would allow both 1) measuring the allocation latency (to
> > > compare it to the trace_mm_page_alloc and 2) check for stalls with
> > > arbitrary user defined timeout (just print all allocations which haven't
> > > passed trace_mm_page_alloc for the given amount of time).
> > 
> > Traces are not that expensive, but there are more than a few calls
> > in this path. And I'm trying to keep it small enough that it can be
> > used for maintenance versions too.
> >
> > This suggestion is a quick way of keeping the current solution for
> > the ones that are interested in the slow allocations. If we are going
> > for a solution with a time-out parameter from the user, what interface
> > do you suggest for this configuration? A filter parameter for the
> > event?
> 
> I meant to do all that in postprocessing. So no specific API is needed,
> just parse the output. Anyway, it seems that the printk will be put in
> shape in a foreseeable future so we might preserve the stall warning
> after all. It is the show_mem part which is interesting during that
> warning.

I don't know whether printk() will be put in shape in a foreseeable future.
The rule that "do not try to printk() faster than the kernel can write to
consoles" will remain no matter how printk() changes. Unless asynchronous
approach like https://lwn.net/Articles/723447/ is used, I think we can't
obtain useful information.


Re: [RFC] hung task: check specific tasks for long uninterruptible sleep state

2017-11-08 Thread Tetsuo Handa
Lingutla Chandrasekhar wrote:
> Some tasks may intentionally move to uninterruptible sleep state,
> which shouldn't lead to khungtask panics, as those are recoverable
> hangs. So to avoid false hang reports, add an option to select the
> tasks to be monitored and report/panic on them only.

What are the backtraces of such tasks? Please point out the locations in the code.

If they are absolutely recoverable, why can't we let them declare
"I'm intentionally in uninterruptible state, but there is no dependency that
prevents me from recovering, so please ignore me." using per-"struct
task_struct" flags rather than introducing a userspace-controlled interface?


Re: [PATCH v3] printk: Add console owner and waiter logic to loadbalance console writes

2017-11-07 Thread Tetsuo Handa
Sergey Senozhatsky wrote:
> On (11/06/17 21:06), Tetsuo Handa wrote:
> > I tried your patch with warn_alloc() torture. It did not cause lockups.
> > But I felt that possibility of failing to flush last second messages (such
> > as SysRq-c or SysRq-b) to consoles has increased. Is this psychological?
> 
> do I understand it correctly that there are "lost messages"?

Messages that were not written to consoles.

It seems that due to warn_alloc() torture, messages added to logbuf by SysRq
were not printed. When printk() is storming, we can't expect that last second
messages will be printed to consoles.

> hm... wondering if this is a regression.

I wish there were an API which allows waiting for printk() messages
to be flushed. That is, associate a serial number with each printk() request,
and allow callers to know the current queued number (i.e. number of messages
in the logbuf) and the current completed number (i.e. number of messages written
to consoles) and compare them like time_after(). For example:

  printk("Hello\n");
  printk("World\n");
  seq = get_printk_queued_seq();

and then

  while (get_printk_completed_seq() < seq)
          msleep(1); // if caller can wait unconditionally.

or

  while (get_printk_completed_seq() < seq && !fatal_signal_pending(current))
          msleep(1); // if caller can wait unless killed.

or

  now = jiffies;
  while (get_printk_completed_seq() < seq && time_before(jiffies, now + HZ))
          cpu_relax(); // if caller cannot schedule().

and so on, so that watchdog kernel threads can try to wait for printk() to flush.
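
A minimal sketch of such helpers, assuming they live inside
kernel/printk/printk.c and simply wrap the existing log_next_seq and
console_seq counters (locking and atomicity of the u64 reads on 32-bit are
ignored here, so this is illustrative only, not a proposal-quality patch):

  /* Hypothetical helpers, not in mainline. */
  u64 get_printk_queued_seq(void)
  {
          return log_next_seq;  /* sequence of the next record to be stored */
  }

  u64 get_printk_completed_seq(void)
  {
          return console_seq;   /* sequence of the next record to be written to consoles */
  }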



By the way, I got the lockdep splat below. It is too timing dependent to reproduce.
( http://I-love.SAKURA.ne.jp/tmp/serial-20171107.txt.xz )

[8.631358] fbcon: svgadrmfb (fb0) is primary device
[8.654303] Console: switching to colour frame buffer device 160x48
[8.659706]
[8.659707] ==
[8.659707] WARNING: possible circular locking dependency detected
[8.659708] 4.14.0-rc8+ #312 Not tainted
[8.659708] --
[8.659708] systemd-udevd/184 is trying to acquire lock:
[8.659709]  (&(&par->dirty.lock)->rlock){}, at: [] 
vmw_fb_dirty_mark+0x33/0xf0 [vmwgfx]
[8.659710]
[8.659710] but task is already holding lock:
[8.659711]  (console_owner){}, at: [] 
console_unlock+0x173/0x5c0
[8.659712]
[8.659712] which lock already depends on the new lock.
[8.659712]
[8.659713]
[8.659713] the existing dependency chain (in reverse order) is:
[8.659713]
[8.659713] -> #4 (console_owner){}:
[8.659715]lock_acquire+0x6d/0x90
[8.659715]console_unlock+0x199/0x5c0
[8.659715]vprintk_emit+0x4e0/0x540
[8.659715]vprintk_default+0x1a/0x20
[8.659716]vprintk_func+0x22/0x60
[8.659716]printk+0x53/0x6a
[8.659716]start_kernel+0x6d/0x4c3
[8.659717]x86_64_start_reservations+0x24/0x26
[8.659717]x86_64_start_kernel+0x6f/0x72
[8.659717]verify_cpu+0x0/0xfb
[8.659717]
[8.659718] -> #3 (logbuf_lock){..-.}:
[8.659719]lock_acquire+0x6d/0x90
[8.659719]_raw_spin_lock+0x2c/0x40
[8.659719]vprintk_emit+0x7f/0x540
[8.659720]vprintk_deferred+0x1b/0x40
[8.659720]printk_deferred+0x53/0x6f
[8.659720]unwind_next_frame.part.6+0x1ed/0x200
[8.659721]unwind_next_frame+0x11/0x20
[8.659721]__save_stack_trace+0x7d/0xf0
[8.659721]save_stack_trace+0x16/0x20
[8.659721]set_track+0x6b/0x1a0
[8.659722]free_debug_processing+0xce/0x2aa
[8.659722]__slab_free+0x1eb/0x2c0
[8.659722]kmem_cache_free+0x19a/0x1e0
[8.659723]file_free_rcu+0x23/0x40
[8.659723]rcu_process_callbacks+0x2ed/0x5b0
[8.659723]__do_softirq+0xf2/0x21e
[8.659724]irq_exit+0xe7/0x100
[8.659724]smp_apic_timer_interrupt+0x56/0x90
[8.659724]apic_timer_interrupt+0xa7/0xb0
[8.659724]vmw_send_msg+0x91/0xc0 [vmwgfx]
[8.659725]
[8.659725] -> #2 (&(&n->list_lock)->rlock){-.-.}:
[8.659726]lock_acquire+0x6d/0x90
[8.659726]_raw_spin_lock+0x2c/0x40
[8.659727]get_partial_node.isra.87+0x44/0x2c0
[8.659727]___slab_alloc+0x262/0x610
[8.659727]__slab_alloc+0x41/0x85
[8.659727]kmem_cache_alloc+0x18a/0x1d0
[8.659728]__debug_object_init+0x3e2/0x400
[8.659728]debug_object_activate+0x12d/0x200
[8.659728]add_timer+0x6f/0x1b0
[8.659729]__queue_delayed_work+0x5b/0xa0
[8.659729]queue_delayed_work_on+0x4f/0xa0
[8.659729]check_lifetime+0x194/0x2e0
[8.659730]process_one_work+0x1c1/0x3e0
[8.659730]worker_thread

Re: [PATCH v3] printk: Add console owner and waiter logic to load balance console writes

2017-11-06 Thread Tetsuo Handa
I tried your patch with warn_alloc() torture. It did not cause lockups.
But I felt that possibility of failing to flush last second messages (such
as SysRq-c or SysRq-b) to consoles has increased. Is this psychological?

-- vmcore-dmesg start --
[  169.016198] postgres cpuset=
[  169.032544]  filemap_fault+0x311/0x790
[  169.047745] /
[  169.047780]  mems_allowed=0
[  169.050577]  ? xfs_ilock+0x126/0x1a0 [xfs]
[  169.062769]  mems_allowed=0
[  169.065754]  ? down_read_nested+0x3a/0x60
[  169.065783]  ? xfs_ilock+0x126/0x1a0 [xfs]
[  189.700206] sysrq: SysRq :
[  189.700639]  __xfs_filemap_fault.isra.19+0x3f/0xe0 [xfs]
[  189.700799]  xfs_filemap_fault+0xb/0x10 [xfs]
[  189.703981] Trigger a crash
[  189.707032]  __do_fault+0x19/0xa0
[  189.710008] BUG: unable to handle kernel
[  189.713387]  __handle_mm_fault+0xbb3/0xda0
[  189.716473] NULL pointer dereference
[  189.719674]  handle_mm_fault+0x14f/0x300
[  189.722969]  at   (null)
[  189.722974] IP: sysrq_handle_crash+0x3b/0x70
[  189.726156]  ? handle_mm_fault+0x39/0x300
[  189.729537] PGD 1170dc067
[  189.732841]  __do_page_fault+0x23e/0x4f0
[  189.735876] P4D 1170dc067
[  189.739171]  do_page_fault+0x30/0x80
[  189.742323] PUD 1170dd067
[  189.745437]  page_fault+0x22/0x30
[  189.748329] PMD 0
[  189.751106] RIP: 0033:0x650390
[  189.756583] RSP: 002b:7fffef6b1568 EFLAGS: 00010246
[  189.759574] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
[  189.762607] RAX:  RBX: 7fffef6b1594 RCX: 7fae949caa20
[  189.765665] Modules linked in:
[  189.768423] RDX: 0008 RSI:  RDI: 
[  189.768425] RBP: 7fffef6b1590 R08: 0002 R09: 0010
[  189.771478]  ip6t_rpfilter
[  189.774297] R10: 0001 R11: 0246 R12: 
[  189.777016]  ipt_REJECT
[  189.779366] R13:  R14: 7fae969787e0 R15: 0004
[  189.782114]  nf_reject_ipv4
[  189.784839] CPU: 7 PID: 6959 Comm: sleep Not tainted 4.14.0-rc8+ #302
[  189.785113] Mem-Info:
-- vmcore-dmesg end --

-- serial console start --
[  168.975447] Mem-Info:
[  168.975453] active_anon:827953 inactive_anon:3376 isolated_anon:0
[  168.975453]  active_file:55 inactive_file:449 isolated_file:246
[  168.975453]  unevictable:0 dirty:2 writeback:68 unstable:0
[  168.975453]  slab_reclaimable:4344 slab_unreclaimable:36066
[  168.975453]  mapped:2250 shmem:3543 pagetables:9568 bounce:0
[  168.975453]  free:21398 free_pcp:175 free_cma:0
[  168.975458] Node 0 active_anon:3311812kB inactive_anon:13504kB 
active_file:220kB inactive_file:1796kB unevictable:0kB isolated(anon):0kB 
isolated(file):984kB mapped:9000kB dirty:8kB writeback:272kB shmem:14172kB 
shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2869248kB writeback_tmp:0kB 
unstable:0kB all_unreclaimable? no
[  168.975460] Node 0 DMA free:14756kB min:288kB low:360kB high:432kB 
active_anon:1088kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB 
kernel_stack:0kB pagetables:28kB bounce:0kB free_pcp:0kB local_pcp:0kB 
free_cma:0kB
[  168.975482] lowmem_reserve[]: 0 2686 3619 3619
[  168.975489] Node 0 DMA32 free:53624kB min:49956kB low:62444kB high:74932kB 
active_anon:2691088kB inactive_anon:16kB active_file:0kB inactive_file:0kB 
unevictable:0kB writepending:0kB present:3129152kB managed:2751400kB 
mlocked:0kB kernel_stack:32kB pagetables:4232kB bounce:0kB free_pcp:0kB 
local_pcp:0kB free_cma:0kB
[  168.975494] lowmem_reserve[]: 0 0 932 932
[  168.975501] Node 0 Normal free:17212kB min:17336kB low:21668kB high:26000kB 
active_anon:619636kB inactive_anon:13488kB active_file:220kB 
inactive_file:2872kB unevictable:0kB writepending:280kB present:1048576kB 
managed:954828kB mlocked:0kB kernel_stack:22784kB pagetables:34012kB bounce:0kB 
free_pcp:700kB local_pcp:40kB free_cma:0kB
[  168.975505] lowmem_reserve[]: 0 0 0 0
[  168.975512] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (UM) 
2*128kB (UM) 2*256kB (UM) 1*512kB (M) 1*1024kB (U) 0*2048kB 3*4096kB (M) = 
14756kB
[  168.975536] Node 0 DMA32: 18*4kB (U) 14*8kB (UM) 14*16kB (UE) 37*32kB (UE) 
9*64kB (UE) 4*128kB (UME) 1*256kB (M) 3*512kB (UME) 2*1024kB (UE) 3*2048kB 
(UME) 10*4096kB (M) = 53624kB
[0.00] Linux version 4.14.0-rc8+ (root@localhost.localdomain) (gcc 
version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC)) #302 SMP Mon Nov 6 12:15:00 
JST 2017
[0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-4.14.0-rc8+ 
root=UUID=98df1583-260a-423a-a193-182dade5d085 ro security=none 
sysrq_always_enabled console=ttyS0,115200n8 console=tty0 LANG=en_US.UTF-8 
irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off 
udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug 
transparent_hugepage=never disable_cpu_apicid=0 elfcorehdr=867704K
-- serial console end --


Re: [PATCH v17 4/6] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-11-04 Thread Tetsuo Handa
Wei Wang wrote:
> On 11/03/2017 07:25 PM, Tetsuo Handa wrote:
> >> @@ -184,8 +307,12 @@ static unsigned fill_balloon(struct virtio_balloon 
> >> *vb, size_t num)
> >>   
> >>num_allocated_pages = vb->num_pfns;
> >>/* Did we get any? */
> >> -  if (vb->num_pfns != 0)
> >> -  tell_host(vb, vb->inflate_vq);
> >> +  if (vb->num_pfns) {
> >> +  if (use_sg)
> >> +  tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);
> > Please describe why tell_host_sgs() can work without __GFP_DIRECT_RECLAIM 
> > allocation,
> > for tell_host_sgs() is called with vb->balloon_lock mutex held.
> 
> Essentially, 
> tell_host_sgs()-->send_balloon_page_sg()-->add_one_sg()-->virtqueue_add_inbuf(
>  
> , , num=1 ,,GFP_KERNEL)
> won't need any memory allocation, because we always add one sg (i.e. 
> num=1) each time. That memory
> allocation option is only used when multiple sgs are added (i.e. num > 
> 1) and the implementation inside virtqueue_add_inbuf
> needs allocation of the indirect descriptor table.
> 
> We could also add some comments above the function to explain a little 
> about this if necessary.

Yes, please do so.

Or maybe replace GFP_KERNEL with GFP_NOWAIT or 0. Though Michael might remove 
that GFP
argument ( 
http://lkml.kernel.org/r/201710022344.jii17368.hqtlomjoosf...@i-love.sakura.ne.jp
 ).

> > If this is inside vb->balloon_lock mutex (isn't this?), xb_set_page() must 
> > not
> > use __GFP_DIRECT_RECLAIM allocation, for leak_balloon_sg_oom() will be 
> > blocked
> > on vb->balloon_lock mutex.
> 
> OK. Since the preload() doesn't need too much memory (< 4K in total), 
> how about GFP_NOWAIT here?

Maybe GFP_NOWAIT | __GFP_NOWARN ?
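
To illustrate, a comment along these lines might do (wording is mine, not
from the posted patch, and the preload flags are only the suggestion above):

  /*
   * send_balloon_page_sg() adds a single sg entry per call, so
   * virtqueue_add_inbuf() never needs to allocate an indirect descriptor
   * table here and its gfp argument is effectively unused.  Together with
   * a GFP_NOWAIT | __GFP_NOWARN preload for xb_set_page(), no
   * __GFP_DIRECT_RECLAIM allocation happens while vb->balloon_lock is
   * held, so we cannot deadlock against leak_balloon_sg_oom().
   */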


Re: [PATCH v3] printk: Add console owner and waiter logic to loadbalance console writes

2017-11-04 Thread Tetsuo Handa
Tetsuo Handa wrote:
> John Hubbard wrote:
> > On 11/03/2017 02:46 PM, John Hubbard wrote:
> > > On 11/03/2017 04:54 AM, Steven Rostedt wrote:
> > >> On Fri, 3 Nov 2017 07:21:21 -0400
> > >> Steven Rostedt  wrote:
> > [...]
> > >>
> > >> I'll condense the patch to show what I mean:
> > >>
> > >> To become a waiter, a task must do the following:
> > >>
> > >> +printk_safe_enter_irqsave(flags);
> > >> +
> > >> +raw_spin_lock(&console_owner_lock);
> > >> +owner = READ_ONCE(console_owner);
> > >> +waiter = READ_ONCE(console_waiter);
> 
> When CPU0 is writing to consoles after "console_owner = current;",
> what prevents CPU1 and CPU2, which concurrently reached this line, from
> seeing waiter == false && owner != NULL && owner != current (and therefore
> concurrently setting console_waiter = true and spin = true) without
> using atomic instructions?

Oops. I overlooked that console_owner_lock is held.


Re: [PATCH v3] printk: Add console owner and waiter logic to loadbalance console writes

2017-11-04 Thread Tetsuo Handa
John Hubbard wrote:
> On 11/03/2017 02:46 PM, John Hubbard wrote:
> > On 11/03/2017 04:54 AM, Steven Rostedt wrote:
> >> On Fri, 3 Nov 2017 07:21:21 -0400
> >> Steven Rostedt  wrote:
> [...]
> >>
> >> I'll condense the patch to show what I mean:
> >>
> >> To become a waiter, a task must do the following:
> >>
> >> +  printk_safe_enter_irqsave(flags);
> >> +
> >> +  raw_spin_lock(&console_owner_lock);
> >> +  owner = READ_ONCE(console_owner);
> >> +  waiter = READ_ONCE(console_waiter);

When CPU0 is writing to consoles after "console_owner = current;",
what prevents CPU1 and CPU2, which concurrently reached this line, from
seeing waiter == false && owner != NULL && owner != current (and therefore
concurrently setting console_waiter = true and spin = true) without
using atomic instructions?

> >> +  if (!waiter && owner && owner != current) {
> >> +  WRITE_ONCE(console_waiter, true);
> >> +  spin = true;
> >> +  }
> >> +  raw_spin_unlock(&console_owner_lock);
> >>
> >>
> >> The new waiter gets set only if there isn't already a waiter *and*
> >> there is an owner that is not current (and with the printk_safe_enter I
> >> don't think that is even needed).
> >>
> >> +  while (!READ_ONCE(console_waiter))
> >> +  cpu_relax();
> >>
> >> The spin is outside the spin lock. But only the owner can clear it.
> >>
> >> Now the owner is doing a loop of this (with interrupts disabled)
> >>
> >> +  raw_spin_lock(&console_owner_lock);
> >> +  console_owner = current;
> >> +  raw_spin_unlock(&console_owner_lock);
> >>
> >> Write to consoles.
> >>
> >> +  raw_spin_lock(&console_owner_lock);
> >> +  waiter = READ_ONCE(console_waiter);
> >> +  console_owner = NULL;
> >> +  raw_spin_unlock(&console_owner_lock);
> >>
> >> +  if (waiter)
> >> +  break;
> >>
> >> At this moment console_owner is NULL, and no new waiters can happen.
> >> The next owner will be the waiter that is spinning.
> >>
> >> +  if (waiter) {
> >> +  WRITE_ONCE(console_waiter, false);
> >>
> >> There is no possibility of another task sneaking in and becoming a
> >> waiter at this moment. The console_owner was cleared under spin lock,
> >> and a waiter is only set under the same spin lock if owner is set.
> >> There will be no new owner sneaking in because to become the owner, you
> >> must have the console lock. Since it is never released between the time
> >> the owner clears console_waiter and the waiter takes the console lock,
> >> there is no race.
> > 
> > Yes, you are right of course. That does close the window. Sorry about
> > missing that point.
> > 
> > I'll try to quickly put together a small patch on top of this, that
> > shows a simplification, to just use an atomic compare and swap between a
> > global atomic value, and a local (on the stack) flag value, just in
> > case that is of interest.
> > 
> > thanks
> > john h
> 
> Just a follow-up: I was unable to simplify this; the atomic compare-and-swap
> approach merely made it different, rather than smaller or simpler.

Why is there no need to use the [cmp]xchg() approach?

> 
> So, after spending a fair amount of time with the patch, it looks good to me,
> for whatever that's worth. :) Thanks again for explaining the locking details.
> 
> thanks
> john h
> 
> > 
> >>
> >> -- Steve


Re: [PATCH] mm,page_alloc: Update comment for last second allocation attempt.

2017-11-03 Thread Tetsuo Handa
Michal Hocko wrote:
> On Fri 03-11-17 23:08:35, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Fri 03-11-17 22:46:29, Tetsuo Handa wrote:
> > > [...]
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index c274960..547e9cb 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -3312,11 +3312,10 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t 
> > > > *nodemask, const char *fmt, ...)
> > > > }
> > > >  
> > > > /*
> > > > -* Go through the zonelist yet one more time, keep very high 
> > > > watermark
> > > > -* here, this is only to catch a parallel oom killing, we must 
> > > > fail if
> > > > -* we're still under heavy pressure. But make sure that this 
> > > > reclaim
> > > > -* attempt shall not depend on __GFP_DIRECT_RECLAIM && 
> > > > !__GFP_NORETRY
> > > > -* allocation which will never fail due to oom_lock already 
> > > > held.
> > > > +* This allocation attempt must not depend on 
> > > > __GFP_DIRECT_RECLAIM &&
> > > > +* !__GFP_NORETRY allocation which will never fail due to 
> > > > oom_lock
> > > > +* already held. And since this allocation attempt does not 
> > > > sleep,
> > > > +* there is no reason we must use high watermark here.
> > > >  */
> > > > page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
> > > >   ~__GFP_DIRECT_RECLAIM, order,
> > > 
> > > Which patch does this depend on?
> > 
> > This patch is preparation for "mm,oom: Move last second allocation to inside
> > the OOM killer." patch in order to use changelog close to what you 
> > suggested.
> > That is, I will move this comment and get_page_from_freelist() together to
> > alloc_pages_before_oomkill(), after we recorded why using ALLOC_WMARK_HIGH.
> 
> Is it really worth a separate patch, though? Aren't you overcomplicating
> things again?

It is really worth a separate patch, for you don't want to include the paragraph
below in the "mm,oom: Move last second allocation to inside the OOM killer." patch,

  > __alloc_pages_may_oom() is doing last second allocation attempt using
  > ALLOC_WMARK_HIGH before calling out_of_memory(). This had two reasons.
  > 
  > The first reason is explained in the comment that it aims to catch
  > potential parallel OOM killing. But there is no longer parallel OOM
  > killing (in the sense that out_of_memory() is called "concurrently")
  > because we serialize out_of_memory() calls using oom_lock.
  > 
  > The second reason is explained by Andrea Arcangeli (who added that code)
  > that it aims to reduce the likelihood of OOM livelocks and be sure to
  > invoke the OOM killer. There was a risk of livelock or anyway of delayed
  > OOM killer invocation if ALLOC_WMARK_MIN is used, for relying on last
  > few pages which are constantly allocated and freed in the meantime will
  > not improve the situation.
  
  > But there is no longer possibility of OOM
  > livelocks or failing to invoke the OOM killer because we need to mask
  > __GFP_DIRECT_RECLAIM for last second allocation attempt because oom_lock
  > prevents __GFP_DIRECT_RECLAIM && !__GFP_NORETRY allocations which last
  > second allocation attempt indirectly involve from failing.
  
  I really fail to see how this has anything to do with the paragraph
  above. We are not talking about the reclaim for the last attempt. We are
  talking about reclaim that might have happened in _other_ context. Why
  don't you simply stick with the changelog which I've suggested and which
  is much more clear and easier to read.

while I want to avoid blindly copying or moving outdated comments.


Re: [PATCH] mm,page_alloc: Update comment for last second allocation attempt.

2017-11-03 Thread Tetsuo Handa
Michal Hocko wrote:
> On Fri 03-11-17 22:46:29, Tetsuo Handa wrote:
> [...]
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index c274960..547e9cb 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3312,11 +3312,10 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t 
> > *nodemask, const char *fmt, ...)
> > }
> >  
> > /*
> > -* Go through the zonelist yet one more time, keep very high watermark
> > -* here, this is only to catch a parallel oom killing, we must fail if
> > -* we're still under heavy pressure. But make sure that this reclaim
> > -* attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
> > -* allocation which will never fail due to oom_lock already held.
> > +* This allocation attempt must not depend on __GFP_DIRECT_RECLAIM &&
> > +* !__GFP_NORETRY allocation which will never fail due to oom_lock
> > +* already held. And since this allocation attempt does not sleep,
> > +* there is no reason we must use high watermark here.
> >  */
> > page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
> >   ~__GFP_DIRECT_RECLAIM, order,
> 
> Which patch does this depend on?

This patch is preparation for "mm,oom: Move last second allocation to inside
the OOM killer." patch in order to use changelog close to what you suggested.
That is, I will move this comment and get_page_from_freelist() together to
alloc_pages_before_oomkill(), after we recorded why using ALLOC_WMARK_HIGH.


[PATCH] mm,page_alloc: Update comment for last second allocation attempt.

2017-11-03 Thread Tetsuo Handa
__alloc_pages_may_oom() is doing last second allocation attempt using
ALLOC_WMARK_HIGH before calling out_of_memory(). This had two reasons.

The first reason is explained in the comment that it aims to catch
potential parallel OOM killing. But there is no longer parallel OOM
killing (in the sense that out_of_memory() is called "concurrently")
because we serialize out_of_memory() calls using oom_lock.

The second reason is explained by Andrea Arcangeli (who added that code)
that it aims to reduce the likelihood of OOM livelocks and be sure to
invoke the OOM killer. There was a risk of livelock or anyway of delayed
OOM killer invocation if ALLOC_WMARK_MIN is used, for relying on last
few pages which are constantly allocated and freed in the meantime will
not improve the situation. But there is no longer a possibility of OOM
livelocks or of failing to invoke the OOM killer, because we have to mask
__GFP_DIRECT_RECLAIM for the last second allocation attempt anyway; the
oom_lock we already hold prevents the __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
allocations which the last second allocation attempt could indirectly
involve from failing.

Since the OOM killer does not always kill a process consuming a significant
amount of memory (the OOM killer kills the process with the highest OOM score,
or instead one of its children if any), there will be cases where
ALLOC_WMARK_HIGH fails and ALLOC_WMARK_MIN succeeds.
Since the gap between ALLOC_WMARK_HIGH and ALLOC_WMARK_MIN can be changed
by the /proc/sys/vm/min_free_kbytes parameter, using ALLOC_WMARK_MIN for the
last second allocation attempt might be better for minimizing the number of
OOM victims. But that change should be done in a separate patch. This patch
just clarifies that ALLOC_WMARK_HIGH is an arbitrary choice.

Signed-off-by: Tetsuo Handa 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
---
 mm/page_alloc.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c274960..547e9cb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3312,11 +3312,10 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, 
const char *fmt, ...)
}
 
/*
-* Go through the zonelist yet one more time, keep very high watermark
-* here, this is only to catch a parallel oom killing, we must fail if
-* we're still under heavy pressure. But make sure that this reclaim
-* attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
-* allocation which will never fail due to oom_lock already held.
+* This allocation attempt must not depend on __GFP_DIRECT_RECLAIM &&
+* !__GFP_NORETRY allocation which will never fail due to oom_lock
+* already held. And since this allocation attempt does not sleep,
+* there is no reason we must use high watermark here.
 */
page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
  ~__GFP_DIRECT_RECLAIM, order,
-- 
1.8.3.1



Re: [PATCH v17 4/6] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-11-03 Thread Tetsuo Handa
Wei Wang wrote:
> @@ -164,6 +284,8 @@ static unsigned fill_balloon(struct virtio_balloon *vb, 
> size_t num)
>   break;
>   }
>  
> + if (use_sg && xb_set_page(vb, page, &pfn_min, &pfn_max) < 0)

Isn't this leaking "page" ?

> + break;
>   balloon_page_push(&pages, page);
>   }
>  



> @@ -184,8 +307,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, 
> size_t num)
>  
>   num_allocated_pages = vb->num_pfns;
>   /* Did we get any? */
> - if (vb->num_pfns != 0)
> - tell_host(vb, vb->inflate_vq);
> + if (vb->num_pfns) {
> + if (use_sg)
> + tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);

Please describe why tell_host_sgs() can work without __GFP_DIRECT_RECLAIM 
allocation,
for tell_host_sgs() is called with vb->balloon_lock mutex held.

> + else
> + tell_host(vb, vb->inflate_vq);
> + }
>   mutex_unlock(&vb->balloon_lock);
>  
>   return num_allocated_pages;



> @@ -223,7 +353,13 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
> size_t num)
>   page = balloon_page_dequeue(vb_dev_info);
>   if (!page)
>   break;
> - set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> + if (use_sg) {
> + if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0)

Isn't this leaking "page" ?

If this is inside the vb->balloon_lock mutex (isn't it?), xb_set_page() must not
use a __GFP_DIRECT_RECLAIM allocation, for leak_balloon_sg_oom() will be blocked
on the vb->balloon_lock mutex.

> + break;
> + } else {
> + set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> + }
> +
>   list_add(&page->lru, &pages);
>   vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
>   }


Re: [PATCH v17 3/6] mm/balloon_compaction.c: split balloon page allocation and enqueue

2017-11-03 Thread Tetsuo Handa
Wei Wang wrote:
> Here's a detailed analysis of the deadlock by Tetsuo Handa:
> 
> In leak_balloon(), mutex_lock(&vb->balloon_lock) is called in order to
> serialize against fill_balloon(). But in fill_balloon(),
> alloc_page(GFP_HIGHUSER[_MOVABLE] | __GFP_NOMEMALLOC | __GFP_NORETRY) is
> called with vb->balloon_lock mutex held. Since GFP_HIGHUSER[_MOVABLE]
> implies __GFP_DIRECT_RECLAIM | __GFP_IO | __GFP_FS, despite __GFP_NORETRY
> is specified, this allocation attempt might indirectly depend on somebody
> else's __GFP_DIRECT_RECLAIM memory allocation. And such indirect
> __GFP_DIRECT_RECLAIM memory allocation might call leak_balloon() via
> virtballoon_oom_notify() via blocking_notifier_call_chain() callback via
> out_of_memory() when it reached __alloc_pages_may_oom() and held oom_lock
> mutex. Since vb->balloon_lock mutex is already held by fill_balloon(), it
> will cause OOM lockup. Thus, do not wait for vb->balloon_lock mutex if
> leak_balloon() is called from out_of_memory().

Please drop "Thus, do not wait for vb->balloon_lock mutex if leak_balloon()
is called from out_of_memory()." part. This is not what this patch will do.

> 
> Thread1                                   Thread2
> fill_balloon()
>   takes a balloon_lock
>   balloon_page_enqueue()
>     alloc_page(GFP_HIGHUSER_MOVABLE)
>       direct reclaim (__GFP_FS context)   takes a fs lock
>         waits for that fs lock            alloc_page(GFP_NOFS)
>                                             __alloc_pages_may_oom()
>                                               takes the oom_lock
>                                                 out_of_memory()
>                                                   blocking_notifier_call_chain()
>                                                     leak_balloon()
>                                                       tries to take that
>                                                         balloon_lock and deadlocks


Re: [PATCH v17 1/6] lib/xbitmap: Introduce xbitmap

2017-11-03 Thread Tetsuo Handa
I'm commenting without understanding the logic.

Wei Wang wrote:
> +
> +bool xb_preload(gfp_t gfp);
> +

This wants a __must_check annotation, for __radix_tree_preload() is marked
with the __must_check annotation. Failing to check the result of
xb_preload() by mistake will leave preemption disabled unexpectedly.
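
Concretely, the prototype above would become (so that a caller that ignores
the return value gets a compiler warning):

  bool __must_check xb_preload(gfp_t gfp);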



> +int xb_set_bit(struct xb *xb, unsigned long bit)
> +{
> + int err;
> + unsigned long index = bit / IDA_BITMAP_BITS;
> + struct radix_tree_root *root = &xb->xbrt;
> + struct radix_tree_node *node;
> + void **slot;
> + struct ida_bitmap *bitmap;
> + unsigned long ebit;
> +
> + bit %= IDA_BITMAP_BITS;
> + ebit = bit + 2;
> +
> + err = __radix_tree_create(root, index, 0, &node, &slot);
> + if (err)
> + return err;
> + bitmap = rcu_dereference_raw(*slot);
> + if (radix_tree_exception(bitmap)) {
> + unsigned long tmp = (unsigned long)bitmap;
> +
> + if (ebit < BITS_PER_LONG) {
> + tmp |= 1UL << ebit;
> + rcu_assign_pointer(*slot, (void *)tmp);
> + return 0;
> + }
> + bitmap = this_cpu_xchg(ida_bitmap, NULL);
> + if (!bitmap)

Please write the locking rules, in order to explain how the memory
allocated by __radix_tree_create() will not leak.

> + return -EAGAIN;
> + memset(bitmap, 0, sizeof(*bitmap));
> + bitmap->bitmap[0] = tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT;
> + rcu_assign_pointer(*slot, bitmap);
> + }
> +
> + if (!bitmap) {
> + if (ebit < BITS_PER_LONG) {
> + bitmap = (void *)((1UL << ebit) |
> + RADIX_TREE_EXCEPTIONAL_ENTRY);
> + __radix_tree_replace(root, node, slot, bitmap, NULL,
> + NULL);
> + return 0;
> + }
> + bitmap = this_cpu_xchg(ida_bitmap, NULL);
> + if (!bitmap)

Same here.

> + return -EAGAIN;
> + memset(bitmap, 0, sizeof(*bitmap));
> + __radix_tree_replace(root, node, slot, bitmap, NULL, NULL);
> + }
> +
> + __set_bit(bit, bitmap->bitmap);
> + return 0;
> +}



> +void xb_clear_bit(struct xb *xb, unsigned long bit)
> +{
> + unsigned long index = bit / IDA_BITMAP_BITS;
> + struct radix_tree_root *root = &xb->xbrt;
> + struct radix_tree_node *node;
> + void **slot;
> + struct ida_bitmap *bitmap;
> + unsigned long ebit;
> +
> + bit %= IDA_BITMAP_BITS;
> + ebit = bit + 2;
> +
> + bitmap = __radix_tree_lookup(root, index, &node, &slot);
> + if (radix_tree_exception(bitmap)) {
> + unsigned long tmp = (unsigned long)bitmap;
> +
> + if (ebit >= BITS_PER_LONG)
> + return;
> + tmp &= ~(1UL << ebit);
> + if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
> + __radix_tree_delete(root, node, slot);
> + else
> + rcu_assign_pointer(*slot, (void *)tmp);
> + return;
> + }
> +
> + if (!bitmap)
> + return;
> +
> + __clear_bit(bit, bitmap->bitmap);
> + if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {

Please write the locking rules, in order to explain how double kfree() and/or
use-after-free can be avoided.

> + kfree(bitmap);
> + __radix_tree_delete(root, node, slot);
> + }
> +}



> +void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long 
> end)
> +{
> + struct radix_tree_root *root = &xb->xbrt;
> + struct radix_tree_node *node;
> + void **slot;
> + struct ida_bitmap *bitmap;
> + unsigned int nbits;
> +
> + for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
> + unsigned long index = start / IDA_BITMAP_BITS;
> + unsigned long bit = start % IDA_BITMAP_BITS;
> +
> + bitmap = __radix_tree_lookup(root, index, &node, &slot);
> + if (radix_tree_exception(bitmap)) {
> + unsigned long ebit = bit + 2;
> + unsigned long tmp = (unsigned long)bitmap;
> +
> + nbits = min(end - start + 1, BITS_PER_LONG - ebit);
> +
> + if (ebit >= BITS_PER_LONG)
> + continue;
> + bitmap_clear(&tmp, ebit, nbits);
> + if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
> + __radix_tree_delete(root, node, slot);
> + else
> + rcu_assign_pointer(*slot, (void *)tmp);
> + } else if (bitmap) {
> + nbits = min(end - start + 1, IDA_BITMAP_BITS - bit);
> +
> + if (nbits != IDA_BITMAP_BITS)
> + bitmap_clear(bitmap->bitmap, bit, nbits);
> +
> + if (nbi

[PATCH v2 2/2] mm,oom: Use ALLOC_OOM for OOM victim's last second allocation.

2017-11-02 Thread Tetsuo Handa
Until Linux 4.7, we were using

  if (current->mm &&
      (fatal_signal_pending(current) || task_will_free_mem(current)))

as a condition to try allocation from memory reserves with the risk of OOM
lockup, but reports like [1] were impossible. Linux 4.8+ are regressed
compared to Linux 4.7 due to the risk of needlessly selecting more OOM
victims.

Although commit cd04ae1e2dc8e365 ("mm, oom: do not rely on TIF_MEMDIE for
memory reserves access") mitigated this regression by not requiring each
OOM victim thread to call task_will_free_mem(current), some OOM victim
threads which are between post __gfp_pfmemalloc_flags(gfp_mask) and pre
mutex_trylock(&oom_lock) (a race window which that commit cannot close)
can call out_of_memory() without ever trying an ALLOC_OOM allocation.

Therefore, this patch allows OOM victims to use ALLOC_OOM watermark
for last second allocation attempt.

[1] 
http://lkml.kernel.org/r/e6c83a26-1d59-4afd-55cf-04e58bdde...@caviumnetworks.com

Fixes: 696453e66630ad45 ("mm, oom: task_will_free_mem should skip oom_reaped 
tasks")
Reported-by: Manish Jaggi 
Signed-off-by: Tetsuo Handa 
Cc: Michal Hocko 
Cc: Oleg Nesterov 
Cc: Vladimir Davydov 
Cc: David Rientjes 
---
 mm/page_alloc.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1607326..45e763e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4111,13 +4111,19 @@ struct page *alloc_pages_before_oomkill(const struct 
oom_control *oc)
 * !__GFP_NORETRY allocation which will never fail due to oom_lock
 * already held. And since this allocation attempt does not sleep,
 * there is no reason we must use high watermark here.
+* But anyway, make sure that OOM victims can try ALLOC_OOM watermark
+* in case they haven't tried ALLOC_OOM watermark.
 */
int alloc_flags = ALLOC_CPUSET | ALLOC_WMARK_HIGH;
gfp_t gfp_mask = oc->gfp_mask | __GFP_HARDWALL;
+   int reserve_flags;
 
if (!oc->ac)
return NULL;
gfp_mask &= ~__GFP_DIRECT_RECLAIM;
+   reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
+   if (reserve_flags)
+   alloc_flags = reserve_flags;
return get_page_from_freelist(gfp_mask, oc->order, alloc_flags, oc->ac);
 }
 
-- 
1.8.3.1


[PATCH v2 1/2] mm,oom: Move last second allocation to inside the OOM killer.

2017-11-02 Thread Tetsuo Handa
__alloc_pages_may_oom() is doing last second allocation attempt using
ALLOC_WMARK_HIGH before calling out_of_memory(). This had two reasons.

The first reason is explained in the comment that it aims to catch
potential parallel OOM killing. But there is no longer parallel OOM
killing (in the sense that out_of_memory() is called "concurrently")
because we serialize out_of_memory() calls using oom_lock.

The second reason is explained by Andrea Arcangeli (who added that code)
that it aims to reduce the likelihood of OOM livelocks and be sure to
invoke the OOM killer. There was a risk of livelock or anyway of delayed
OOM killer invocation if ALLOC_WMARK_MIN is used, for relying on last
few pages which are constantly allocated and freed in the meantime will
not improve the situation. But there is no longer a possibility of OOM
livelocks or of failing to invoke the OOM killer, because we have to mask
__GFP_DIRECT_RECLAIM for the last second allocation attempt anyway; the
oom_lock we already hold prevents the __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
allocations which the last second allocation attempt could indirectly
involve from failing.

However, parallel OOM killing still exists (in the sense that
out_of_memory() is called "consecutively"). Sometimes doing the last second
allocation attempt after selecting an OOM victim can succeed, because
somebody (maybe previously killed OOM victims) might have managed to free
memory while we were selecting an OOM victim, which can take quite some
time, for setting MMF_OOM_SKIP by exiting OOM victims is not serialized
by oom_lock. This suggests that giving up as soon as the ALLOC_WMARK_HIGH
attempt made before selecting an OOM victim fails can be premature.
Therefore, this patch moves the last second allocation attempt to after
selecting an OOM victim. This patch is expected to reduce the time window
for potentially premature OOM killing considerably.

Since the OOM killer does not always kill a process consuming a significant
amount of memory (the OOM killer kills the process with the highest OOM score,
or instead one of its children if any), there will be cases where
ALLOC_WMARK_HIGH fails and ALLOC_WMARK_MIN succeeds.
Since the gap between ALLOC_WMARK_HIGH and ALLOC_WMARK_MIN can be changed
by the /proc/sys/vm/min_free_kbytes parameter, using ALLOC_WMARK_MIN for the
last second allocation attempt might be better for minimizing the number of
OOM victims. But that change should be done in a separate patch. This patch
just clarifies that ALLOC_WMARK_HIGH is an arbitrary choice.

Signed-off-by: Tetsuo Handa 
Suggested-by: Michal Hocko 
Cc: Andrea Arcangeli 
Cc: Johannes Weiner 
---
 include/linux/oom.h | 13 +
 mm/oom_kill.c   | 14 ++
 mm/page_alloc.c | 41 -
 3 files changed, 51 insertions(+), 17 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 76aac4c..5ac2556 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -13,6 +13,8 @@
 struct notifier_block;
 struct mem_cgroup;
 struct task_struct;
+struct alloc_context;
+struct page;
 
 /*
  * Details of the page allocation that triggered the oom killer that are used 
to
@@ -37,6 +39,15 @@ struct oom_control {
 */
const int order;
 
+   /* Context for really last second allocation attempt. */
+   const struct alloc_context *ac;
+   /*
+* Set by the OOM killer if ac != NULL and last second allocation
+* attempt succeeded. If ac != NULL, the caller must check for
+* page != NULL.
+*/
+   struct page *page;
+
/* Used by oom implementation, do not set */
unsigned long totalpages;
struct task_struct *chosen;
@@ -101,6 +112,8 @@ extern unsigned long oom_badness(struct task_struct *p,
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
+extern struct page *alloc_pages_before_oomkill(const struct oom_control *oc);
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 26add8a..452e35c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1072,6 +1072,9 @@ bool out_of_memory(struct oom_control *oc)
if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
+   oc->page = alloc_pages_before_oomkill(oc);
+   if (oc->page)
+   return true;
get_task_struct(current);
oc->chosen = current;
oom_kill_process(oc, "Out of memory 
(oom_kill_allocating_task)");
@@ -1079,6 +1082,17 @@ bool out_of_memory(struct oom_control *oc)
}
 
select_bad_process(oc);
+   /*
+* Try really last second allocation attempt after we selected an OOM
+* v

Re: [PATCH 1/2] mm,oom: Move last second allocation to inside the OOM killer.

2017-11-02 Thread Tetsuo Handa
Michal Hocko wrote:
> I would really suggest you to stick with the changelog I have suggested.
> 
Well, I think that this patch needs to clarify why ALLOC_WMARK_HIGH is used.

> On Wed 01-11-17 20:54:27, Tetsuo Handa wrote:
> [...]
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 26add8a..118ecdb 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -870,6 +870,19 @@ static void oom_kill_process(struct oom_control *oc, 
> > const char *message)
> > }
> > task_unlock(p);
> >  
> > +   /*
> > +* Try really last second allocation attempt after we selected an OOM
> > +* victim, for somebody might have managed to free memory while we were
> > +* selecting an OOM victim which can take quite some time.
> > +*/
> > +   if (oc->ac) {
> > +   oc->page = alloc_pages_before_oomkill(oc);
> 
> I would stick the oc->ac check inside alloc_pages_before_oomkill.

OK.

> 
> > +   if (oc->page) {
> > +   put_task_struct(p);
> > +   return;
> > +   }
> > +   }
> > +
> > if (__ratelimit(&oom_rs))
> > dump_header(oc, p);
> >  
> > @@ -1081,6 +1094,16 @@ bool out_of_memory(struct oom_control *oc)
> > select_bad_process(oc);
> > /* Found nothing?!?! Either we hang forever, or we panic. */
> > if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
> > +   /*
> > +* Try really last second allocation attempt, for somebody
> > +* might have managed to free memory while we were trying to
> > +* find an OOM victim.
> > +*/
> > +   if (oc->ac) {
> > +   oc->page = alloc_pages_before_oomkill(oc);
> > +   if (oc->page)
> > +   return true;
> > +   }
> > dump_header(oc, NULL);
> > panic("Out of memory and no killable processes...\n");
> > }
> 
> Also, is there any strong reason to not do the last allocation after
> select_bad_process rather than having two call sites? I would understand
> that if you wanted to catch for_each_thread inside oom_kill_process but
> you are not doing that.

Unfortunately, we will after all have two call sites, because we have the
sysctl_oom_kill_allocating_task path.

The V2 patch follows. Andrea, will you check whether the changelog correctly
describes your intent of using the high watermark for the last second
allocation attempt?


Re: [RFC] EPOLL_KILLME: New flag to epoll_wait() that subscribes process to death row (new syscall)

2017-11-01 Thread Tetsuo Handa
On 2017/11/01 14:32, Shawn Landden wrote:
> @@ -1029,6 +1030,22 @@ bool out_of_memory(struct oom_control *oc)
>   return true;
>   }
>  
> + /*
> +  * Check death row.
> +  */
> + if (!list_empty(eventpoll_deathrow_list())) {
> + struct list_head *l = eventpoll_deathrow_list();

Unsafe traversal. List can become empty at this moment.

> + struct task_struct *ts = list_first_entry(l,
> +  struct task_struct, se.deathrow);
> +
> + pr_debug("Killing pid %u from EPOLL_KILLME death row.",
> + ts->pid);
> +
> + /* We use SIGKILL so as to cleanly interrupt ep_poll() */
> + kill_pid(task_pid(ts), SIGKILL, 1);

send_sig() ?

> + return true;
> + }
> +
>   /*
>* The OOM killer does not compensate for IO-less reclaim.
>* pagefault_out_of_memory lost its gfp context so we have to
> 

And why is

  static int oom_fd = open("/proc/self/oom_score_adj", O_WRONLY);

and then toggling between

  write(fd, "1000", 4);

and

  write(fd, "0", 1);

not sufficient? Adding a prctl() that does this might be handy, though.
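
For completeness, a sketch of that userspace approach (error handling
omitted; the helper names are mine):

  #include <fcntl.h>
  #include <unistd.h>

  static int oom_fd;

  static void oom_killme_init(void)
  {
          oom_fd = open("/proc/self/oom_score_adj", O_WRONLY);
  }

  static void oom_killme_enter(void) /* while idle: please kill me first */
  {
          write(oom_fd, "1000", 4);
  }

  static void oom_killme_leave(void) /* back to normal scoring */
  {
          write(oom_fd, "0", 1);
  }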


Re: [PATCH] mm,oom: Try last second allocation before and after selecting an OOM victim.

2017-11-01 Thread Tetsuo Handa
Michal Hocko wrote:
> > Does "that comment" refer to
> > 
> >   Elaborating the comment: the reason for the high wmark is to reduce
> >   the likelihood of livelocks and be sure to invoke the OOM killer, if
> >   we're still under pressure and reclaim just failed. The high wmark is
> >   used to be sure the failure of reclaim isn't going to be ignored. If
> >   using the min wmark like you propose there's risk of livelock or
> >   anyway of delayed OOM killer invocation.
> > 
> > part? Then, I know it is not about gfp flags.
> > 
> > But how can OOM livelock happen when the last second allocation does not
> > wait for memory reclaim (because __GFP_DIRECT_RECLAIM is masked) ?
> > The last second allocation shall return immediately, and we will call
> > out_of_memory() if the last second allocation failed.
> 
> I think Andrea just wanted to say that we do want to invoke OOM killer
> and resolve the memory pressure rather than keep looping in the
> reclaim/oom path just because there are few pages allocated and freed in
> the meantime.

I see. Then, that motivation no longer applies to the current code, except

> 
> [...]
> > > I am not sure such a scenario matters all that much because it assumes
> > > that the oom victim doesn't really free much memory [1] (basically less 
> > > than
> > > HIGH-MIN). Most OOM situation simply have a memory hog consuming
> > > significant amount of memory.
> > 
> > The OOM killer does not always kill a memory hog consuming significant 
> > amount
> > of memory. The OOM killer kills a process with highest OOM score (and 
> > instead
> > one of its children if any). I don't think that assuming an OOM victim will 
> > free
> > memory enough to succeed ALLOC_WMARK_HIGH is appropriate.
> 
> OK, so let's agree to disagree. I claim that we shouldn't care all that
> much. If any of the current heuristics turns out to lead to killing too
> many tasks then we should simply remove it rather than keep bloating an
> already complex code with more and more kluges.

using ALLOC_WMARK_HIGH might cause more OOM-killing than ALLOC_WMARK_MIN.

Thanks for clarification.


Re: [PATCH 2/2] mm,oom: Use ALLOC_OOM for OOM victim's last second allocation.

2017-11-01 Thread Tetsuo Handa
Michal Hocko wrote:
> On Wed 01-11-17 20:54:28, Tetsuo Handa wrote:
> > Since commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
> > oom_reaped tasks") changed task_will_free_mem(current) in out_of_memory()
> > to return false as soon as MMF_OOM_SKIP is set, many threads sharing the
> > victim's mm were not able to try allocation from memory reserves after the
> > OOM reaper gave up reclaiming memory.
> > 
> > Until Linux 4.7, we were using
> > 
> >   if (current->mm &&
> >   (fatal_signal_pending(current) || task_will_free_mem(current)))
> > 
> > as a condition to try allocation from memory reserves with the risk of OOM
> > lockup, but reports like [1] were impossible. Linux 4.8+ are regressed
> > compared to Linux 4.7 due to the risk of needlessly selecting more OOM
> > victims.
> 
> So what you are essentially saying is that there is a race window
> Proc1                              Proc2                    oom_reaper
> __alloc_pages_slowpath             out_of_memory
>   __gfp_pfmemalloc_flags             select_bad_process # Proc1
> [1] oom_reserves_allowed # false     oom_kill_process
>                                                              oom_reap_task
>   __alloc_pages_may_oom                                        __oom_reap_task_mm
>                                                                # doesn't unmap anything
>                                                                set_bit(MMF_OOM_SKIP)
>     out_of_memory
>       task_will_free_mem
> [2]     MMF_OOM_SKIP check # true
>       select_bad_process # Another victim
> 
> mostly because the above is an artificial workload which triggers the
> pathological path where nothing is really unmapped due to mlocked
> memory,

Right.

> which makes the race window (1-2) smaller than it usually is.

The race window (1-2) was larger than __oom_reap_task_mm() usually takes.

>   So
> this is pretty much a corner case which we want to address by making
> mlocked pages really reapable. Trying to use memory reserves for the
> oom victims reduces changes of the race.

Right. We cannot prevent non-OOM victims from calling oom_kill_process().
But preventing existing OOM victims from calling oom_kill_process() (by
allowing them to try an ALLOC_OOM allocation) can reduce subsequent OOM victims.

> 
> This would be really useful to have in the changelog IMHO.
> 
> > There is no need that the OOM victim is such malicious that consumes all
> > memory. It is possible that a multithreaded but non memory hog process is
> > selected by the OOM killer, and the OOM reaper fails to reclaim memory due
> > to e.g. khugepaged [2], and the process fails to try allocation from memory
> > reserves.
> 
> I am not sure about this part though. If the oom_reaper cannot take the
> mmap_sem then it retries for 1s. Have you ever seen the race to be that
> large?

As shown in [2], khugepaged can prevent oom_reaper from taking the mmap_sem
for 1 second. Also, it won't be impossible for OOM victims to spend 1 second
between post __gfp_pfmemalloc_flags(gfp_mask) and pre mutex_trylock(&oom_lock)
(in other words, the race window (1-2) above). Therefore, non-artificial
workloads could hit the same result.

> 
> > Therefore, this patch allows OOM victims to use ALLOC_OOM watermark
> > for last second allocation attempt.
> > 
> > [1] 
> > http://lkml.kernel.org/r/e6c83a26-1d59-4afd-55cf-04e58bdde...@caviumnetworks.com
> > [2] 
> > http://lkml.kernel.org/r/201708090835.ici69305.vffolmhostj...@i-love.sakura.ne.jp
> > 
> > Fixes: 696453e66630ad45 ("mm, oom: task_will_free_mem should skip 
> > oom_reaped tasks")
> > Reported-by: Manish Jaggi 
> > Signed-off-by: Tetsuo Handa 
> > Cc: Michal Hocko 
> > Cc: Oleg Nesterov 
> > Cc: Vladimir Davydov 
> > Cc: David Rientjes 
> > ---
> >  mm/page_alloc.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 6654f52..382ed57 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4112,9 +4112,14 @@ struct page *alloc_pages_before_oomkill(const struct 
> > oom_control *oc)
> >  * we're still under heavy pressure. But make sure that this reclaim
> >  * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
> >  * allocation which will never fail due to oom_lock already held.
> > + 

Re: [PATCH] mm,oom: Try last second allocation before and after selecting an OOM victim.

2017-11-01 Thread Tetsuo Handa
Michal Hocko wrote:
> On Wed 01-11-17 20:58:50, Tetsuo Handa wrote:
> > > > But doing ALLOC_OOM for last second allocation attempt from 
> > > > out_of_memory() involve
> > > > duplicating code (e.g. rebuilding zone list).
> > > 
> > > Why would you do it? Do not blindly copy and paste code without
> > > a good reason. What kind of problem does this actually solve?
> > 
> > prepare_alloc_pages()/finalise_ac() initializes as
> > 
> > ac->high_zoneidx = gfp_zone(gfp_mask);
> > ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> > ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
> >  ac->high_zoneidx, 
> > ac->nodemask);
> > 
> > and selecting as an OOM victim reinitializes as
> > 
> > ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
> > ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
> >  ac->high_zoneidx, 
> > ac->nodemask);
> > 
> > and I assume that this reinitialization might affect which memory reserve
> > the OOM victim allocates from.
> > 
> > You mean such difference is too trivial to care about?
> 
> You keep repeating what the _current_ code does without explaining _why_
> do we need the same thing in the oom path. Could you finaly answer my
> question please?

Because I consider that following what the current code does is reasonable
unless there are explicit reasons not to.

> 
> > > > What is your preferred approach?
> > > > Duplicate relevant code? Use get_page_from_freelist() without 
> > > > rebuilding the zone list?
> > > > Use __alloc_pages_nodemask() ?
> > > 
> > > Just do what we do now with ALLOC_WMARK_HIGH and in a separate patch use
> > > ALLOC_OOM for oom victims. There shouldn't be any reasons to play
> > > additional tricks here.
> > > 
> > 
> > Posted as 
> > http://lkml.kernel.org/r/1509537268-4726-1-git-send-email-penguin-ker...@i-love.sakura.ne.jp
> >  .
> > 
> > But I'm still unable to understand why "moving get_page_from_freelist to
> > oom_kill_process" is better than "copying get_page_from_freelist to
> > oom_kill_process", for "moving" increases possibility of allocation failures
> > when out_of_memory() is not called.
> 
> The changelog I have provided to you should answer that. It is highly
> unlikely there high wmark would succeed _right_ after we have just given
> up. If this assumption is not correct then we can _add_ such a call
> based on a real data rather than add more bloat "just because we used to
> do that". As I've said I completely hate the cargo cult programming. Do
> not add more.

I think that it is highly unlikely that the high wmark would succeed. But
I don't think that it is highly unlikely that the normal wmark would succeed.

> 
> > Also, I'm still unable to understand why
> > to use ALLOC_WMARK_HIGH. I think that using regular watermark for last 
> > second
> > allocation attempt is better (as described below).
> 
> If you believe that a standard wmark is sufficient then make it a
> separate patch with the full explanation why.
> 
> > __alloc_pages_may_oom() is doing last second allocation attempt using
> > ALLOC_WMARK_HIGH before calling out_of_memory(). This has two motivations.
> > The first one is explained by the comment that it aims to catch potential
> > parallel OOM killing and the second one was explained by Andrea Arcangeli
> > as follows:
> > : Elaborating the comment: the reason for the high wmark is to reduce
> > : the likelihood of livelocks and be sure to invoke the OOM killer, if
> > : we're still under pressure and reclaim just failed. The high wmark is
> > : used to be sure the failure of reclaim isn't going to be ignored. If
> > : using the min wmark like you propose there's risk of livelock or
> > : anyway of delayed OOM killer invocation.
> > 
> > But neither motivation applies to current code. Regarding the former,
> > there is no parallel OOM killing (in the sense that out_of_memory() is
> > called "concurrently") because we serialize out_of_memory() calls using
> > oom_lock. Regarding the latter, there is no possibility of OOM livelocks
> > nor possibility of failing to invoke the OOM killer because we mask
> > __GFP_DIRECT_RECLAIM for last second allocation attempt because oom_lock
> > prevents __GFP_DIRECT_RECLAIM && !__GFP_NORETRY allocations whic

Re: [PATCH] mm,oom: Try last second allocation before and after selecting an OOM victim.

2017-11-01 Thread Tetsuo Handa
Michal Hocko wrote:
> On Tue 31-10-17 22:51:49, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Tue 31-10-17 22:13:05, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > On Tue 31-10-17 21:42:23, Tetsuo Handa wrote:
> > > > > > > While both have some merit, the first reason is mostly historical
> > > > > > > because we have the explicit locking now and it is really 
> > > > > > > unlikely that
> > > > > > > the memory would be available right after we have given up trying.
> > > > > > > Last attempt allocation makes some sense of course but 
> > > > > > > considering that
> > > > > > > the oom victim selection is quite an expensive operation which 
> > > > > > > can take
> > > > > > > a considerable amount of time it makes much more sense to retry 
> > > > > > > the
> > > > > > > allocation after the most expensive part rather than before. 
> > > > > > > Therefore
> > > > > > > move the last attempt right before we are trying to kill an oom 
> > > > > > > victim
> > > > > > > to rule potential races when somebody could have freed a lot of 
> > > > > > > memory
> > > > > > > in the meantime. This will reduce the time window for potentially
> > > > > > > pre-mature OOM killing considerably.
> > > > > > 
> > > > > > But this is about "doing last second allocation attempt after 
> > > > > > selecting
> > > > > > an OOM victim". This is not about "allowing OOM victims to try 
> > > > > > ALLOC_OOM
> > > > > > before selecting next OOM victim" which is the actual problem I'm 
> > > > > > trying
> > > > > > to deal with.
> > > > > 
> > > > > then split it into two. First make the general case and then add a 
> > > > > more
> > > > > sophisticated on top. Dealing with multiple issues at once is what 
> > > > > makes
> > > > > all those brain cells suffer.
> > > > 
> > > > I'm failing to understand. I was dealing with single issue at once.
> > > > The single issue is "MMF_OOM_SKIP prematurely prevents OOM victims from 
> > > > trying
> > > > ALLOC_OOM before selecting next OOM victims". Then, what are the 
> > > > general case and
> > > > a more sophisticated? I wonder what other than "MMF_OOM_SKIP should 
> > > > allow OOM
> > > > victims to try ALLOC_OOM for once before selecting next OOM victims" 
> > > > can exist...
> > > 
> > > Try to think little bit out of your very specific and borderline usecase
> > > and it will become obvious. ALLOC_OOM is a trivial update on top of
> > > moving get_page_from_freelist to oom_kill_process which is a more
> > > generic race window reducer.
> > 
> > So, you meant "doing last second allocation attempt after selecting an OOM victim"
> > as the general case and "using ALLOC_OOM at last second allocation attempt" as a
> > more sophisticated. Then, you won't object conditionally switching ALLOC_WMARK_HIGH
> > and ALLOC_OOM for last second allocation attempt, will you?
> 
> yes for oom_victims

OK.

> 
> > But doing ALLOC_OOM for last second allocation attempt from out_of_memory() involve
> > duplicating code (e.g. rebuilding zone list).
> 
> Why would you do it? Do not blindly copy and paste code without
> a good reason. What kind of problem does this actually solve?

prepare_alloc_pages()/finalise_ac() initializes as

	ac->high_zoneidx = gfp_zone(gfp_mask);
	ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
	ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
					ac->high_zoneidx, ac->nodemask);

and being selected as an OOM victim reinitializes them as

	ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
	ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
					ac->high_zoneidx, ac->nodemask);

and I assume that this reinitialization might affect which memory reserve
the OOM victim allocates from.

You mean such difference is too trivial to care about?

> 
> > What is your preferred approach? Duplicate relevant code?
> > Use get_page_from_freelist() without rebuilding the zone list?
> > Use __alloc_pages_nodemask() ?

[PATCH 2/2] mm,oom: Use ALLOC_OOM for OOM victim's last second allocation.

2017-11-01 Thread Tetsuo Handa
if (current->mm &&
  (fatal_signal_pending(current) || task_will_free_mem(current)))

as a condition to try allocation from memory reserves with the risk of OOM
lockup, but reports like [1] were impossible. Linux 4.8+ are regressed
compared to Linux 4.7 due to the risk of needlessly selecting more OOM
victims.

The OOM victim does not need to be so malicious as to consume all memory.
It is possible that a multithreaded but non-memory-hog process is selected
by the OOM killer, that the OOM reaper fails to reclaim memory due to e.g.
khugepaged [2], and that the process fails to try allocation from memory
reserves. Therefore, this patch allows OOM victims to use the ALLOC_OOM
watermark for the last second allocation attempt.

[1] http://lkml.kernel.org/r/e6c83a26-1d59-4afd-55cf-04e58bdde...@caviumnetworks.com
[2] http://lkml.kernel.org/r/201708090835.ici69305.vffolmhostj...@i-love.sakura.ne.jp

Fixes: 696453e66630ad45 ("mm, oom: task_will_free_mem should skip oom_reaped tasks")
Reported-by: Manish Jaggi 
Signed-off-by: Tetsuo Handa 
Cc: Michal Hocko 
Cc: Oleg Nesterov 
Cc: Vladimir Davydov 
Cc: David Rientjes 
---
 mm/page_alloc.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6654f52..382ed57 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4112,9 +4112,14 @@ struct page *alloc_pages_before_oomkill(const struct oom_control *oc)
 * we're still under heavy pressure. But make sure that this reclaim
 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
 * allocation which will never fail due to oom_lock already held.
+* Also, make sure that OOM victims can try ALLOC_OOM watermark in case
+* they haven't tried ALLOC_OOM watermark.
 */
return get_page_from_freelist((oc->gfp_mask | __GFP_HARDWALL) &
  ~__GFP_DIRECT_RECLAIM, oc->order,
+ oom_reserves_allowed(current) &&
+ !(oc->gfp_mask & __GFP_NOMEMALLOC) ?
+ ALLOC_OOM :
  ALLOC_WMARK_HIGH|ALLOC_CPUSET, oc->ac);
 }
 
-- 
1.8.3.1


[PATCH 1/2] mm,oom: Move last second allocation to inside the OOM killer.

2017-11-01 Thread Tetsuo Handa
__alloc_pages_may_oom() is doing last second allocation attempt using
ALLOC_WMARK_HIGH before calling out_of_memory(). This has two motivations.
The first one is explained by the comment that it aims to catch potential
parallel OOM killing and the second one was explained by Andrea Arcangeli
as follows:
: Elaborating the comment: the reason for the high wmark is to reduce
: the likelihood of livelocks and be sure to invoke the OOM killer, if
: we're still under pressure and reclaim just failed. The high wmark is
: used to be sure the failure of reclaim isn't going to be ignored. If
: using the min wmark like you propose there's risk of livelock or
: anyway of delayed OOM killer invocation.

But there is no parallel OOM killing (in the sense that out_of_memory() is
called "concurrently") because we serialize out_of_memory() calls using
oom_lock. Regarding the latter, there is no possibility of OOM livelocks
nor of failing to invoke the OOM killer, because we mask __GFP_DIRECT_RECLAIM
for the last second allocation attempt; oom_lock being held would prevent any
__GFP_DIRECT_RECLAIM && !__GFP_NORETRY allocation which the last second
allocation attempt indirectly involves from ever failing.

However, parallel OOM killing still exists (in the sense that
out_of_memory() is called "consecutively"). Sometimes a last second
allocation attempt made after selecting an OOM victim can succeed, because
somebody might have managed to free memory while we were selecting an OOM
victim, which can take quite some time. This suggests that giving up as soon
as the ALLOC_WMARK_HIGH attempt made before selecting an OOM victim fails
can be premature.

Therefore, this patch moves the last second allocation attempt to after
selecting an OOM victim. This patch is expected to reduce the time window for
potentially premature OOM killing considerably, but it also means that the
last second allocation attempt always fails if out_of_memory() is not called.

Signed-off-by: Tetsuo Handa 
Suggested-by: Michal Hocko 
---
 include/linux/oom.h | 13 +
 mm/oom_kill.c   | 23 +++
 mm/page_alloc.c | 38 +-
 3 files changed, 57 insertions(+), 17 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 76aac4c..5ac2556 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -13,6 +13,8 @@
 struct notifier_block;
 struct mem_cgroup;
 struct task_struct;
+struct alloc_context;
+struct page;
 
 /*
  * Details of the page allocation that triggered the oom killer that are used to
@@ -37,6 +39,15 @@ struct oom_control {
 */
const int order;
 
+   /* Context for really last second allocation attempt. */
+   const struct alloc_context *ac;
+   /*
+* Set by the OOM killer if ac != NULL and last second allocation
+* attempt succeeded. If ac != NULL, the caller must check for
+* page != NULL.
+*/
+   struct page *page;
+
/* Used by oom implementation, do not set */
unsigned long totalpages;
struct task_struct *chosen;
@@ -101,6 +112,8 @@ extern unsigned long oom_badness(struct task_struct *p,
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
+extern struct page *alloc_pages_before_oomkill(const struct oom_control *oc);
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 26add8a..118ecdb 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -870,6 +870,19 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
}
task_unlock(p);
 
+   /*
+* Try really last second allocation attempt after we selected an OOM
+* victim, for somebody might have managed to free memory while we were
+* selecting an OOM victim which can take quite some time.
+*/
+   if (oc->ac) {
+   oc->page = alloc_pages_before_oomkill(oc);
+   if (oc->page) {
+   put_task_struct(p);
+   return;
+   }
+   }
+
if (__ratelimit(&oom_rs))
dump_header(oc, p);
 
@@ -1081,6 +1094,16 @@ bool out_of_memory(struct oom_control *oc)
select_bad_process(oc);
/* Found nothing?!?! Either we hang forever, or we panic. */
if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
+   /*
+* Try really last second allocation attempt, for somebody
+* might have managed to free memory while we were trying to
+* find an OOM victim.
+*/
+   if (oc->ac) {
+   oc->page = alloc_pages_before_oomkill(oc);
+   if (oc->page)
+   return true;
+   }
dump_h

Re: [PATCH] mm,oom: Try last second allocation before and after selecting an OOM victim.

2017-10-31 Thread Tetsuo Handa
Michal Hocko wrote:
> On Tue 31-10-17 22:13:05, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Tue 31-10-17 21:42:23, Tetsuo Handa wrote:
> > > > > While both have some merit, the first reason is mostly historical
> > > > > because we have the explicit locking now and it is really unlikely that
> > > > > the memory would be available right after we have given up trying.
> > > > > Last attempt allocation makes some sense of course but considering that
> > > > > the oom victim selection is quite an expensive operation which can take
> > > > > a considerable amount of time it makes much more sense to retry the
> > > > > allocation after the most expensive part rather than before. Therefore
> > > > > move the last attempt right before we are trying to kill an oom victim
> > > > > to rule potential races when somebody could have freed a lot of memory
> > > > > in the meantime. This will reduce the time window for potentially
> > > > > pre-mature OOM killing considerably.
> > > > 
> > > > But this is about "doing last second allocation attempt after selecting
> > > > an OOM victim". This is not about "allowing OOM victims to try ALLOC_OOM
> > > > before selecting next OOM victim" which is the actual problem I'm trying
> > > > to deal with.
> > > 
> > > then split it into two. First make the general case and then add a more
> > > sophisticated on top. Dealing with multiple issues at once is what makes
> > > all those brain cells suffer.
> > 
> > I'm failing to understand. I was dealing with single issue at once.
> > The single issue is "MMF_OOM_SKIP prematurely prevents OOM victims from trying
> > ALLOC_OOM before selecting next OOM victims". Then, what are the general case and
> > a more sophisticated? I wonder what other than "MMF_OOM_SKIP should allow OOM
> > victims to try ALLOC_OOM for once before selecting next OOM victims" can exist...
> 
> Try to think little bit out of your very specific and borderline usecase
> and it will become obvious. ALLOC_OOM is a trivial update on top of
> moving get_page_from_freelist to oom_kill_process which is a more
> generic race window reducer.

So, you meant "doing last second allocation attempt after selecting an OOM 
victim"
as the general case and "using ALLOC_OOM at last second allocation attempt" as a
more sophisticated. Then, you won't object conditionally switching 
ALLOC_WMARK_HIGH
and ALLOC_OOM for last second allocation attempt, will you?

But doing ALLOC_OOM for last second allocation attempt from out_of_memory() 
involve
duplicating code (e.g. rebuilding zone list). What is your preferred approach?
Duplicate relevant code? Use get_page_from_freelist() without rebuilding the 
zone list?
Use __alloc_pages_nodemask() ?


Re: [PATCH] mm,oom: Try last second allocation before and after selecting an OOM victim.

2017-10-31 Thread Tetsuo Handa
Michal Hocko wrote:
> On Tue 31-10-17 21:42:23, Tetsuo Handa wrote:
> > > While both have some merit, the first reason is mostly historical
> > > because we have the explicit locking now and it is really unlikely that
> > > the memory would be available right after we have given up trying.
> > > Last attempt allocation makes some sense of course but considering that
> > > the oom victim selection is quite an expensive operation which can take
> > > a considerable amount of time it makes much more sense to retry the
> > > allocation after the most expensive part rather than before. Therefore
> > > move the last attempt right before we are trying to kill an oom victim
> > > to rule potential races when somebody could have freed a lot of memory
> > > in the meantime. This will reduce the time window for potentially
> > > pre-mature OOM killing considerably.
> > 
> > But this is about "doing last second allocation attempt after selecting
> > an OOM victim". This is not about "allowing OOM victims to try ALLOC_OOM
> > before selecting next OOM victim" which is the actual problem I'm trying
> > to deal with.
> 
> then split it into two. First make the general case and then add a more
> sophisticated on top. Dealing with multiple issues at once is what makes
> all those brain cells suffer.

I'm failing to understand. I was dealing with single issue at once.
The single issue is "MMF_OOM_SKIP prematurely prevents OOM victims from trying
ALLOC_OOM before selecting next OOM victims". Then, what are the general case and
a more sophisticated? I wonder what other than "MMF_OOM_SKIP should allow OOM
victims to try ALLOC_OOM for once before selecting next OOM victims" can exist...


Re: [PATCH] mm,oom: Try last second allocation before and after selecting an OOM victim.

2017-10-31 Thread Tetsuo Handa
Michal Hocko wrote:
> On Tue 31-10-17 19:40:09, Tetsuo Handa wrote:
> > The reason I used __alloc_pages_slowpath() in alloc_pages_before_oomkill() is
> > to avoid duplicating code (such as checking for ALLOC_OOM and rebuilding zone
> > list) which needs to be maintained in sync with __alloc_pages_slowpath().
> >
> > If you don't like calling __alloc_pages_slowpath() from
> > alloc_pages_before_oomkill(), I'm OK with calling __alloc_pages_nodemask()
> > (with __GFP_DIRECT_RECLAIM/__GFP_NOFAIL cleared and __GFP_NOWARN set), for
> > direct reclaim functions can call __alloc_pages_nodemask() (with PF_MEMALLOC
> > set in order to avoid recursion of direct reclaim).
> > 
> > We are rebuilding zone list if selected as an OOM victim, for
> > __gfp_pfmemalloc_flags() returns ALLOC_OOM if oom_reserves_allowed(current)
> > is true.
> 
> So your answer is copy&paste without a deeper understanding, righ?

Right. I wanted to avoid duplicating code.
But I had to duplicate in order to allow OOM victims to try ALLOC_OOM.

> 
> [...]
> 
> > The reason I'm proposing this "mm,oom: Try last second allocation before and
> > after selecting an OOM victim." is that since oom_reserves_allowed(current) can
> > become true when current is between post __gfp_pfmemalloc_flags(gfp_mask) and
> > pre mutex_trylock(&oom_lock), an OOM victim can fail to try ALLOC_OOM attempt
> > before selecting next OOM victim when MMF_OOM_SKIP was set quickly.
> 
> ENOPARSE. I am not even going to finish this email sorry. This is way
> beyond my time budget.
> 
> Can you actually come with something that doesn't make ones head explode
> and yet describe what the actual problem is and how you deal with it?

http://lkml.kernel.org/r/201708191523.bjh90621.mhooffqsolj...@i-love.sakura.ne.jp
is least head exploding while it describes what the actual problem is and
how I deal with it.

> 
> E.g something like this
> "
> OOM killer is invoked after all the reclaim attempts have failed and
> there doesn't seem to be a viable chance for the situation to change.
> __alloc_pages_may_oom tries to reduce chances of a race during OOM
> handling by taking oom lock so only one caller is allowed to really
> invoke the oom killer.

OK.
> 
> __alloc_pages_may_oom also tries last time ALLOC_WMARK_HIGH allocation
> request before really invoking out_of_memory handler. This has two
> motivations. The first one is explained by the comment and it aims to
> catch potential parallel OOM killing and the second one was explained by
> Andrea Arcangeli as follows:
> : Elaborating the comment: the reason for the high wmark is to reduce
> : the likelihood of livelocks and be sure to invoke the OOM killer, if
> : we're still under pressure and reclaim just failed. The high wmark is
> : used to be sure the failure of reclaim isn't going to be ignored. If
> : using the min wmark like you propose there's risk of livelock or
> : anyway of delayed OOM killer invocation.
> 

OK.

> While both have some merit, the first reason is mostly historical
> because we have the explicit locking now and it is really unlikely that
> the memory would be available right after we have given up trying.
> Last attempt allocation makes some sense of course but considering that
> the oom victim selection is quite an expensive operation which can take
> a considerable amount of time it makes much more sense to retry the
> allocation after the most expensive part rather than before. Therefore
> move the last attempt right before we are trying to kill an oom victim
> to rule potential races when somebody could have freed a lot of memory
> in the meantime. This will reduce the time window for potentially
> pre-mature OOM killing considerably.

But this is about "doing last second allocation attempt after selecting
an OOM victim". This is not about "allowing OOM victims to try ALLOC_OOM
before selecting next OOM victim" which is the actual problem I'm trying
to deal with. Moving last second allocation attempt from "before" to
"after" does not solve the problem if ALLOC_OOM cannot be used. What I'm
proposing is to allow OOM victims to try ALLOC_OOM.

> "


Re: [PATCH] mm,oom: Try last second allocation before and after selecting an OOM victim.

2017-10-31 Thread Tetsuo Handa
e the OOM killer,

  I am not sure how much that reason applies to the current code but if it
  still applies then we should do the same for later
  last-minute-allocation as well. Having both and disagreeing is just a
  mess.

. Therefore, I proposed this "mm,oom: Try last second allocation before and
after selecting an OOM victim." which uses the same watermark, and this time
you are still worrying about stopping the use of ALLOC_WMARK_HIGH. You are
giving inconsistent messages here. If stopping the use of ALLOC_WMARK_HIGH has
some risk (which Andrea needs to clarify), we can't stop using ALLOC_WMARK_HIGH
here. But we have to allow using ALLOC_OOM (either before and/or after selecting
an OOM victim), which will result in the disagreement you don't like, for we
cannot stop using ALLOC_WMARK_HIGH when we need to use ALLOC_OOM in order to
solve the race window which [1] tries to handle.

[1] http://lkml.kernel.org/r/201708191523.bjh90621.mhooffqsolj...@i-love.sakura.ne.jp
[2] http://lkml.kernel.org/r/20170821121022.gf25...@dhcp22.suse.cz
[3] http://lkml.kernel.org/r/1503577106-9196-2-git-send-email-penguin-ker...@i-love.sakura.ne.jp
[4] http://lkml.kernel.org/r/20170825080020.ge25...@dhcp22.suse.cz

> 
> > +   return get_page_from_freelist(gfp_mask, oc->order, alloc_flags, ac);
> > +}
> > +
> >  static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
> > int preferred_nid, nodemask_t *nodemask,
> > struct alloc_context *ac, gfp_t *alloc_mask,



> On Sat 28-10-17 17:07:09, Tetsuo Handa wrote:
> > This patch splits last second allocation attempt into two locations, once
> > before selecting an OOM victim and again after selecting an OOM victim,
> > and uses normal watermark for last second allocation attempts.
> 
> Why do we need both?

For two reasons.

(1) You said

  E.g. do an allocation attempt
  _before_ we do any disruptive action (aka kill a victim). This would
  help other cases when we race with an exiting tasks or somebody managed
  to free memory while we were selecting an oom victim which can take
  quite some time.

at [2]. By doing really last second allocation attempt after we select
an OOM victim, we can remove oom_lock serialization from the OOM reaper
which is currently there in order to avoid racing with MMF_OOM_SKIP.

(2) I said

  Since we call panic() before calling oom_kill_process() when there is
  no killable process, panic() will be prematurely called which could
  have been avoided if [2] is used. For example, if a multithreaded
  application running with a dedicated CPUs/memory was OOM-killed, we
  can wait until ALLOC_OOM allocation fails to solve OOM situation.

   at [3]. By doing an almost last second allocation attempt before we select
   an OOM victim, we can avoid needlessly calling panic() when there are no
   eligible threads other than existing OOM victim threads, as well as avoid
   needlessly calling out_of_memory(), whose OOM victim selection you think
   can take quite some time.

> 
> > As of linux-2.6.11, nothing prevented from concurrently calling
> > out_of_memory(). TIF_MEMDIE test in select_bad_process() tried to avoid
> > needless OOM killing. Thus, it was safe to do __GFP_DIRECT_RECLAIM
> > allocation (apart from which watermark should be used) just before
> > calling out_of_memory().
> > 
> > As of linux-2.6.24, try_set_zone_oom() was added to
> > __alloc_pages_may_oom() by commit ff0ceb9deb6eb017 ("oom: serialize out
> > of memory calls") which effectively started acting as a kind of today's
> > mutex_trylock(&oom_lock).
> > 
> > As of linux-4.2, try_set_zone_oom() was replaced with oom_lock by
> > commit dc56401fc9f25e8f ("mm: oom_kill: simplify OOM killer locking").
> > At least by this time, it became no longer safe to do
> > __GFP_DIRECT_RECLAIM allocation with oom_lock held.
> > 
> > And as of linux-4.13, last second allocation attempt stopped using
> > __GFP_DIRECT_RECLAIM by commit e746bf730a76fe53 ("mm,page_alloc: don't
> > call __node_reclaim() with oom_lock held.").
> > 
> > Therefore, there is no longer valid reason to use ALLOC_WMARK_HIGH for
> > last second allocation attempt [1].
> 
> Another reason to use the high watermark as explained by Andrea was
> "
> : Elaborating the comment: the reason for the high wmark is to reduce
> : the likelihood of livelocks and be sure to invoke the OOM killer, if
> : we're still under pressure and reclaim just failed. The high wmark is
> : used to be sure the failure of reclaim isn't going to be ignored. If
> : using the min wmark like you propose there's risk of livelock or
> : anyway of delayed OOM killer invocation.

[PATCH] mm,oom: Try last second allocation before and after selecting an OOM victim.

2017-10-28 Thread Tetsuo Handa
This patch splits last second allocation attempt into two locations, once
before selecting an OOM victim and again after selecting an OOM victim,
and uses normal watermark for last second allocation attempts.

As of linux-2.6.11, nothing prevented out_of_memory() from being called
concurrently. The TIF_MEMDIE test in select_bad_process() tried to avoid
needless OOM killing. Thus, it was safe to do __GFP_DIRECT_RECLAIM
allocation (apart from which watermark should be used) just before
calling out_of_memory().

As of linux-2.6.24, try_set_zone_oom() was added to
__alloc_pages_may_oom() by commit ff0ceb9deb6eb017 ("oom: serialize out
of memory calls") which effectively started acting as a kind of today's
mutex_trylock(&oom_lock).

As of linux-4.2, try_set_zone_oom() was replaced with oom_lock by
commit dc56401fc9f25e8f ("mm: oom_kill: simplify OOM killer locking").
At least by this time, it became no longer safe to do
__GFP_DIRECT_RECLAIM allocation with oom_lock held.

And as of linux-4.13, last second allocation attempt stopped using
__GFP_DIRECT_RECLAIM by commit e746bf730a76fe53 ("mm,page_alloc: don't
call __node_reclaim() with oom_lock held.").

Therefore, there is no longer valid reason to use ALLOC_WMARK_HIGH for
last second allocation attempt [1]. And this patch changes it to do a normal
allocation attempt, with handling of ALLOC_OOM added in order to mitigate the
extra OOM victim selection problem reported by Manish Jaggi [2].

Doing really last second allocation attempt after selecting an OOM victim
will also help the OOM reaper to start reclaiming memory without waiting
for oom_lock to be released.

[1] http://lkml.kernel.org/r/20160128163802.ga15...@dhcp22.suse.cz
[2] http://lkml.kernel.org/r/e6c83a26-1d59-4afd-55cf-04e58bdde...@caviumnetworks.com

Signed-off-by: Tetsuo Handa 
Fixes: 696453e66630ad45 ("mm, oom: task_will_free_mem should skip oom_reaped tasks")
Cc: Michal Hocko 
Cc: Oleg Nesterov 
Cc: Andrea Arcangeli 
Cc: Johannes Weiner 
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Cc: Vladimir Davydov 
Cc: David Rientjes 
Cc: Manish Jaggi 
---
 include/linux/oom.h | 13 +
 mm/oom_kill.c   | 13 +
 mm/page_alloc.c | 47 ++-
 3 files changed, 60 insertions(+), 13 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 76aac4c..eb92aa8 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -13,6 +13,8 @@
 struct notifier_block;
 struct mem_cgroup;
 struct task_struct;
+struct alloc_context;
+struct page;
 
 /*
  * Details of the page allocation that triggered the oom killer that are used to
@@ -37,6 +39,15 @@ struct oom_control {
 */
const int order;
 
+   /* Context for really last second allocation attempt. */
+   struct alloc_context *ac;
+   /*
+* Set by the OOM killer if ac != NULL and last second allocation
+* attempt succeeded. If ac != NULL, the caller must check for
+* page != NULL.
+*/
+   struct page *page;
+
/* Used by oom implementation, do not set */
unsigned long totalpages;
struct task_struct *chosen;
@@ -101,6 +112,8 @@ extern unsigned long oom_badness(struct task_struct *p,
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
+extern struct page *alloc_pages_before_oomkill(struct oom_control *oc);
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 26add8a..dcde1d5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -870,6 +870,19 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
}
task_unlock(p);
 
+   /*
+* Try really last second allocation attempt after we selected an OOM
+* victim, for somebody might have managed to free memory while we were
+* selecting an OOM victim which can take quite some time.
+*/
+   if (oc->ac) {
+   oc->page = alloc_pages_before_oomkill(oc);
+   if (oc->page) {
+   put_task_struct(p);
+   return;
+   }
+   }
+
if (__ratelimit(&oom_rs))
dump_header(oc, p);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 97687b3..ba0ef7b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3265,7 +3265,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 
 static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
-   const struct alloc_context *ac, unsigned long *did_some_progress)
+   struct alloc_context *ac, unsigned long *did_some_progress)
 {
struct oom_control oc = {
.zonelist = ac->zonelist,
@@ -3273,6 +3273,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
.memcg = NULL,
.gf

Re: [PATCH] fs, mm: account filp and names caches to kmemcg

2017-10-26 Thread Tetsuo Handa
On 2017/10/26 16:49, Michal Hocko wrote:
> On Wed 25-10-17 15:49:21, Greg Thelen wrote:
>> Johannes Weiner  wrote:
>>
>>> On Wed, Oct 25, 2017 at 09:00:57PM +0200, Michal Hocko wrote:
> [...]
 So just to make it clear you would be OK with the retry on successful
 OOM killer invocation and force charge on oom failure, right?
>>>
>>> Yeah, that sounds reasonable to me.
>>
>> Assuming we're talking about retrying within try_charge(), then there's
>> a detail to iron out...
>>
>> If there is a pending oom victim blocked on a lock held by try_charge() caller
>> (the "#2 Locks" case), then I think repeated calls to out_of_memory() will
>> return true until the victim either gets MMF_OOM_SKIP or disappears.
> 
> true. And oom_reaper guarantees that MMF_OOM_SKIP gets set in the finit
> amount of time.

Just a confirmation. You are talking about kmemcg, aren't you? And kmemcg
depends on CONFIG_MMU=y, doesn't it? If not, there is no such guarantee.

> 
>> So a force charge fallback might be a needed even with oom killer successful
>> invocations. Or we'll need to teach out_of_memory() to return three values
>> (e.g. NO_VICTIM, NEW_VICTIM, PENDING_VICTIM) and try_charge() can loop on
>> NEW_VICTIM.
> 
> No we, really want to wait for the oom victim to do its job. The only
> thing we should be worried about is when out_of_memory doesn't invoke
> the reaper. There is only one case like that AFAIK - GFP_NOFS request. I
> have to think about this case some more. We currently fail in that case
> the request.
> 

Do we really need to apply

/*
 * The OOM killer does not compensate for IO-less reclaim.
 * pagefault_out_of_memory lost its gfp context so we have to
 * make sure exclude 0 mask - all other users should have at least
 * ___GFP_DIRECT_RECLAIM to get here.
 */
if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
return true;

unconditionally?

We can encourage !__GFP_FS allocations to use __GFP_NORETRY or
__GFP_RETRY_MAYFAIL if their allocations are not important.
Then, only important !__GFP_FS allocations will be checked here.
I think that we can allow such important allocations to invoke the OOM
killer (i.e. remove this check) because the situation is already hopeless
if important !__GFP_FS allocations cannot make progress.
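
As a hedged illustration of that split (the call sites below are hypothetical,
not taken from any patch in this thread), an unimportant !__GFP_FS allocation
can opt out explicitly, so that only the important ones would ever reach the
check above:

	/* Unimportant: fail fast instead of involving the OOM killer. */
	buf = kmalloc(len, GFP_NOFS | __GFP_NORETRY);
	if (!buf)
		return -ENOMEM;

	/* Important: plain GFP_NOFS; with the check above removed, this
	 * allocation would be allowed to invoke the OOM killer. */
	tbl = kmalloc(len, GFP_NOFS);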



[PATCH] mm: don't warn about allocations which stall for too long

2017-10-26 Thread Tetsuo Handa
Commit 63f53dea0c9866e9 ("mm: warn about allocations which stall for too
long") was a great step for reducing possibility of silent hang up problem
caused by memory allocation stalls. But this commit reverts it, for it is
possible to trigger OOM lockup and/or soft lockups when many threads
concurrently called warn_alloc() (in order to warn about memory allocation
stalls) due to current implementation of printk(), and it is difficult to
obtain useful information due to limitation of synchronous warning
approach.

Current printk() implementation flushes all pending logs using the context
of a thread which called console_unlock(). printk() should be able to flush
all pending logs eventually unless somebody continues appending to printk()
buffer.

Since warn_alloc() started appending to printk() buffer while waiting for
oom_kill_process() to make forward progress when oom_kill_process() is
processing pending logs, it became possible for warn_alloc() to force
oom_kill_process() to loop inside printk(). As a result, warn_alloc()
significantly increased the possibility of preventing oom_kill_process() from
making forward progress.

-- Pseudo code start --
Before warn_alloc() was introduced:

  retry:
if (mutex_trylock(&oom_lock)) {
  while (atomic_read(&printk_pending_logs) > 0) {
atomic_dec(&printk_pending_logs);
print_one_log();
  }
  // Send SIGKILL here.
  mutex_unlock(&oom_lock)
}
goto retry;

After warn_alloc() was introduced:

  retry:
if (mutex_trylock(&oom_lock)) {
  while (atomic_read(&printk_pending_logs) > 0) {
atomic_dec(&printk_pending_logs);
print_one_log();
  }
  // Send SIGKILL here.
  mutex_unlock(&oom_lock)
} else if (waited_for_10seconds()) {
  atomic_inc(&printk_pending_logs);
}
goto retry;
-- Pseudo code end --

Although waited_for_10seconds() becomes true only once per 10 seconds, an
unbounded number of threads can call waited_for_10seconds() at the same time.
Also, since the threads doing waited_for_10seconds() keep spinning in an almost
busy loop, the thread doing print_one_log() gets little CPU time. Therefore,
this situation can be simplified as follows

-- Pseudo code start --
  retry:
if (mutex_trylock(&oom_lock)) {
  while (atomic_read(&printk_pending_logs) > 0) {
atomic_dec(&printk_pending_logs);
print_one_log();
  }
  // Send SIGKILL here.
  mutex_unlock(&oom_lock)
} else {
  atomic_inc(&printk_pending_logs);
}
goto retry;
-- Pseudo code end --

when printk() is called faster than print_one_log() can process a log.

One possible mitigation would be to introduce a new lock in order to
make sure that no other series of printk() calls (either oom_kill_process() or
warn_alloc()) can append to the printk() buffer while one series of printk()
calls (either oom_kill_process() or warn_alloc()) is already in progress. Such
serialization would also help in obtaining kernel messages in a readable form.

-- Pseudo code start --
  retry:
if (mutex_trylock(&oom_lock)) {
  mutex_lock(&oom_printk_lock);
  while (atomic_read(&printk_pending_logs) > 0) {
atomic_dec(&printk_pending_logs);
print_one_log();
  }
  // Send SIGKILL here.
  mutex_unlock(&oom_printk_lock);
  mutex_unlock(&oom_lock)
} else {
  if (mutex_trylock(&oom_printk_lock)) {
atomic_inc(&printk_pending_logs);
mutex_unlock(&oom_printk_lock);
  }
}
goto retry;
-- Pseudo code end --

But this commit does not go that direction, for we don't want to introduce
a new lock dependency, and we are unlikely to be able to obtain useful
information even if we serialized oom_kill_process() and warn_alloc().

The synchronous approach is prone to unexpected results (e.g. too late [1],
too frequent [2], overlooked [3]). As far as I know, warn_alloc() never helped
with providing information other than "something is going wrong".
I want to consider an asynchronous approach which can obtain information
during stalls from possibly relevant threads (e.g. the owner of oom_lock
and kswapd-like threads) and serve as a trigger for actions (e.g. turning
tracepoints on/off, or asking the libvirt daemon to take a memory dump of a
stalling KVM guest for diagnostic purposes).

This commit temporarily loses the ability to report e.g. an OOM lockup caused
by being unable to invoke the OOM killer due to a !__GFP_FS allocation request.
But an asynchronous approach will be able to detect such a situation and emit
a warning. Thus, let's remove warn_alloc().

[1] https://bugzilla.kernel.org/show_bug.cgi?id=192981
[2] http://lkml.kernel.org/r/cam_iqpwupvgc2ky8m-9yukects+zkjidasnymx7rmcbjbfy...@mail.gmail.com
[3] commit db73ee0d46379922 ("mm, vmscan: do not loop on too_many_isolated for ever")

Signed-off-by

Re: [PATCH] mm,page_alloc: Serialize out_of_memory() and allocation stall messages.

2017-10-25 Thread Tetsuo Handa
Tetsuo Handa wrote:
> While warn_alloc() messages are completely unreadable, what we should note are that
> 
>  (a) out_of_memory() => oom_kill_process() => dump_header() => show_mem() => printk()
>  got stuck at console_unlock() despite this is schedulable context.
> 
> --
> 2180:   for (;;) {
> 2181:   struct printk_log *msg;
> 2182:   size_t ext_len = 0;
> 2183:   size_t len;
> 2184:
> 2185:   printk_safe_enter_irqsave(flags);
> 2186:   raw_spin_lock(&logbuf_lock);
> (...snipped...)
> 2228:   console_idx = log_next(console_idx);
> 2229:   console_seq++;
> 2230:   raw_spin_unlock(&logbuf_lock);
> 2231:
> 2232:   stop_critical_timings();/* don't trace print latency */
> 2233:   call_console_drivers(ext_text, ext_len, text, len);
> 2234:   start_critical_timings();
> 2235:   printk_safe_exit_irqrestore(flags); // console_unlock+0x24e/0x4c0 is here.
> 2236:
> 2237:   if (do_cond_resched)
> 2238:   cond_resched();
> 2239:   }
> --

It turned out that cond_resched() was not called because do_cond_resched == 0,
because preemptible() == 0, because CONFIG_PREEMPT_COUNT=n despite
CONFIG_PREEMPT_VOLUNTARY=y, for CONFIG_PREEMPT_VOLUNTARY itself does not select
CONFIG_PREEMPT_COUNT. Surprising...
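
For reference, a minimal sketch of why that happens, based on my reading of
include/linux/preempt.h around this kernel version (treat it as a sketch and
double-check against your tree): without CONFIG_PREEMPT_COUNT, preemptible()
is hard-wired to 0, so console_unlock() never enables its cond_resched() path.

	/* include/linux/preempt.h (sketch, not a verbatim copy) */
	#ifdef CONFIG_PREEMPT_COUNT
	#define preemptible()	(preempt_count() == 0 && !irqs_disabled())
	#else
	#define preemptible()	0	/* so do_cond_resched stays 0 */
	#endif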


Re: [PATCH] mm,page_alloc: Serialize out_of_memory() and allocation stall messages.

2017-10-24 Thread Tetsuo Handa
Tetsuo Handa wrote:
> Michal Hocko wrote:
> > Hell no! I've tried to be patient with you but it seems that is just
> > pointless waste of time. Such an approach is absolutely not acceptable.
> > You are adding an additional lock dependency into the picture. Say that
> > there is somebody stuck in warn_alloc path and cannot make a further
> > progress because printk got stuck. Now you are blocking oom_kill_process
> > as well. So the cure might be even worse than the problem.
> 
> Sigh... printk() can't get stuck unless somebody continues appending to
> printk() buffer. Otherwise, printk() cannot be used from arbitrary context.
> 
> You had better stop calling printk() with oom_lock held if you consider that
> printk() can get stuck.
> 

To explain how stupid the printk() versus oom_lock dependency is, here is a
patch for reproducing a soft lockup caused by uncontrolled warn_alloc().

The patch below is against 6cff0a118f23b98c ("Merge tag 'platform-drivers-x86-v4.14-3' of
git://git.infradead.org/linux-platform-drivers-x86") and intentionally tries to
let the thread holding oom_lock get stuck at printk(). It does not change
functionality; it only changes frequency/timing in order to emulate the worst
situation.

--
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 77e4d3c..4c43f83 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3207,7 +3207,7 @@ static void warn_alloc_show_mem(gfp_t gfp_mask, nodemask_t *nodemask)
unsigned int filter = SHOW_MEM_FILTER_NODES;
static DEFINE_RATELIMIT_STATE(show_mem_rs, HZ, 1);
 
-   if (should_suppress_show_mem() || !__ratelimit(&show_mem_rs))
+   if (should_suppress_show_mem())
return;
 
/*
@@ -3232,7 +3232,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
static DEFINE_RATELIMIT_STATE(nopage_rs, DEFAULT_RATELIMIT_INTERVAL,
  DEFAULT_RATELIMIT_BURST);
 
-   if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs))
+   if ((gfp_mask & __GFP_NOWARN))
return;
 
pr_warn("%s: ", current->comm);
@@ -4002,7 +4002,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
goto nopage;
 
/* Make sure we know about allocations which stall for too long */
-   if (time_after(jiffies, alloc_start + stall_timeout)) {
+   if (__mutex_owner(&oom_lock)) {
warn_alloc(gfp_mask & ~__GFP_NOWARN, ac->nodemask,
"page allocation stalls for %ums, order:%u",
jiffies_to_msecs(jiffies-alloc_start), order);
--

Enable softlockup_panic so that we can know where the thread got stuck.

--
echo 9 > /proc/sys/kernel/printk
echo 1 > /proc/sys/kernel/sysrq
echo 1 > /proc/sys/kernel/softlockup_panic
--

Memory stressor is shown below. All processes run on CPU 0.

--
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <signal.h>
#include <sched.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/prctl.h>
#include <sys/mman.h>

int main(int argc, char *argv[])
{
struct sched_param sp = { 0 };
cpu_set_t cpu = { { 1 } };
static int pipe_fd[2] = { EOF, EOF };
char *buf = NULL;
unsigned long size = 0;
unsigned int i;
int fd;
pipe(pipe_fd);
signal(SIGCLD, SIG_IGN);
if (fork() == 0) {
prctl(PR_SET_NAME, (unsigned long) "first-victim", 0, 0, 0);
while (1)
pause();
}
close(pipe_fd[1]);
sched_setaffinity(0, sizeof(cpu), &cpu);
prctl(PR_SET_NAME, (unsigned long) "normal-priority", 0, 0, 0);
for (i = 0; i < 1024; i++)
if (fork() == 0) {
char c;
/* Wait until the first-victim is OOM-killed. */
read(pipe_fd[0], &c, 1);
/* Try to consume as much CPU time as possible. */
while(1) {
void *ptr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, EOF, 0);
munmap(ptr, 4096);
}
_exit(0);
}
close(pipe_fd[0]);
fd = open("/dev/zero", O_RDONLY);
for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
char *cp = realloc(buf, size);
if (!cp) {
size >>= 1;
break;
}
buf = cp;
}
sched_setscheduler(0, SCHED_IDLE, &sp);
prctl(PR_SET_NAME, (unsigned long) "idle-priority", 0, 0, 0);
while (size) {
int ret = read(fd, buf, size); /*

Re: [PATCH v1 1/3] virtio-balloon: replace the coarse-grained balloon_lock

2017-10-22 Thread Tetsuo Handa
Wei Wang wrote:
> >> @@ -162,20 +160,20 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
> >>msleep(200);
> >>break;
> >>}
> >> -  set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> >> -  vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
> >> +  set_page_pfns(vb, pfns + num_pfns, page);
> >>if (!virtio_has_feature(vb->vdev,
> >>VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> >>adjust_managed_page_count(page, -1);
> >>}
> >>   
> >> -  num_allocated_pages = vb->num_pfns;
> >> +  mutex_lock(&vb->inflate_lock);
> >>/* Did we get any? */
> >> -  if (vb->num_pfns != 0)
> >> -  tell_host(vb, vb->inflate_vq);
> >> -  mutex_unlock(&vb->balloon_lock);
> >> +  if (num_pfns != 0)
> >> +  tell_host(vb, vb->inflate_vq, pfns, num_pfns);
> >> +  mutex_unlock(&vb->inflate_lock);
> >> +  atomic64_add(num_pfns, &vb->num_pages);
> > Isn't this addition too late? If leak_balloon() is called due to
> > out_of_memory(), it will fail to find up to dated vb->num_pages value.
> 
> Not really. I think the old way of implementation above:
> "vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE"
> isn't quite accurate, because "vb->num_page" should reflect the number of
> pages that have already been inflated, which means those pages have
> already been given to the host via "tell_host()".
> 
> If we update "vb->num_page" earlier before tell_host(), then it will include
> the pages that haven't been given to the host, which I think shouldn't be
> counted as inflated pages.
> 
> On the other hand, OOM will use leak_balloon() to release the pages that should
> have already been inflated.

But leak_balloon() finds max inflated pages from vb->num_pages, doesn't it?

> 
> >>   
> >>/* We can only do one array worth at a time. */
> >> -  num = min(num, ARRAY_SIZE(vb->pfns));
> >> +  num = min_t(size_t, num, VIRTIO_BALLOON_ARRAY_PFNS_MAX);
> >>   
> >> -  mutex_lock(&vb->balloon_lock);
> >>/* We can't release more pages than taken */
> >> -  num = min(num, (size_t)vb->num_pages);
> >> -  for (vb->num_pfns = 0; vb->num_pfns < num;
> >> -   vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> >> +  num = min_t(size_t, num, atomic64_read(&vb->num_pages));
> >> +  for (num_pfns = 0; num_pfns < num;
> >> +   num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> >>page = balloon_page_dequeue(vb_dev_info);
> > If balloon_page_dequeue() can be concurrently called by both host's request
> > and guest's OOM event, is (!dequeued_page) test in balloon_page_dequeue() 
> > safe?
> 
> 
> I'm not sure about the question. The "dequeue_page" is a local variable
> in the function, why would it be unsafe for two invocations (the shared
> b_dev_info->pages are operated under a lock)?

I'm neither an MM person nor a virtio person. I'm commenting from the point of
view of safe programming. My question is: isn't there a possibility of hitting

	if (unlikely(list_empty(&b_dev_info->pages) &&
		     !b_dev_info->isolated_pages))
		BUG();

when things run concurrently?

Wei Wang wrote:
> On 10/22/2017 12:11 PM, Tetsuo Handa wrote:
> > Michael S. Tsirkin wrote:
> >>> - num_freed_pages = leak_balloon(vb, oom_pages);
> >>> +
> >>> + /* Don't deflate more than the number of inflated pages */
> >>> + while (npages && atomic64_read(&vb->num_pages))
> >>> + npages -= leak_balloon(vb, npages);
> > don't we need to abort if leak_balloon() returned 0 for some reason?
> 
> I don't think so. Returning 0 should be a normal case when the host tries
> to give back some pages to the guest, but there is no pages that have ever
> been inflated. For example, right after booting the guest, the host sends a
> deflating request to give the guest 1G memory, leak_balloon should return 0,
> and guest wouldn't get 1 more G memory.
> 
My question is: isn't there a possibility of leak_balloon() returning 0 for
reasons other than vb->num_pages == 0? If so, this can cause an infinite loop
(i.e. a lockup) when things run concurrently.
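
A minimal sketch of the kind of guard I have in mind (it keeps the signatures
used in this patch and is untested, so treat it as an illustration rather than
a fix):

	/* Bail out as soon as leak_balloon() makes no progress, so that a
	 * 0 return can never turn this into a busy loop. */
	while (npages && atomic64_read(&vb->num_pages)) {
		unsigned int freed = leak_balloon(vb, npages);

		if (!freed)
			break;
		npages -= freed;
	}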


Re: [PATCH v1 1/3] virtio-balloon: replace the coarse-grained balloon_lock

2017-10-21 Thread Tetsuo Handa
Wei Wang wrote:
> The balloon_lock was used to synchronize the access demand to elements
> of struct virtio_balloon and its queue operations (please see commit
> e22504296d). This prevents the concurrent run of the leak_balloon and
> fill_balloon functions, thereby resulting in a deadlock issue on OOM:
> 
> fill_balloon: take balloon_lock and wait for OOM to get some memory;
> oom_notify: release some inflated memory via leak_balloon();
> leak_balloon: wait for balloon_lock to be released by fill_balloon.
> 
> This patch breaks the lock into two fine-grained inflate_lock and
> deflate_lock, and eliminates the unnecessary use of the shared data
> (i.e. vb->pnfs, vb->num_pfns). This enables leak_balloon and
> fill_balloon to run concurrently and solves the deadlock issue.
> 

> @@ -162,20 +160,20 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>   msleep(200);
>   break;
>   }
> - set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> - vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
> + set_page_pfns(vb, pfns + num_pfns, page);
>   if (!virtio_has_feature(vb->vdev,
>   VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
>   adjust_managed_page_count(page, -1);
>   }
>  
> - num_allocated_pages = vb->num_pfns;
> + mutex_lock(&vb->inflate_lock);
>   /* Did we get any? */
> - if (vb->num_pfns != 0)
> - tell_host(vb, vb->inflate_vq);
> - mutex_unlock(&vb->balloon_lock);
> + if (num_pfns != 0)
> + tell_host(vb, vb->inflate_vq, pfns, num_pfns);
> + mutex_unlock(&vb->inflate_lock);
> + atomic64_add(num_pfns, &vb->num_pages);

Isn't this addition too late? If leak_balloon() is called due to
out_of_memory(), it will fail to find an up-to-date vb->num_pages value.
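
To make the ordering I have in mind concrete, here is a hedged sketch (it
reuses the tell_host() signature introduced by this patch and is untested;
the point is only that the counter is published before telling the host, so
that a concurrent OOM-triggered leak_balloon() can see the pages being
inflated):

	atomic64_add(num_pfns, &vb->num_pages);
	mutex_lock(&vb->inflate_lock);
	/* Did we get any? */
	if (num_pfns != 0)
		tell_host(vb, vb->inflate_vq, pfns, num_pfns);
	mutex_unlock(&vb->inflate_lock);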

>  
> - return num_allocated_pages;
> + return num_pfns;
>  }
>  
>  static void release_pages_balloon(struct virtio_balloon *vb,
> @@ -194,38 +192,39 @@ static void release_pages_balloon(struct virtio_balloon *vb,
>  
>  static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  {
> - unsigned num_freed_pages;
>   struct page *page;
>   struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>   LIST_HEAD(pages);
> + unsigned int num_pfns;
> + __virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];

This array consumes 1024 bytes of kernel stack, doesn't it?
leak_balloon() might be called from out_of_memory(), where the kernel stack
is already largely consumed before entering __alloc_pages_nodemask().
To reduce the possibility of stack overflow, and since out_of_memory() is
serialized by oom_lock, I suggest using a static (maybe kmalloc()ed as
vb->oom_pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX]) buffer when called from
out_of_memory().
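
A rough sketch of that suggestion (the vb->oom_pfns member and the allocation
site are hypothetical; the point is that the OOM path would reuse a buffer
allocated once instead of a 1024-byte on-stack array):

	/* At probe time, once: */
	vb->oom_pfns = kmalloc_array(VIRTIO_BALLOON_ARRAY_PFNS_MAX,
				     sizeof(__virtio32), GFP_KERNEL);
	if (!vb->oom_pfns)
		return -ENOMEM;

	/* The OOM notifier path, serialized by oom_lock, would then pass
	 * vb->oom_pfns to tell_host() instead of a local pfns[] array. */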

>  
>   /* We can only do one array worth at a time. */
> - num = min(num, ARRAY_SIZE(vb->pfns));
> + num = min_t(size_t, num, VIRTIO_BALLOON_ARRAY_PFNS_MAX);
>  
> - mutex_lock(&vb->balloon_lock);
>   /* We can't release more pages than taken */
> - num = min(num, (size_t)vb->num_pages);
> - for (vb->num_pfns = 0; vb->num_pfns < num;
> -  vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> + num = min_t(size_t, num, atomic64_read(&vb->num_pages));
> + for (num_pfns = 0; num_pfns < num;
> +  num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
>   page = balloon_page_dequeue(vb_dev_info);

If balloon_page_dequeue() can be concurrently called by both host's request
and guest's OOM event, is (!dequeued_page) test in balloon_page_dequeue() safe?
Is such concurrency needed?

>   if (!page)
>   break;
> - set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> + set_page_pfns(vb, pfns + num_pfns, page);
>   list_add(&page->lru, &pages);
> - vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
>   }
>  
> - num_freed_pages = vb->num_pfns;
>   /*
>* Note that if
>* virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
>* is true, we *have* to do it in this order
>*/
> - if (vb->num_pfns != 0)
> - tell_host(vb, vb->deflate_vq);
> + mutex_lock(&vb->deflate_lock);
> + if (num_pfns != 0)
> + tell_host(vb, vb->deflate_vq, pfns, num_pfns);
> + mutex_unlock(&vb->deflate_lock);
>   release_pages_balloon(vb, &pages);
> - mutex_unlock(&vb->balloon_lock);
> - return num_freed_pages;
> + atomic64_sub(num_pfns, &vb->num_pages);

Isn't this subtraction too late?

> +
> + return num_pfns;
>  }
>  
>  static inline void update_stat(struct virtio_balloon *vb, int idx,

> @@ -465,6 +464,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>   struct virtio_balloon *vb = container_of(vb_dev_info,
>   struct virtio_balloon, vb_dev_in

Re: [PATCH v1 2/3] virtio-balloon: deflate up to oom_pages on OOM

2017-10-21 Thread Tetsuo Handa
Michael S. Tsirkin wrote:
> On Fri, Oct 20, 2017 at 07:54:25PM +0800, Wei Wang wrote:
> > The current implementation only deflates 256 pages even when a user
> > specifies more than that via the oom_pages module param. This patch
> > enables the deflating of up to oom_pages pages if there are enough
> > inflated pages.
> 
> This seems reasonable. Does this by itself help?

At least

> > -   num_freed_pages = leak_balloon(vb, oom_pages);
> > +
> > +   /* Don't deflate more than the number of inflated pages */
> > +   while (npages && atomic64_read(&vb->num_pages))
> > +   npages -= leak_balloon(vb, npages);

don't we need to abort if leak_balloon() returned 0 for some reason?


Re: [PATCH 0/8] CaitSith LSM module

2017-10-21 Thread Tetsuo Handa
Tetsuo Handa wrote:
> John Johansen wrote:
> > On 05/20/2017 09:59 PM, Tetsuo Handa wrote:
> > > John Johansen wrote:
> > >> On 11/22/2016 10:31 PM, Tetsuo Handa wrote:
> > >>> Tetsuo Handa wrote:
> > >>>> John Johansen wrote:
> > >>>>>> In order to minimize the burden of reviewing, this patchset implements
> > >>>>>> only functionality of checking program execution requests (i.e. execve()
> > >>>>>> system call) using pathnames. I'm planning to add other functionalities
> > >>>>>> after this version got included into mainline. You can find how future
> > >>>>>> versions of CaitSith will look like at http://caitsith.osdn.jp/ .
> > >>>>>
> > >>>>> Thanks I've started working my way through this, but it is going to take
> > >>>>> me a while.
> > >>>>>
> > >>>>
> > >>>> Thank you for your time.
> > >>>
> > >>> May I hear the status? Is there something I can do other than waiting?
> > >>>
> > >> progressing very slowly, I have some time over the next few days as its a
> > >> long weekend here in the US some hopefully I can finish this up
> > >>
> > > 
> > > May I hear the status again?
> > > 
> > Yes, sorry. I just haven't had time too look at it recently. I am sorry that
> > it has been so long. I am just going to have to book a day off and do it. I'll
> > see if I can't get a day next week (getting late but I can try or the following)
> 
> No problem. ;-) I assume reviewing a new module takes at least one year.
> Thank you for remembering.
> 

I'm still fighting OOM killer related problems in the MM subsystem. ;-)

As one year has elapsed since I proposed CaitSith for upstream, I'd like to
hear the status again. I looked at
http://schd.ws/hosted_files/lss2017/8b/201709-LinuxSecuritySummit-Stacking.pdf .
What is the ETA for Security Module Stacking? Is it half a year or so?

If it is likely to take longer, should I resume proposing CaitSith for now
as one of the "Minor modules", except with the security_module_enable() check
added, until the Security Module Stacking work completes? Or should I wait for
the stacking work to complete? I want to know, for recent proposals have
rather been staying silent.

Regards.


Re: [PATCH] mm,page_alloc: Serialize out_of_memory() and allocation stall messages.

2017-10-20 Thread Tetsuo Handa
Michal Hocko wrote:
> On Thu 19-10-17 19:51:02, Tetsuo Handa wrote:
> > The printk() flooding problem caused by concurrent warn_alloc() calls was
> > already pointed out by me, and there are reports of soft lockups caused by
> > warn_alloc(). But this problem is left unhandled because Michal does not
> > like serialization from allocation path because he is worrying about
> > unexpected side effects and is asking to identify the root cause of soft
> > lockups and fix it. But at least consuming CPU resource by not serializing
> > concurrent printk() plays some role in the soft lockups, for currently
> > printk() can consume CPU resource forever as long as somebody is appending
> > to printk() buffer, and writing to consoles also needs CPU resource. That
> > is, needlessly consuming CPU resource when calling printk() has unexpected
> > side effects.
> > 
> > Although a proposal for offloading writing to consoles to a dedicated
> > kernel thread is in progress, it is not yet accepted. And, even after
> > the proposal is accepted, writing to printk() buffer faster than the
> > kernel thread can write to consoles will result in loss of messages.
> > We should refrain from "appending to printk() buffer" and "consuming CPU
> > resource" at the same time if possible. We should try to (and we can)
> > avoid appending to printk() buffer when printk() is concurrently called
> > for reporting the OOM killer and allocation stalls, in order to reduce
> > possibility of hitting soft lockups and getting unreadably-jumbled
> > messages.
> > 
> > Although avoid mixing both memory allocation stall/failure messages and
> > the OOM killer messages would be nice, oom_lock mutex should not be used
> > for this purpose, for waiting for oom_lock mutex at warn_alloc() can
> > prevent anybody from calling out_of_memory() from __alloc_pages_may_oom()
> > because currently __alloc_pages_may_oom() does not wait for oom_lock
> > (i.e. causes OOM lockups after all). Therefore, this patch adds a mutex
> > named "oom_printk_lock". Although using mutex_lock() in order to allow
> > printk() to use CPU resource for writing to consoles is better from the
> > point of view of flushing printk(), this patch uses mutex_trylock() for
> > allocation stall messages because Michal does not like serialization.
> 
> Hell no! I've tried to be patient with you but it seems that is just
> pointless waste of time. Such an approach is absolutely not acceptable.
> You are adding an additional lock dependency into the picture. Say that
> there is somebody stuck in warn_alloc path and cannot make a further
> progress because printk got stuck. Now you are blocking oom_kill_process
> as well. So the cure might be even worse than the problem.

Sigh... printk() can't get stuck unless somebody continues appending to
printk() buffer. Otherwise, printk() cannot be used from arbitrary context.

You had better stop calling printk() with oom_lock held if you consider that
printk() can get stuck.

I will say "Say that there is somebody stuck in oom_kill_process() path and
cannot make a further progress because printk() got stuck. Now you are keeping
the mutex_trylock(&oom_lock) thread who invoked the OOM killer defunctional by
forcing the !mutex_trylock(&oom_lock) threads to keep calling warn_alloc().
So calling warn_alloc() might be even worse than not calling warn_alloc()."
This is known as what we call printk() v.s. oom_lock deadlock which I can
observe with my stress tests.

If somebody continues appending to the printk() buffer, such a user has to be
fixed. And it is warn_alloc() which continues appending to the printk() buffer.
This patch is for breaking the printk() continuation dependency by isolating
each thread's transaction. Although this patch introduces a lock dependency, it
is meant to mitigate the printk() vs. oom_lock deadlock described above. (I said
"mitigate" rather than "remove", for other printk() sources, if any, could still
preserve the printk() vs. oom_lock deadlock.)

-- Pseudo code start --
Before warn_alloc() was introduced:

  retry:
if (mutex_trylock(&oom_lock)) {
  while (atomic_read(&printk_pending_logs) > 0) {
atomic_dec(&printk_pending_logs);
print_one_log();
  }
  // Send SIGKILL here.
  mutex_unlock(&oom_lock)
}
goto retry;

After warn_alloc() was introduced:

  retry:
if (mutex_trylock(&oom_lock)) {
  while (atomic_read(&printk_pending_logs) > 0) {
atomic_dec(&printk_pending_logs);
print_one_log();
  }
  // Send SIGKILL here.
  mutex_unlock(&oom_lock)
} else if (waited_for_10seconds()) {
  atomic_inc(&printk_

Re: [PATCH] tomoyo: fix timestamping for y2038

2017-10-19 Thread Tetsuo Handa
Arnd Bergmann wrote:
> Tomoyo uses an open-coded version of time_to_tm() to create a timestamp
> from the current time as read by get_seconds(). This will overflow and
> give wrong results on 32-bit systems in 2038.
> 
> To correct this, this changes the code to use ktime_get_real_seconds()
> and the generic time64_to_tm() function that are both y2038-safe.
> Using the library function avoids adding an expensive 64-bit division
> in this code and can benefit from any optimizations we do in common
> code.
> 
> Signed-off-by: Arnd Bergmann 
> ---
>  security/tomoyo/audit.c  |  2 +-
>  security/tomoyo/common.c |  4 ++--
>  security/tomoyo/common.h |  2 +-
>  security/tomoyo/util.c   | 39 +--
>  4 files changed, 13 insertions(+), 34 deletions(-)

Thank you.

Please fold the diff below into your patch, for the year calculation is wrong.

  #0047/10/19 20:08:17# profile=1 mode=learning granted=no (global-pid=1) 
task={ pid=1 ppid=0 uid=0 gid=0 euid=0 egid=0 suid=0 sgid=0 fsuid=0 fsgid=0 } 
path1={ uid=0 gid=0 ino=639202 major=8 minor=1 perm=0755 type=file } 
path1.parent={ uid=0 gid=0 ino=155 perm=0755 } exec={ 
realpath="/usr/lib/systemd/systemd" argc=5 envc=0 argv[]={ 
"/usr/lib/systemd/systemd" "--switched-root" "--system" "--deserialize" "21" } 
envp[]={ } }

--- a/security/tomoyo/util.c
+++ b/security/tomoyo/util.c
@@ -96,7 +96,7 @@ void tomoyo_convert_time(time64_t time64, struct tomoyo_time 
*stamp)
stamp->hour = tm.tm_hour;
stamp->day = tm.tm_mday;
stamp->month = tm.tm_mon + 1;
-   stamp->year = tm.tm_year - (1970 - 1900);
+   stamp->year = tm.tm_year + 1900;
 }
 
 /**

Then, you can add

Acked-by: Tetsuo Handa 


[PATCH] mm,page_alloc: Serialize out_of_memory() and allocation stall messages.

2017-10-19 Thread Tetsuo Handa
I already pointed out the printk() flooding problem caused by concurrent
warn_alloc() calls, and there are reports of soft lockups caused by
warn_alloc(). But this problem is left unhandled because Michal does not like
serialization in the allocation path: he is worried about unexpected side
effects and is asking to identify the root cause of the soft lockups and fix
it. But at least the CPU time consumed by not serializing concurrent printk()
calls plays some role in the soft lockups, for printk() can currently consume
CPU time forever as long as somebody keeps appending to the printk() buffer,
and writing to consoles also needs CPU time. That is, needlessly consuming CPU
time when calling printk() already has unexpected side effects.

Although a proposal for offloading writing to consoles to a dedicated
kernel thread is in progress, it has not yet been accepted. And even after
the proposal is accepted, writing to the printk() buffer faster than the
kernel thread can write to consoles will result in loss of messages.
We should refrain from "appending to the printk() buffer" and "consuming CPU
time" at the same time if possible. We should try to (and we can)
avoid appending to the printk() buffer when printk() is concurrently called
for reporting the OOM killer and allocation stalls, in order to reduce the
possibility of hitting soft lockups and getting unreadably-jumbled
messages.

Although avoiding mixing memory allocation stall/failure messages and
the OOM killer messages would be nice, the oom_lock mutex should not be used
for this purpose, for waiting for the oom_lock mutex in warn_alloc() can
prevent anybody from calling out_of_memory() from __alloc_pages_may_oom(),
because currently __alloc_pages_may_oom() does not wait for oom_lock
(i.e. it causes OOM lockups after all). Therefore, this patch adds a mutex
named "oom_printk_lock". Although using mutex_lock() in order to allow
printk() to use CPU time for writing to consoles would be better from the
point of view of flushing printk(), this patch uses mutex_trylock() for
allocation stall messages because Michal does not like serialization.

Signed-off-by: Tetsuo Handa 
Reported-by: Cong Wang 
Reported-by: yuwang.yuwang 
Reported-by: Johannes Weiner 
Cc: Sergey Senozhatsky 
Cc: Petr Mladek 
Cc: Michal Hocko 
---
 include/linux/oom.h | 1 +
 mm/oom_kill.c   | 5 +
 mm/page_alloc.c | 4 +++-
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 76aac4c..1425767 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -44,6 +44,7 @@ struct oom_control {
 };
 
 extern struct mutex oom_lock;
+extern struct mutex oom_printk_lock;
 
 static inline void set_current_oom_origin(void)
 {
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 26add8a..5aef9a6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -54,6 +54,7 @@
 int sysctl_oom_dump_tasks = 1;
 
 DEFINE_MUTEX(oom_lock);
+DEFINE_MUTEX(oom_printk_lock);
 
 #ifdef CONFIG_NUMA
 /**
@@ -1074,7 +1075,9 @@ bool out_of_memory(struct oom_control *oc)
current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
get_task_struct(current);
oc->chosen = current;
+   mutex_lock(&oom_printk_lock);
oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
+   mutex_unlock(&oom_printk_lock);
return true;
}
 
@@ -1085,8 +1088,10 @@ bool out_of_memory(struct oom_control *oc)
panic("Out of memory and no killable processes...\n");
}
if (oc->chosen && oc->chosen != (void *)-1UL) {
+   mutex_lock(&oom_printk_lock);
oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
 "Memory cgroup out of memory");
+   mutex_unlock(&oom_printk_lock);
/*
 * Give the killed process a good chance to exit before trying
 * to allocate memory again.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 97687b3..3766687 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3990,11 +3990,13 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
goto nopage;
 
/* Make sure we know about allocations which stall for too long */
-   if (time_after(jiffies, alloc_start + stall_timeout)) {
+   if (time_after(jiffies, alloc_start + stall_timeout) &&
+   mutex_trylock(&oom_printk_lock)) {
warn_alloc(gfp_mask & ~__GFP_NOWARN, ac->nodemask,
"page allocation stalls for %ums, order:%u",
jiffies_to_msecs(jiffies-alloc_start), order);
stall_timeout += 10 * HZ;
+   mutex_unlock(&oom_printk_lock);
}
 
/* Avoid recursion of direct reclaim */
-- 
1.8.3.1


Re: [PATCH] virtio_balloon: fix deadlock on OOM

2017-10-13 Thread Tetsuo Handa
Michael S. Tsirkin wrote:
> This is a replacement for
>   [PATCH] virtio: avoid possible OOM lockup at virtballoon_oom_notify()
> but unlike that patch it actually deflates on oom even in presence of
> lock contention.

But Wei Wang is proposing VIRTIO_BALLOON_F_SG which will try to allocate
memory, isn't he?

> 
>  drivers/virtio/virtio_balloon.c| 30 ++
>  include/linux/balloon_compaction.h | 38 
> +-
>  mm/balloon_compaction.c| 27 +--
>  3 files changed, 80 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index f0b3a0b..725e366 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -143,16 +143,14 @@ static void set_page_pfns(struct virtio_balloon *vb,
>  
>  static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  {
> - struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>   unsigned num_allocated_pages;
> + unsigned num_pfns;
> + struct page *page;
> + LIST_HEAD(pages);
>  
> - /* We can only do one array worth at a time. */
> - num = min(num, ARRAY_SIZE(vb->pfns));
> -

I don't think moving this min() to later is correct, for
"num" can be e.g. 1048576, can't it?

> - mutex_lock(&vb->balloon_lock);
> - for (vb->num_pfns = 0; vb->num_pfns < num;
> -  vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> - struct page *page = balloon_page_enqueue(vb_dev_info);
> + for (num_pfns = 0; num_pfns < num;
> +  num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> + struct page *page = balloon_page_alloc();
>  
>   if (!page) {
>   dev_info_ratelimited(&vb->vdev->dev,
> @@ -162,6 +160,22 @@ static unsigned fill_balloon(struct virtio_balloon *vb, 
> size_t num)
>   msleep(200);
>   break;
>   }
> +
> + balloon_page_push(&pages, page);
> + }

If balloon_page_alloc() did not fail, it will queue "num"
(e.g. 1048576) pages into pages list, won't it?

> +
> + /* We can only do one array worth at a time. */
> + num = min(num, ARRAY_SIZE(vb->pfns));
> +

Now we cap "num" to VIRTIO_BALLOON_ARRAY_PFNS_MAX (which is 256), but

> + mutex_lock(&vb->balloon_lock);
> +
> + vb->num_pfns = 0;
> +
> + while ((page = balloon_page_pop(&pages))) {

this loop will repeat for e.g. 1048576 times, and

> + balloon_page_enqueue(&vb->vb_dev_info, page);
> +
> + vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE;
> +

we increment vb->num_pfns e.g. 1048576 times, which will go beyond the
VIRTIO_BALLOON_ARRAY_PFNS_MAX array bound.

>   set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
>   vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
>   if (!virtio_has_feature(vb->vdev,


Re: [PATCH] vmalloc: back off only when the current task is OOM killed

2017-10-10 Thread Tetsuo Handa
Michal Hocko wrote:
> On Tue 10-10-17 21:47:02, Tetsuo Handa wrote:
> > I think that massive vmalloc() consumers should be (as well as massive
> > alloc_page() consumers) careful such that they will be chosen as first OOM
> > victim, for vmalloc() does not abort as soon as an OOM occurs.
> 
> No. This would require to spread those checks all over the place. That
> is why we have that logic inside the allocator which fails the
> allocation at certain point in time. Large/unbound/user controlled sized
> allocations from the kernel are always a bug and really hard one to
> protect from. It is simply impossible to know the intention.
> 
> > Thus, I used
> > set_current_oom_origin()/clear_current_oom_origin() when I demonstrated
> > "complete" depletion.
> 
> which was a completely artificial example as already mentioned.
> 
> > > I have tried to explain this is not really needed before but you keep
> > > insisting which is highly annoying. The patch as is is not harmful but
> > > it is simply _pointless_ IMHO.
> > 
> > Then, how can massive vmalloc() consumers become careful?
> > Explicitly use __vmalloc() and pass __GFP_NOMEMALLOC ?
> > Then, what about adding some comment like "Never try to allocate large
> > memory using plain vmalloc(). Use __vmalloc() with __GFP_NOMEMALLOC." ?
> 
> Come on! Seriously we do expect some competence from the code running in
> the kernel space. We do not really need to add a comment that you
> shouldn't shoot your head because it might hurt. Please try to focus on
> real issues. There are many of them to chase after...
> 
My understanding is that vmalloc() is provided for allocating large memory
where kmalloc() is difficult to satisfy. If we say "do not allocate large
memory with vmalloc() because large allocations from the kernel are always
a bug", it sounds like a denial of the raison d'être of vmalloc(). Strange...

But anyway, I am not bothered by vmalloc(). What I'm bothered by is the
warn_alloc() lockup. Please go ahead with the removal of
fatal_signal_pending() for vmalloc().


Re: [PATCH v16 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-10-10 Thread Tetsuo Handa
Wei Wang wrote:
> > And even if we could remove balloon_lock, you still cannot use
> > __GFP_DIRECT_RECLAIM at xb_set_page(). I think you will need to use
> > "whether it is safe to wait" flag from
> > "[PATCH] virtio: avoid possible OOM lockup at virtballoon_oom_notify()" .
> 
> Without the lock being held, why couldn't we use __GFP_DIRECT_RECLAIM at 
> xb_set_page()?

Because of the dependency shown below.

leak_balloon()
  xb_set_page()
xb_preload(GFP_KERNEL)
  kmalloc(GFP_KERNEL)
__alloc_pages_may_oom()
  Takes oom_lock
  out_of_memory()
blocking_notifier_call_chain()
  leak_balloon()
xb_set_page()
  xb_preload(GFP_KERNEL)
kmalloc(GFP_KERNEL)
  __alloc_pages_may_oom()
Fails to take oom_lock and loop forever

By the way, is xb_set_page() safe?
Sleeping in the kernel with preemption disabled is a bug, isn't it?
__radix_tree_preload() returns 0 with preemption disabled upon success.
xb_preload() disables preemption if __radix_tree_preload() fails.
Then, kmalloc() is called with preemption disabled, isn't it?
But xb_set_page() calls xb_preload(GFP_KERNEL) which might sleep with
preemption disabled.
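
For reference, a minimal sketch (an illustrative throwaway module, not the
xbitmap code) of the pattern being questioned, i.e. a GFP_KERNEL allocation,
which may sleep, issued while preemption is disabled; with
CONFIG_DEBUG_ATOMIC_SLEEP this triggers a might_sleep() splat:

#include <linux/module.h>
#include <linux/preempt.h>
#include <linux/slab.h>

static int __init sleep_in_atomic_demo_init(void)
{
	void *p;

	preempt_disable();
	/* GFP_KERNEL may sleep; doing this with preemption disabled is the
	 * "sleeping while atomic" class of bug described above. In such a
	 * context GFP_NOWAIT/GFP_ATOMIC would have to be used instead. */
	p = kmalloc(128, GFP_KERNEL);
	preempt_enable();
	kfree(p);
	return 0;
}

static void __exit sleep_in_atomic_demo_exit(void)
{
}

module_init(sleep_in_atomic_demo_init);
module_exit(sleep_in_atomic_demo_exit);
MODULE_LICENSE("GPL");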


Re: [PATCH] vmalloc: back off only when the current task is OOM killed

2017-10-10 Thread Tetsuo Handa
Michal Hocko wrote:
> On Tue 10-10-17 19:58:53, Tetsuo Handa wrote:
> > Commit 5d17a73a2ebeb8d1 ("vmalloc: back off when the current task is
> > killed") revealed two bugs [1] [2] that were not ready to fail vmalloc()
> > upon SIGKILL. But since the intent of that commit was to avoid unlimited
> > access to memory reserves, we should have checked tsk_is_oom_victim()
> > rather than fatal_signal_pending().
> > 
> > Note that even with commit cd04ae1e2dc8e365 ("mm, oom: do not rely on
> > TIF_MEMDIE for memory reserves access"), it is possible to trigger
> > "complete depletion of memory reserves"
> 
> How would that be possible? OOM victims are not allowed to consume whole
> reserves and the vmalloc context would have to do something utterly
> wrong like PF_MEMALLOC to make this happen. Protecting from such a code
> is simply pointless.

Oops. I was confused when writing that part.
Indeed, "complete" was demonstrated without commit cd04ae1e2dc8e365.

> 
> > and "extra OOM kills due to depletion of memory reserves"
> 
> and this is simply the case for the most vmalloc allocations because
> they are not reflected in the oom selection so if there is a massive
> vmalloc consumer it is very likely that we will kill a large part the
> userspace before hitting the user context on behalf which the vmalloc
> allocation is performed.

If there is a massive alloc_page() loop, it is likewise very likely that
we will kill a large part of userspace before hitting the user context
on behalf of which the alloc_page() allocation is performed.

I think that massive vmalloc() consumers should be (as well as massive
alloc_page() consumers) careful such that they will be chosen as first OOM
victim, for vmalloc() does not abort as soon as an OOM occurs. Thus, I used
set_current_oom_origin()/clear_current_oom_origin() when I demonstrated
"complete" depletion.

> 
> I have tried to explain this is not really needed before but you keep
> insisting which is highly annoying. The patch as is is not harmful but
> it is simply _pointless_ IMHO.

Then, how can massive vmalloc() consumers become careful?
Explicitly use __vmalloc() and pass __GFP_NOMEMALLOC ?
Then, what about adding some comment like "Never try to allocate large
memory using plain vmalloc(). Use __vmalloc() with __GFP_NOMEMALLOC." ?
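
If such a comment were added, the suggested call would look roughly like this
(a hedged sketch using the 4.x-era three-argument __vmalloc() prototype;
"size" and the error handling are illustrative):

	/* Large, possibly user-controlled allocation: avoid dipping into the
	 * memory reserves by passing __GFP_NOMEMALLOC explicitly. */
	buffer = __vmalloc(size, GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN,
			   PAGE_KERNEL);
	if (!buffer)
		return -ENOMEM;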


Re: [PATCH v16 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-10-10 Thread Tetsuo Handa
Wei Wang wrote:
> On 10/09/2017 11:20 PM, Michael S. Tsirkin wrote:
> > On Sat, Sep 30, 2017 at 12:05:52PM +0800, Wei Wang wrote:
> >> +static inline void xb_set_page(struct virtio_balloon *vb,
> >> + struct page *page,
> >> + unsigned long *pfn_min,
> >> + unsigned long *pfn_max)
> >> +{
> >> +  unsigned long pfn = page_to_pfn(page);
> >> +
> >> +  *pfn_min = min(pfn, *pfn_min);
> >> +  *pfn_max = max(pfn, *pfn_max);
> >> +  xb_preload(GFP_KERNEL);
> >> +  xb_set_bit(&vb->page_xb, pfn);
> >> +  xb_preload_end();
> >> +}
> >> +
> > So, this will allocate memory
> >
> > ...
> >
> >> @@ -198,9 +327,12 @@ static unsigned leak_balloon(struct virtio_balloon 
> >> *vb, size_t num)
> >>struct page *page;
> >>struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
> >>LIST_HEAD(pages);
> >> +  bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
> >> +  unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
> >>   
> >> -  /* We can only do one array worth at a time. */
> >> -  num = min(num, ARRAY_SIZE(vb->pfns));
> >> +  /* Traditionally, we can only do one array worth at a time. */
> >> +  if (!use_sg)
> >> +  num = min(num, ARRAY_SIZE(vb->pfns));
> >>   
> >>mutex_lock(&vb->balloon_lock);
> >>/* We can't release more pages than taken */
> > And is sometimes called on OOM.
> >
> >
> > I suspect we need to
> >
> > 1. keep around some memory for leak on oom
> >
> > 2. for non oom allocate outside locks
> >
> >
> 
> I think maybe we can optimize the existing balloon logic, which could 
> remove the big balloon lock:
> 
> It would not be necessary to have the inflating and deflating run at the 
> same time.
> For example, 1st request to inflate 7G RAM, when 1GB has been given to 
> the host (so 6G left), the
> 2nd request to deflate 5G is received. Instead of waiting for the 1st 
> request to inflate 6G and then
> continuing with the 2nd request to deflate 5G, we can do a diff (6G to 
> inflate - 5G to deflate) immediately,
> and got 1G to inflate. In this way, all that driver will do is to simply 
> inflate another 1G.
> 
> Same for the OOM case: when OOM asks for 1G, while inflating 5G is in 
> progress, then the driver can
> deduct 1G from the amount that needs to inflate, and as a result, it 
> will inflate 4G.
> 
> In this case, we will never have the inflating and deflating task run at 
> the same time, so I think it is
> possible to remove the lock, and therefore, we will not have that 
> deadlock issue.
> 
> What would you guys think?

What is balloon_lock at virtballoon_migratepage() for?

  e22504296d4f64fb "virtio_balloon: introduce migration primitives to balloon pages"
  f68b992bbb474641 "virtio_balloon: fix race by fill and leak"

And even if we could remove balloon_lock, you still cannot use
__GFP_DIRECT_RECLAIM at xb_set_page(). I think you will need to use
"whether it is safe to wait" flag from
"[PATCH] virtio: avoid possible OOM lockup at virtballoon_oom_notify()" .


[PATCH] vmalloc: back off only when the current task is OOM killed

2017-10-10 Thread Tetsuo Handa
Commit 5d17a73a2ebeb8d1 ("vmalloc: back off when the current task is
killed") revealed two bugs [1] [2] that were not ready to fail vmalloc()
upon SIGKILL. But since the intent of that commit was to avoid unlimited
access to memory reserves, we should have checked tsk_is_oom_victim()
rather than fatal_signal_pending().

Note that even with commit cd04ae1e2dc8e365 ("mm, oom: do not rely on
TIF_MEMDIE for memory reserves access"), it is possible to trigger
"complete depletion of memory reserves" and "extra OOM kills due to
depletion of memory reserves" by doing a large vmalloc() request if commit
5d17a73a2ebeb8d1 is reverted. Thus, let's keep checking tsk_is_oom_victim()
rather than removing fatal_signal_pending().

  [1] http://lkml.kernel.org/r/42eb5d53-5ceb-a9ce-791a-9469af308...@i-love.sakura.ne.jp
  [2] http://lkml.kernel.org/r/20171003225504.ga...@cmpxchg.org

Fixes: 5d17a73a2ebeb8d1 ("vmalloc: back off when the current task is killed")
Cc: stable # 4.11+
Signed-off-by: Tetsuo Handa 
---
 mm/vmalloc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 8a43db6..6add29d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include <linux/oom.h>
 
 #include 
 #include 
@@ -1695,7 +1696,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
for (i = 0; i < area->nr_pages; i++) {
struct page *page;
 
-   if (fatal_signal_pending(current)) {
+   if (tsk_is_oom_victim(current)) {
area->nr_pages = i;
goto fail_no_warn;
}
-- 
1.8.3.1



Re: [PATCH v16 1/5] lib/xbitmap: Introduce xbitmap

2017-10-09 Thread Tetsuo Handa
On 2017/09/30 13:05, Wei Wang wrote:
>  /**
> + *  xb_preload - preload for xb_set_bit()
> + *  @gfp_mask: allocation mask to use for preloading
> + *
> + * Preallocate memory to use for the next call to xb_set_bit(). This function
> + * returns with preemption disabled. It will be enabled by xb_preload_end().
> + */
> +void xb_preload(gfp_t gfp)
> +{
> + if (__radix_tree_preload(gfp, XB_PRELOAD_SIZE) < 0)
> + preempt_disable();
> +
> + if (!this_cpu_read(ida_bitmap)) {
> + struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);
> +
> + if (!bitmap)
> + return;
> + bitmap = this_cpu_cmpxchg(ida_bitmap, NULL, bitmap);
> + kfree(bitmap);
> + }
> +}

I'm not sure whether this function is safe.

__radix_tree_preload() returns 0 with preemption disabled upon success.
xb_preload() disables preemption if __radix_tree_preload() fails.
Then, kmalloc() is called with preemption disabled, isn't it?
But xb_set_page() calls xb_preload(GFP_KERNEL) which might sleep...


Re: [PATCH 1/2] Revert "vmalloc: back off when the current task is killed"

2017-10-07 Thread Tetsuo Handa
Michal Hocko wrote:
> On Sat 07-10-17 13:05:24, Tetsuo Handa wrote:
> > Johannes Weiner wrote:
> > > On Sat, Oct 07, 2017 at 11:21:26AM +0900, Tetsuo Handa wrote:
> > > > On 2017/10/05 19:36, Tetsuo Handa wrote:
> > > > > I don't want this patch backported. If you want to backport,
> > > > > "s/fatal_signal_pending/tsk_is_oom_victim/" is the safer way.
> > > > 
> > > > If you backport this patch, you will see "complete depletion of memory 
> > > > reserves"
> > > > and "extra OOM kills due to depletion of memory reserves" using below 
> > > > reproducer.
> > > > 
> > > > --
> > > > #include 
> > > > #include 
> > > > #include 
> > > > 
> > > > static char *buffer;
> > > > 
> > > > static int __init test_init(void)
> > > > {
> > > > set_current_oom_origin();
> > > > buffer = vmalloc((1UL << 32) - 480 * 1048576);
> > > 
> > > That's not a reproducer, that's a kernel module. It's not hard to
> > > crash the kernel from within the kernel.
> > > 
> > 
> > When did we agree that "reproducer" is "userspace program" ?
> > A "reproducer" is a program that triggers something intended.
> 
> This way of argumentation is just ridiculous. I can construct whatever
> code to put kernel on knees and there is no way around it.

But you don't distinguish between a kernel module and a userspace program.
What you distinguish is "real" versus "theoretical". And the more you reject
with "ridiculous"/"theoretical", the more strongly I resist.

> 
> The patch in question was supposed to mitigate a theoretical problem
> while it caused a real issue seen out there. That is a reason to
> revert the patch. Especially when a better mitigation has been put
> in place. You are right that replacing fatal_signal_pending by
> tsk_is_oom_victim would keep the original mitigation in pre-cd04ae1e2dc8
> kernels but I would only agree to do that if the mitigated problem was
> real. And this doesn't seem to be the case. If any of the stable kernels
> regresses due to the revert I am willing to put a mitigation in place.

The real issue here is that the caller of vmalloc() was not ready to handle
allocation failure. We addressed the kmem_zalloc_greedy() case
( https://marc.info/?l=linux-mm&m=148844910724880 ) with 08b005f1333154ae
rather than by reverting fatal_signal_pending(). Removing
fatal_signal_pending() in order to hide real issues is a random hack.

>  
> > Year by year, people are spending efforts for kernel hardening.
> > It is silly to say that "It's not hard to crash the kernel from
> > within the kernel." when we can easily mitigate.
> 
> This is true but we do not spread random hacks around for problems that
> are not real and there are better ways to address them. In this
> particular case cd04ae1e2dc8 was a better way to address the problem in
> general without spreading tsk_is_oom_victim all over the place.

Using tsk_is_oom_victim() is reasonable for vmalloc() because it is a
memory allocation function which belongs to the memory management subsystem.

>  
> > Even with cd04ae1e2dc8, there is no point with triggering extra
> > OOM kills by needlessly consuming memory reserves.
> 
> Yet again you are making unfounded claims and I am really fed up
> arguing discussing that any further.

Kernel hardening changes are mostly addressing "theoretical" issues
but we don't call them "ridiculous".


Re: [PATCH 1/2] Revert "vmalloc: back off when the current task is killed"

2017-10-06 Thread Tetsuo Handa
Johannes Weiner wrote:
> On Sat, Oct 07, 2017 at 11:21:26AM +0900, Tetsuo Handa wrote:
> > On 2017/10/05 19:36, Tetsuo Handa wrote:
> > > I don't want this patch backported. If you want to backport,
> > > "s/fatal_signal_pending/tsk_is_oom_victim/" is the safer way.
> > 
> > If you backport this patch, you will see "complete depletion of memory 
> > reserves"
> > and "extra OOM kills due to depletion of memory reserves" using below 
> > reproducer.
> > 
> > --
> > #include 
> > #include 
> > #include 
> > 
> > static char *buffer;
> > 
> > static int __init test_init(void)
> > {
> > set_current_oom_origin();
> > buffer = vmalloc((1UL << 32) - 480 * 1048576);
> 
> That's not a reproducer, that's a kernel module. It's not hard to
> crash the kernel from within the kernel.
> 

When did we agree that "reproducer" is "userspace program" ?
A "reproducer" is a program that triggers something intended.

Year by year, people are spending effort on kernel hardening.
It is silly to say "It's not hard to crash the kernel from
within the kernel." when we can easily mitigate the problem.

Even with cd04ae1e2dc8, there is no point in triggering extra
OOM kills by needlessly consuming memory reserves.


Re: [PATCH 1/2] Revert "vmalloc: back off when the current task is killed"

2017-10-06 Thread Tetsuo Handa
On 2017/10/05 19:36, Tetsuo Handa wrote:
> I don't want this patch backported. If you want to backport,
> "s/fatal_signal_pending/tsk_is_oom_victim/" is the safer way.

If you backport this patch, you will see "complete depletion of memory reserves"
and "extra OOM kills due to depletion of memory reserves" using the reproducer
below.

--
#include <linux/module.h>
#include <linux/vmalloc.h>
#include <linux/oom.h>

static char *buffer;

static int __init test_init(void)
{
set_current_oom_origin();
buffer = vmalloc((1UL << 32) - 480 * 1048576);
clear_current_oom_origin();
return buffer ? 0 : -ENOMEM;
}

static void test_exit(void)
{
vfree(buffer);
}

module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");
--

--
CentOS Linux 7 (Core)
Kernel 4.13.5+ on an x86_64

ccsecurity login: [   53.637666] test: loading out-of-tree module taints kernel.
[   53.856166] insmod invoked oom-killer: 
gfp_mask=0x14002c2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_NOWARN), nodemask=(null),  
order=0, oom_score_adj=0
[   53.858754] insmod cpuset=/ mems_allowed=0
[   53.859713] CPU: 1 PID: 2763 Comm: insmod Tainted: G   O4.13.5+ 
#10
[   53.861134] Hardware name: VMware, Inc. VMware Virtual Platform/440BX 
Desktop Reference Platform, BIOS 6.00 07/02/2015
[   53.863072] Call Trace:
[   53.863548]  dump_stack+0x4d/0x6f
[   53.864172]  dump_header+0x92/0x22a
[   53.864869]  ? has_ns_capability_noaudit+0x30/0x40
[   53.865887]  oom_kill_process+0x250/0x440
[   53.866644]  out_of_memory+0x10d/0x480
[   53.867343]  __alloc_pages_nodemask+0x1087/0x1140
[   53.868216]  alloc_pages_current+0x65/0xd0
[   53.869086]  __vmalloc_node_range+0x129/0x230
[   53.869895]  vmalloc+0x39/0x40
[   53.870472]  ? test_init+0x26/0x1000 [test]
[   53.871248]  test_init+0x26/0x1000 [test]
[   53.871993]  ? 0xa00fa000
[   53.872609]  do_one_initcall+0x4d/0x190
[   53.873301]  do_init_module+0x5a/0x1f7
[   53.873999]  load_module+0x2022/0x2960
[   53.874678]  ? vfs_read+0x116/0x130
[   53.875312]  SyS_finit_module+0xe1/0xf0
[   53.876074]  ? SyS_finit_module+0xe1/0xf0
[   53.876806]  do_syscall_64+0x5c/0x140
[   53.877488]  entry_SYSCALL64_slow_path+0x25/0x25
[   53.878316] RIP: 0033:0x7f1b27c877f9
[   53.878964] RSP: 002b:7552e718 EFLAGS: 0206 ORIG_RAX: 
0139
[   53.880620] RAX: ffda RBX: 00a2d210 RCX: 7f1b27c877f9
[   53.881883] RDX:  RSI: 0041a678 RDI: 0003
[   53.883167] RBP: 0041a678 R08:  R09: 7552e8b8
[   53.884685] R10: 0003 R11: 0206 R12: 
[   53.885949] R13: 00a2d1e0 R14:  R15: 
[   53.887392] Mem-Info:
[   53.887909] active_anon:14248 inactive_anon:2088 isolated_anon:0
[   53.887909]  active_file:4 inactive_file:2 isolated_file:2
[   53.887909]  unevictable:0 dirty:3 writeback:2 unstable:0
[   53.887909]  slab_reclaimable:2818 slab_unreclaimable:4420
[   53.887909]  mapped:453 shmem:2162 pagetables:1676 bounce:0
[   53.887909]  free:21418 free_pcp:0 free_cma:0
[   53.895172] Node 0 active_anon:56992kB inactive_anon:8352kB active_file:12kB 
inactive_file:12kB unevictable:0kB isolated(anon):0kB isolated(file):8kB 
mapped:1812kB dirty:12kB writeback:8kB shmem:8648kB shmem_thp: 0kB 
shmem_pmdmapped: 0kB anon_thp: 6144kB writeback_tmp:0kB unstable:0kB 
all_unreclaimable? no
[   53.901844] Node 0 DMA free:14932kB min:284kB low:352kB high:420kB 
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB 
kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB 
free_cma:0kB
[   53.907765] lowmem_reserve[]: 0 2703 3662 3662
[   53.909333] Node 0 DMA32 free:53424kB min:49684kB low:62104kB high:74524kB 
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB writepending:0kB present:3129216kB managed:2790292kB 
mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB 
local_pcp:0kB free_cma:0kB
[   53.915597] lowmem_reserve[]: 0 0 958 958
[   53.916992] Node 0 Normal free:17192kB min:17608kB low:22008kB high:26408kB 
active_anon:56992kB inactive_anon:8352kB active_file:12kB inactive_file:12kB 
unevictable:0kB writepending:20kB present:1048576kB managed:981384kB 
mlocked:0kB kernel_stack:3648kB pagetables:6704kB bounce:0kB free_pcp:112kB 
local_pcp:0kB free_cma:0kB
[   53.924610] lowmem_reserve[]: 0 0 0 0
[   53.926131] Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (U) 0*32kB 1*64kB (U) 0*128kB 
0*256kB 1*512kB (U) 0*1024kB 1*2048kB (M) 3*4096kB (M) = 14932kB
[   53.929273] Node 0 DMA32: 4*4kB (UM) 2*8kB (UM) 5*16kB (UM) 4*32kB (M) 
3*64kB (M) 4*128kB (M) 5*256kB (UM) 4*512kB (M) 4*1024kB (UM) 2*2048kB (UM) 
10*4096kB (M) = 53424kB
[   53.934010] Node 0 Normal: 896*4kB (ME) 466*8kB (UME) 288*16kB (UME) 
128*32kB (UME) 23*64kB (UM) 0*12

Re: [lockdep] b09be676e0 BUG: unable to handle kernel NULL pointer dereference at 000001f2

2017-10-05 Thread Tetsuo Handa
Josh Poimboeuf wrote:
> On Wed, Oct 04, 2017 at 06:44:50AM +0900, Tetsuo Handa wrote:
> > Josh Poimboeuf wrote:
> > > On Tue, Oct 03, 2017 at 11:28:15AM -0500, Josh Poimboeuf wrote:
> > > > There are two bugs:
> > > > 
> > > > 1) Somebody -- presumably lockdep -- is corrupting the stack.  Need the
> > > >lockdep people to look at that.
> > > > 
> > > > 2) The 32-bit FP unwinder isn't handling the corrupt stack very well,
> > > >It's blindly dereferencing untrusted data:
> > > > 
> > > > /* Is the next frame pointer an encoded pointer to pt_regs? */
> > > > regs = decode_frame_pointer(next_bp);
> > > > if (regs) {
> > > > frame = (unsigned long *)regs;
> > > > len = regs_size(regs);
> > > > state->got_irq = true;
> > > > 
> > > >   On 32-bit, regs_size() dereferences the regs pointer before we know it
> > > >   points to a valid stack.  I'll fix that, along with the other unwinder
> > > >   improvements I discussed with Linus.
> > > 
> > > Tetsuo and/or Fengguang,
> > > 
> > > Would you mind testing with this patch?  It should at least prevent the
> > > unwinder panic and should hopefully print a useful unwinder dump
> > > instead.
> > > 
> > Here are two outputs.
> 
> Tetsuo, would you mind trying the following patch?
> 
Here are two outputs. Same kernel with different host hardware.

[6.406040] io scheduler noop registered
[6.406971] io scheduler deadline registered (default)
[6.432992] io scheduler cfq registered
[6.433275] io scheduler mq-deadline registered
[6.433905] io scheduler kyber registered
[6.557705] WARNING: kernel stack regs at f60bbe5c in swapper:1 has bad 'bp' 
value 0001
[6.557705] unwind stack type:0 next_sp:  (null) mask:0x2 graph_idx:0
[6.557705] f60bbd04: f60bbd54 (0xf60bbd54)
[6.557705] f60bbd08: c1020d6f (__save_stack_trace+0x6f/0xd0)
[6.557705] f60bbd0c: f60bbd54 (0xf60bbd54)
[6.557705] f60bbd10: 000c8040 (0xc8040)
[6.557705] f60bbd14:  ...
[6.557705] f60bbd18: f60ba000 (0xf60ba000)
[6.557705] f60bbd1c: f60bc000 (0xf60bc000)
[6.557705] f60bbd20:  ...
[6.557705] f60bbd24: 0002 (0x2)
[6.557705] f60bbd28: f60c8040 (0xf60c8040)
[6.557705] f60bbd2c:  ...
[6.557705] f60bbd30: 0101 (0x101)
[6.557705] f60bbd34:  ...
[6.557705] f60bbd38: f60bbd04 (0xf60bbd04)
[6.557705] f60bbd3c: c133a73d (atomic64_add_unless_cx8+0x21/0x38)
[6.557705] f60bbd40: f60bbe5c (0xf60bbe5c)
[6.557705] f60bbd44: 08e6d238 (0x8e6d238)
[6.557705] f60bbd48: 000371e0 (0x371e0)
[6.557705] f60bbd4c: d3eb9d15 (0xd3eb9d15)
[6.557705] f60bbd50: f60c8040 (0xf60c8040)
[6.557705] f60bbd54: f60bbd60 (0xf60bbd60)
[6.557705] f60bbd58: c1020dea (save_stack_trace+0x1a/0x20)
[6.557705] f60bbd5c:  ...
[6.557705] f60bbd60: f60bbdc4 (0xf60bbdc4)
[6.557705] f60bbd64: c1072676 (__lock_acquire+0xb56/0x1110)
[6.557705] f60bbd68:  ...
[6.557705] f60bbd6c: c1d86c44 (tk_core+0x4/0xf8)
[6.557705] f60bbd70: f60bbd8c (0xf60bbd8c)
[6.557705] f60bbd74: c106ee3c (find_held_lock+0x2c/0xa0)
[6.557705] f60bbd78: 0001 (0x1)
[6.557705] f60bbd7c: f60c8438 (0xf60c8438)
[6.557705] f60bbd80: f60c8040 (0xf60c8040)
[6.557705] f60bbd84: c1d86c44 (tk_core+0x4/0xf8)
[6.557705] f60bbd88: d3eb9d15 (0xd3eb9d15)
[6.557705] f60bbd8c: c19b82f0 (chainhash_table+0x18ff0/0x2)
[6.557705] f60bbd90: 21c04619 (0x21c04619)
[6.557705] f60bbd94: 0001 (0x1)
[6.557705] f60bbd98: f60c8438 (0xf60c8438)
[6.557705] f60bbd9c: 63fc (0x63fc)
[6.557705] f60bbda0:  ...
[6.557705] f60bbda8: 08e6d238 (0x8e6d238)
[6.557705] f60bbdac: 00200046 (0x200046)
[6.557705] f60bbdb0: f60bbdb8 (0xf60bbdb8)
[6.557705] f60bbdb4: 08e6d238 (0x8e6d238)
[6.557705] f60bbdb8: f60c8040 (0xf60c8040)
[6.557705] f60bbdbc: c14e6670 (hrtimer_bases+0x10/0x100)
[6.557705] f60bbdc0:  ...
[6.557705] f60bbdc4: f60bbdf8 (0xf60bbdf8)
[6.557705] f60bbdc8: c107338a (lock_acquire+0x7a/0xa0)
[6.557705] f60bbdcc:  ...
[6.557705] f60bbdd0: 0001 (0x1)
[6.557705] f60bbdd4: 0001 (0x1)
[6.557705] f60bbdd8:  ...
[6.557705] f60bbddc: c1090469 (hrtimer_interrupt+0x39/0x1c0)
[6.557705] f60bbde0:  ...
[6.557705] f60bbde8: 00200046 (0x200046)
[6.557705] f60bbdec: c14e6660 (hrtimer_debug_descr+0x20/0x20)
[6.557705] f60bbdf0: 0003 (0x3)
[6.557705] f60bbdf4:  ...
[6.557705] f60bbdf8: f60bbe14 (0xf60bbe14)
[6.557705] f60bb

Re: [PATCH 1/2] Revert "vmalloc: back off when the current task is killed"

2017-10-05 Thread Tetsuo Handa
On 2017/10/05 16:57, Michal Hocko wrote:
> On Wed 04-10-17 19:18:21, Johannes Weiner wrote:
>> On Wed, Oct 04, 2017 at 03:32:45PM -0700, Andrew Morton wrote:
> [...]
>>> You don't think they should be backported into -stables?
>>
>> Good point. For this one, it makes sense to CC stable, for 4.11 and
>> up. The second patch is more of a fortification against potential
>> future issues, and probably shouldn't go into stable.
> 
> I am not against. It is true that the memory reserves depletion fix was
> theoretical because I haven't seen any real life bug. I would argue that
> the more robust allocation failure behavior is a stable candidate as
> well, though, because the allocation can fail regardless of the vmalloc
> revert. It is less likely but still possible.
> 

I don't want this patch backported. If you want to backport,
"s/fatal_signal_pending/tsk_is_oom_victim/" is the safer way.

On 2017/10/04 17:33, Michal Hocko wrote:
> Now that we have cd04ae1e2dc8 ("mm, oom: do not rely on TIF_MEMDIE for
> memory reserves access") the risk of the memory depletion is much
> smaller so reverting the above commit should be acceptable. 

Are you aware that stable kernels do not have cd04ae1e2dc8?

We added the fatal_signal_pending() check inside the read()/write() loop
because one read()/write() request could consume 2GB of kernel memory.

What if there is a kernel module which uses vmalloc(1GB) from some
ioctl() for a legitimate reason? You are going to allow such vmalloc()
calls to deplete the memory reserves completely.

On 2017/10/05 8:21, Johannes Weiner wrote:
> Generally, we should leave it to the page allocator to handle memory
> reserves, not annotate random alloc_page() callsites.

I disagree. Interrupting the loop as soon as possible is preferable.

Since we don't have __GFP_KILLABLE, we had to do the fatal_signal_pending()
check inside the read()/write() loop. Since vmalloc() resembles read()/write()
in the sense that it can consume gigabytes of memory, it is pointless to expect
the caller of vmalloc() to check tsk_is_oom_victim().

Again, checking tsk_is_oom_victim() inside the vmalloc() loop is better.
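
For comparison, the read()/write()-style pattern mentioned above looks roughly
like this (a hedged sketch; the function name and chunking are illustrative,
not an existing driver):

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/sched/signal.h>
#include <linux/uaccess.h>

static ssize_t copy_out_loop(char __user *ubuf, const char *kbuf, size_t len)
{
	size_t done = 0;

	while (done < len) {
		size_t chunk = min_t(size_t, len - done, PAGE_SIZE);

		/* Stop burning CPU for a task that is already being killed
		 * instead of grinding through a multi-gigabyte request. */
		if (fatal_signal_pending(current))
			return done ? done : -EINTR;
		if (copy_to_user(ubuf + done, kbuf + done, chunk))
			return done ? done : -EFAULT;
		done += chunk;
		cond_resched();
	}
	return done;
}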


Re: [PATCH 1/2] Revert "vmalloc: back off when the current task is killed"

2017-10-04 Thread Tetsuo Handa
Johannes Weiner wrote:
> On Thu, Oct 05, 2017 at 05:49:43AM +0900, Tetsuo Handa wrote:
> > On 2017/10/05 3:59, Johannes Weiner wrote:
> > > But the justification to make that vmalloc() call fail like this isn't
> > > convincing, either. The patch mentions an OOM victim exhausting the
> > > memory reserves and thus deadlocking the machine. But the OOM killer
> > > is only one, improbable source of fatal signals. It doesn't make sense
> > > to fail allocations preemptively with plenty of memory in most cases.
> > 
> > By the time the current thread reaches do_exit(), 
> > fatal_signal_pending(current)
> > should become false. As far as I can guess, the source of fatal signal will 
> > be
> > tty_signal_session_leader(tty, exit_session) which is called just before
> > tty_ldisc_hangup(tty, cons_filp != NULL) rather than the OOM killer. I don't
> > know whether it is possible to make fatal_signal_pending(current) true 
> > inside
> > do_exit() though...
> 
> It's definitely not the OOM killer, the memory situation looks fine
> when this happens. I didn't look closer where the signal comes from.
> 

Then, we could check tsk_is_oom_victim() instead of fatal_signal_pending().

> That said, we trigger this issue fairly easily. We tested the revert
> over night on a couple thousand machines, and it fixed the issue
> (whereas the control group still saw the crashes).
> 


Re: [PATCH 1/2] Revert "vmalloc: back off when the current task is killed"

2017-10-04 Thread Tetsuo Handa
On 2017/10/05 3:59, Johannes Weiner wrote:
> But the justification to make that vmalloc() call fail like this isn't
> convincing, either. The patch mentions an OOM victim exhausting the
> memory reserves and thus deadlocking the machine. But the OOM killer
> is only one, improbable source of fatal signals. It doesn't make sense
> to fail allocations preemptively with plenty of memory in most cases.

By the time the current thread reaches do_exit(), fatal_signal_pending(current)
should become false. As far as I can guess, the source of fatal signal will be
tty_signal_session_leader(tty, exit_session) which is called just before
tty_ldisc_hangup(tty, cons_filp != NULL) rather than the OOM killer. I don't
know whether it is possible to make fatal_signal_pending(current) true inside
do_exit() though...


Re: [lockdep] b09be676e0 BUG: unable to handle kernel NULL pointer dereference at 000001f2

2017-10-03 Thread Tetsuo Handa
Josh Poimboeuf wrote:
> On Tue, Oct 03, 2017 at 11:28:15AM -0500, Josh Poimboeuf wrote:
> > There are two bugs:
> > 
> > 1) Somebody -- presumably lockdep -- is corrupting the stack.  Need the
> >lockdep people to look at that.
> > 
> > 2) The 32-bit FP unwinder isn't handling the corrupt stack very well,
> >It's blindly dereferencing untrusted data:
> > 
> > /* Is the next frame pointer an encoded pointer to pt_regs? */
> > regs = decode_frame_pointer(next_bp);
> > if (regs) {
> > frame = (unsigned long *)regs;
> > len = regs_size(regs);
> > state->got_irq = true;
> > 
> >   On 32-bit, regs_size() dereferences the regs pointer before we know it
> >   points to a valid stack.  I'll fix that, along with the other unwinder
> >   improvements I discussed with Linus.
> 
> Tetsuo and/or Fengguang,
> 
> Would you mind testing with this patch?  It should at least prevent the
> unwinder panic and should hopefully print a useful unwinder dump
> instead.
> 
Here are two outputs.

[0.550226] ACPI: bus type PCI registered
[0.551118] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
[0.564798] PCI: PCI BIOS revision 2.10 entry at 0xfd54b, last bus=0
[0.565121] PCI: Using configuration type 1 for base access
[0.757556] ACPI: Added _OSI(Module Device)
[0.757720] ACPI: Added _OSI(Processor Device)
[0.758094] ACPI: Added _OSI(3.0 _SCP Extensions)
[0.758234] ACPI: Added _OSI(Processor Aggregator Device)
[1.203043] ACPI: Interpreter enabled
[1.205699] ACPI: (supports S0 S5)
[1.207264] ACPI: Using IOAPIC for interrupt routing
[1.209000] PCI: Using host bridge windows from ACPI; if necessary, use 
"pci=nocrs" and report a bug
[1.234458] ACPI: Enabled 16 GPEs in block 00 to 0F
[1.275000] WARNING: kernel stack regs at f60bb9c8 in swapper:1 has bad 'bp' 
value 0ba0
[1.275000] unwind stack type:0 next_sp:  (null) mask:0x2 graph_idx:0
[1.275000] f60bb9c8: 0bba19f6 (0xbba19f6)
[1.275000] f60bb9cc: 020d4bf6 (0x20d4bf6)
[1.275000] f60bb9d0: 0bba19c1 (0xbba19c1)
[1.275000] f60bb9d4: e1f6 (0xe1f6)
[1.275000] f60bb9d8:  ...
[1.275000] f60bb9dc: 0ba0 (0xba0)
[1.275000] f60bb9e0: 0bc000f6 (0xbc000f6)
[1.275000] f60bb9e4: 00f6 (0xf6)
[1.275000] f60bb9e8: 0200 (0x200)
[1.275000] f60bb9ec: 0c804000 (0xc804000)
[1.275000] f60bb9f0: 00f6 (0xf6)
[1.275000] f60bb9f4: 00010100 (0x10100)
[1.275000] f60bb9f8:  ...
[1.275000] f60bb9fc: 0bb9c800 (0xbb9c800)
[1.275000] f60bba00: 00f6 (0xf6)
[1.275000] f60bba04: 0bb9c800 (0xbb9c800)
[1.275000] f60bba08: e6d238f6 (0xe6d238f6)
[1.275000] f60bba0c: 00e12a08 (0xe12a08)
[1.275000] f60bba10: 78122b00 (0x78122b00)
[1.275000] f60bba14: 0c8040c7 (0xc8040c7)
[1.275000] f60bba18: 0bba25f6 (0xbba25f6)
[1.275000] f60bba1c: 020deaf6 (0x20deaf6)
[1.275000] f60bba20: 00c1 (0xc1)
[1.275000] f60bba24: 0bba8900 (0xbba8900)
[1.275000] f60bba28: 072696f6 (0x72696f6)
[1.275000] f60bba2c: 00c1 (0xc1)
[1.275000] f60bba30: 0008 (0x8)
[1.275000] f60bba34: f60bbae0 (0xf60bbae0)
[1.275000] f60bba38: f60bbe6c (0xf60bbe6c)
[1.275000] f60bba3c: 08e6d238 (0x8e6d238)
[1.275000] f60bba40: 0002 (0x2)
[1.275000] f60bba44: 0001 (0x1)
[1.275000] f60bba48: f6000690 (0xf6000690)
[1.275000] f60bba4c: 78122b68 (0x78122b68)
[1.275000] f60bba50: 9b43ccc7 (0x9b43ccc7)
[1.275000] f60bba54: a9bcbec1 (0xa9bcbec1)
[1.275000] f60bba58: 017b (0x17b)
[1.275000] f60bba5c: 0c845800 (0xc845800)
[1.275000] f60bba60: 005433f6 (0x5433f6)
[1.275000] f60bba64: 0100 (0x100)
[1.275000] f60bba68: f60bba00 (0xf60bba00)
[1.275000] f60bba6c: 00200046 (0x200046)
[1.275000] f60bba70: f60bba80 (0xf60bba80)
[1.275000] f60bba74: f60c8458 (0xf60c8458)
[1.275000] f60bba78: e6d2385a (0xe6d2385a)
[1.275000] f60bba7c: 0c804008 (0xc804008)
[1.275000] f60bba80: 4e6d28f6 (0x4e6d28f6)
[1.275000] f60bba84: 00c1 (0xc1)
[1.275000] f60bba88: 0bbabd00 (0xbbabd00)
[1.275000] f60bba8c: 0733aaf6 (0x733aaf6)
[1.275000] f60bba90: 00c1 (0xc1)
[1.275000] f60bba94: 0100 (0x100)
[1.275000] f60bba98: 0100 (0x100)
[1.275000] f60bba9c:  ...
[1.275000] f60bbaa0: 09d84200 (0x9d84200)
[1.275000] f60bbaa4: 00c1 (0xc1)
[1.275000] f60bbaa8:  ...
[1.275000] f60bbaac: 20004600 (0x20004600)
[1.275000] f60bbab0: 4e6d1800 (0x4e6d1800)
[1.275000] f60bbab4: ffc1 (0xffc1)
[1.275000] f60bbab8:  (0x)
[1.275000] f60bbabc: 0bbad97f (0xbbad97f)
[1.275000] f60bbac0: 341d83f6 (0x341d83f6)
[1.275000] f60bbac4: 00c1 (0xc1)
[1.275000] f60bbac8: 0100 (0x100)
[1.275000] f60bbacc:  ...
[1.275000] f60bbad0: 09d84200 (0x9d84200)
[1.275000] f60bbad4: 4da8c0c1 (0x4da8c0c1)
[1.275000] f60

Re: [4.14-rc1 x86] WARNING: kernel stack regs at f60bbb12 in swapper:1 has bad 'bp' value 0ba00000

2017-10-03 Thread Tetsuo Handa
Josh Poimboeuf wrote:
> On Tue, Oct 03, 2017 at 09:35:18AM -0500, Josh Poimboeuf wrote:
> > On Tue, Oct 03, 2017 at 10:44:13PM +0900, Tetsuo Handa wrote:
> > > Josh Poimboeuf wrote:
> > > 
> > > > On Tue, Oct 03, 2017 at 12:37:44PM +0200, Borislav Petkov wrote:
> > > > > On Tue, Oct 03, 2017 at 07:29:36PM +0900, Tetsuo Handa wrote:
> > > > > > Tetsuo Handa wrote:
> > > > > > > Tetsuo Handa wrote:
> > > > > > > > Tetsuo Handa wrote:
> > > > > > > > > I'm seeing below error between
> > > > > > > > > 4898b99c261efe32 ("Merge tag 'acpi-4.13-rc7' of 
> > > > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm")
> > > > > > > > >  (git bisect good (presumably))
> > > > > > > > > e6f3faa734a00c60 ("locking/lockdep: Fix workqueue 
> > > > > > > > > crossrelease annotation") (git bisect bad) on linux.git .
> > > > > > > > 
> > > > > > > > F.Y.I. This error remains as of 46c1e79fee417f15 ("Merge branch 
> > > > > > > > 'perf-urgent-for-linus' of
> > > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip") on 
> > > > > > > > linux.git .
> > > > > > > > 
> > > > > > > 
> > > > > > > This error still remains as of 6e80ecdddf4ea6f3 ("Merge branch 
> > > > > > > 'libnvdimm-fixes'
> > > > > > > of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm") 
> > > > > > > on linux.git .
> > > > > > > 
> > > > > > > I'm suspecting that this error is causing very unstable x86_32 
> > > > > > > kernel.
> > > > > > > It seems that this error occurs (though rare frequency) even on 
> > > > > > > x86_64 kernel.
> > > > > > > 
> > > > > > > Nobody cares?
> > > > > > > 
> > > > > > 4.14-rc3 still trivially panics due to this error. Is this problem 
> > > > > > known?
> > > > 
> > > > Can you try with the following patch?  It should hopefully give more
> > > > useful information in the dump.
> > > > 
> > > I see. Here is the result.
> > 
> > Hm, that's not what I expected to happen...  I suspect this is stack
> > corruption, with the result being slightly different every time.  Can
> > you see if this patch fixes the panic?
> 
> On second thought, I don't think that's the right fix.  But I do think
> what you're seeing is related to a lockdep issue:
> 
>   
> https://lkml.kernel.org/r/20171003140634.r2jzujgl62ox4...@wfg-t540p.sh.intel.com

Oh, very likely. That commit is in "git log 
4898b99c261efe32...e6f3faa734a00c60" range.

> 
> I'm not sure yet why it's breaking the unwinder so badly though.
> 
> -- 
> Josh
> 


Re: [4.14-rc1 x86] WARNING: kernel stack regs at f60bbb12 in swapper:1 has bad 'bp' value 0ba00000

2017-10-03 Thread Tetsuo Handa
Josh Poimboeuf wrote:
> On Tue, Oct 03, 2017 at 10:44:13PM +0900, Tetsuo Handa wrote:
> > Josh Poimboeuf wrote:
> > 
> > > On Tue, Oct 03, 2017 at 12:37:44PM +0200, Borislav Petkov wrote:
> > > > On Tue, Oct 03, 2017 at 07:29:36PM +0900, Tetsuo Handa wrote:
> > > > > Tetsuo Handa wrote:
> > > > > > Tetsuo Handa wrote:
> > > > > > > Tetsuo Handa wrote:
> > > > > > > > I'm seeing below error between
> > > > > > > > 4898b99c261efe32 ("Merge tag 'acpi-4.13-rc7' of 
> > > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm") 
> > > > > > > > (git bisect good (presumably))
> > > > > > > > e6f3faa734a00c60 ("locking/lockdep: Fix workqueue crossrelease 
> > > > > > > > annotation") (git bisect bad) on linux.git .
> > > > > > > 
> > > > > > > F.Y.I. This error remains as of 46c1e79fee417f15 ("Merge branch 
> > > > > > > 'perf-urgent-for-linus' of
> > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip") on 
> > > > > > > linux.git .
> > > > > > > 
> > > > > > 
> > > > > > This error still remains as of 6e80ecdddf4ea6f3 ("Merge branch 
> > > > > > 'libnvdimm-fixes'
> > > > > > of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm") on 
> > > > > > linux.git .
> > > > > > 
> > > > > > I'm suspecting that this error is causing very unstable x86_32 
> > > > > > kernel.
> > > > > > It seems that this error occurs (though rare frequency) even on 
> > > > > > x86_64 kernel.
> > > > > > 
> > > > > > Nobody cares?
> > > > > > 
> > > > > 4.14-rc3 still trivially panics due to this error. Is this problem 
> > > > > known?
> > > 
> > > Can you try with the following patch?  It should hopefully give more
> > > useful information in the dump.
> > > 
> > I see. Here is the result.
> 
> Hm, that's not what I expected to happen...  I suspect this is stack
> corruption, with the result being slightly different every time.  Can
> you see if this patch fixes the panic?

This patch did not fix the problem. But disabling CONFIG_PROVE_LOCKING seems
to avoid this problem. Since "git log 4898b99c261efe32...e6f3faa734a00c60"
range includes lockdep changes, this might be a lockdep problem.

--
# diff .config.old .config
2132c2132
< CONFIG_PROVE_LOCKING=y
---
> # CONFIG_PROVE_LOCKING is not set
2135,2136d2134
< CONFIG_LOCKDEP_CROSSRELEASE=y
< CONFIG_LOCKDEP_COMPLETIONS=y
2142d2139
< CONFIG_TRACE_IRQFLAGS=y
2157c2154
< CONFIG_PROVE_RCU=y
---
> # CONFIG_PROVE_RCU is not set
--

Maybe there is a bug in completion and/or crossrelease handling?


Re: [4.14-rc1 x86] WARNING: kernel stack regs at f60bbb12 inswapper:1 has bad 'bp' value 0ba00000

2017-10-03 Thread Tetsuo Handa
Josh Poimboeuf wrote:

> On Tue, Oct 03, 2017 at 12:37:44PM +0200, Borislav Petkov wrote:
> > On Tue, Oct 03, 2017 at 07:29:36PM +0900, Tetsuo Handa wrote:
> > > Tetsuo Handa wrote:
> > > > Tetsuo Handa wrote:
> > > > > Tetsuo Handa wrote:
> > > > > > I'm seeing below error between
> > > > > > 4898b99c261efe32 ("Merge tag 'acpi-4.13-rc7' of 
> > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm") 
> > > > > > (git bisect good (presumably))
> > > > > > e6f3faa734a00c60 ("locking/lockdep: Fix workqueue crossrelease 
> > > > > > annotation") (git bisect bad) on linux.git .
> > > > > 
> > > > > F.Y.I. This error remains as of 46c1e79fee417f15 ("Merge branch 
> > > > > 'perf-urgent-for-linus' of
> > > > > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip") on linux.git .
> > > > > 
> > > > 
> > > > This error still remains as of 6e80ecdddf4ea6f3 ("Merge branch 
> > > > 'libnvdimm-fixes'
> > > > of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm") on 
> > > > linux.git .
> > > > 
> > > > I'm suspecting that this error is causing very unstable x86_32 kernel.
> > > > It seems that this error occurs (though rare frequency) even on x86_64 
> > > > kernel.
> > > > 
> > > > Nobody cares?
> > > > 
> > > 4.14-rc3 still trivially panics due to this error. Is this problem known?
> 
> Can you try with the following patch?  It should hopefully give more
> useful information in the dump.
> 
I see. Here is the result.

# /usr/libexec/qemu-kvm -no-kvm -cpu kvm32 -smp 1 -m 2048 --no-reboot --kernel 
arch/x86/boot/bzImage --nographic --append "panic=1 console=ttyS0,115200n8"
[0.00] Linux version 4.14.0-rc3+ (root@ccsecurity) (gcc version 4.8.5 
20150623 (Red Hat 4.8.5-16) (GCC)) #154 Tue Oct 3 22:37:16 JST 2017
[0.00] x86/fpu: x87 FPU will use FXSAVE
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009fbff] usable
[0.00] BIOS-e820: [mem 0x0009fc00-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000f-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0x7fffdfff] usable
[0.00] BIOS-e820: [mem 0x7fffe000-0x7fff] reserved
[0.00] BIOS-e820: [mem 0xfffc-0x] reserved
[0.00] Notice: NX (Execute Disable) protection missing in CPU!
[0.00] random: fast init done
[0.00] SMBIOS 2.4 present.
[0.00] DMI: Red Hat KVM, BIOS 0.5.1 01/01/2011
[0.00] tsc: Unable to calibrate against PIT
[0.00] tsc: No reference (HPET/PMTIMER) available
[0.00] e820: last_pfn = 0x7fffe max_arch_pfn = 0x10
[0.00] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WC  UC- UC
[0.00] found SMP MP-table at [mem 0x000f7310-0x000f731f] mapped at 
[ffda2310]
[0.00] Scanning 1 areas for low memory corruption
[0.00] ACPI: Early table checksum verification disabled
[0.00] ACPI: RSDP 0x000F7170 14 (v00 BOCHS )
[0.00] ACPI: RSDT 0x7A9B 30 (v01 BOCHS  BXPCRSDT 
0001 BXPC 0001)
[0.00] ACPI: FACP 0x7177 74 (v01 BOCHS  BXPCFACP 
0001 BXPC 0001)
[0.00] ACPI: DSDT 0x7FFFE040 001137 (v01 BOCHS  BXPCDSDT 
0001 BXPC 0001)
[0.00] ACPI: FACS 0x7FFFE000 40
[0.00] ACPI: SSDT 0x71EB 000838 (v01 BOCHS  BXPCSSDT 
0001 BXPC 0001)
[0.00] ACPI: APIC 0x7A23 78 (v01 BOCHS  BXPCAPIC 
0001 BXPC 0001)
[0.00] 1160MB HIGHMEM available.
[0.00] 887MB LOWMEM available.
[0.00]   mapped low ram: 0 - 377fe000
[0.00]   low ram: 0 - 377fe000
[0.00] Zone ranges:
[0.00]   DMA  [mem 0x1000-0x00ff]
[0.00]   Normal   [mem 0x0100-0x377fdfff]
[0.00]   HighMem  [mem 0x377fe000-0x7fffdfff]
[0.00] Movable zone start for each node
[0.00] Early memory node ranges
[0.00]   node   0: [mem 0x1000-0x0009efff]
[0.00]   node   0: [mem 0x0010-0x7fffdfff]
[0.00] Initmem setup node 0 [mem 0x1000-0x7fffdfff]
[0.00] Using APIC driver default
[0.00] ACPI: PM-Timer IO Port: 0x608
[0.00] ACPI: LAPIC_NMI (acpi_id[0xff] dfl dfl lint[0x1])
[0.00] IOAPIC[0]: apic_id 0 already used, 

Re: [4.14-rc1 x86] WARNING: kernel stack regs at f60bbb12 in swapper:1 has bad 'bp' value 0ba00000

2017-10-03 Thread Tetsuo Handa
Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > Tetsuo Handa wrote:
> > > I'm seeing below error between
> > > 4898b99c261efe32 ("Merge tag 'acpi-4.13-rc7' of 
> > > git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm") (git 
> > > bisect good (presumably))
> > > e6f3faa734a00c60 ("locking/lockdep: Fix workqueue crossrelease 
> > > annotation") (git bisect bad) on linux.git .
> > 
> > F.Y.I. This error remains as of 46c1e79fee417f15 ("Merge branch 
> > 'perf-urgent-for-linus' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip") on linux.git .
> > 
> 
> This error still remains as of 6e80ecdddf4ea6f3 ("Merge branch 
> 'libnvdimm-fixes'
> of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm") on linux.git 
> .
> 
> I'm suspecting that this error is causing very unstable x86_32 kernel.
> It seems that this error occurs (though rare frequency) even on x86_64 kernel.
> 
> Nobody cares?
> 
4.14-rc3 still trivially panics due to this error. Is this problem known?


Re: 4.14-rc2 on thinkpad x220: out of memory when inserting mmc card

2017-10-02 Thread Tetsuo Handa
Michal Hocko wrote:
> On Sun 01-10-17 12:26:47, Pavel Machek wrote:
> > Hi!
> > 
> > > I inserted u-SD card, only to realize that it is not detected as it
> > > should be. And dmesg indeed reveals:
> > 
> > Tetsuo asked me to report this to linux-mm.
> > 
> > But 2^4 is 16 pages, IIRC that can't be expected to work reliably, and
> > thus this sounds like MMC bug, not mm bug.
> 
> Well, I cannot comment on why MMC needs such a large allocation and
> whether it can safely fall back to vmalloc but __GFP_RETRY_MAYFAIL
> might help to try harder and require compaction to do more work.
> Relying on that for correctness is, of course, a different story and
> a very unreliable under memory pressure or long term fragmented memory.

Linus Walleij answered that kvmalloc() is against the design of the bounce buffer at
http://lkml.kernel.org/r/cacrpkdyirc+rh_kalgvqkzmjq2dgbw4oi9mjkmrzwn+1o+9...@mail.gmail.com .


Re: [v8 0/4] cgroup-aware OOM killer

2017-10-02 Thread Tetsuo Handa
Shakeel Butt wrote:
> I think Tim has given very clear explanation why comparing A & D makes
> perfect sense. However I think the above example, a single user system
> where a user has designed and created the whole hierarchy and then
> attaches different jobs/applications to different nodes in this
> hierarchy, is also a valid scenario. One solution I can think of, to
> cater both scenarios, is to introduce a notion of 'bypass oom' or not
> include a memcg for oom comparision and instead include its children
> in the comparison.

I'm not catching up on this thread because I don't use memcg.
But if there are multiple scenarios, what about offloading memcg OOM
handling to loadable kernel modules (like the many filesystems which are
called through the VFS interface)? We could do trial and error more casually.


Re: 4.14-rc2 on thinkpad x220: out of memory when inserting mmc card

2017-10-01 Thread Tetsuo Handa
Pavel Machek wrote:
> Hi!
> 
> > I inserted u-SD card, only to realize that it is not detected as it
> > should be. And dmesg indeed reveals:
> 
> Tetsuo asked me to report this to linux-mm.
> 
> But 2^4 is 16 pages, IIRC that can't be expected to work reliably, and
> thus this sounds like MMC bug, not mm bug.

Yes, order-4 (16 pages) is a costly allocation which will fail without invoking
the OOM killer. But I thought this is an interesting case, for mempool
allocations should be able to handle memory allocation failure except for the
initial allocations, and it is the initial allocation that is failing.

I think that using kvmalloc() (and converting the corresponding kfree() to
kvfree()) will make the initial allocations succeed, but might that cause
subsequent mempool allocations to succeed needlessly under memory pressure?
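
The conversion I have in mind would look roughly like this (a hedged sketch;
"bouncesz" and the surrounding error handling are illustrative, not the actual
mmc queue code):

	/* Allocation site: kvmalloc() tries kmalloc() first and transparently
	 * falls back to vmalloc(), so the order-4 contiguous allocation is no
	 * longer a hard requirement. */
	buf = kvmalloc(bouncesz, GFP_KERNEL);
	if (!buf)
		return -ENOMEM;

	/* Matching release site: kvfree() frees either a kmalloc()ed or a
	 * vmalloc()ed buffer correctly. */
	kvfree(buf);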

> 
> > [10994.299846] mmc0: new high speed SDHC card at address 0003
> > [10994.302196] kworker/2:1: page allocation failure: order:4,
> > mode:0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=(null)
> > [10994.302212] CPU: 2 PID: 9500 Comm: kworker/2:1 Not tainted
> > 4.14.0-rc2 #135
> > [10994.302215] Hardware name: LENOVO 42872WU/42872WU, BIOS 8DET73WW
> > (1.43 ) 10/12/2016
> > [10994.30] Workqueue: events_freezable mmc_rescan
> > [10994.302227] Call Trace:
> > [10994.302233]  dump_stack+0x4d/0x67
> > [10994.302239]  warn_alloc+0xde/0x180
> > [10994.302243]  __alloc_pages_nodemask+0xaa4/0xd30
> > [10994.302249]  ? cache_alloc_refill+0xb73/0xc10
> > [10994.302252]  cache_alloc_refill+0x101/0xc10
> > [10994.302258]  ? mmc_init_request+0x2d/0xd0
> > [10994.302262]  ? mmc_init_request+0x2d/0xd0
> > [10994.302265]  __kmalloc+0xaf/0xe0
> > [10994.302269]  mmc_init_request+0x2d/0xd0
> > [10994.302273]  alloc_request_size+0x45/0x60
> > [10994.302276]  ? free_request_size+0x30/0x30
> > [10994.302280]  mempool_create_node+0xd7/0x130
> > [10994.302283]  ? alloc_request_simple+0x20/0x20
> > [10994.302287]  blk_init_rl+0xe8/0x110
> > [10994.302290]  blk_init_allocated_queue+0x70/0x180
> > [10994.302294]  mmc_init_queue+0xdd/0x370
> > [10994.302297]  mmc_blk_alloc_req+0xf6/0x340
> > [10994.302301]  mmc_blk_probe+0x18b/0x4e0
> > [10994.302305]  mmc_bus_probe+0x12/0x20
> > [10994.302309]  driver_probe_device+0x2f4/0x490
> > 
> > Order 4 allocations are not supposed to be reliable...
> > 
> > Any ideas?
> > 
> > Thanks,
> > Pavel
> > 
> 
> 
> 
> -- 
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) 
> http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: [PATCH 0/2 v8] oom: capture unreclaimable slab info in oom message

2017-09-30 Thread Tetsuo Handa
Yang Shi wrote:
> On 9/28/17 1:45 PM, Tetsuo Handa wrote:
> > Yang Shi wrote:
> >> On 9/28/17 12:57 PM, Tetsuo Handa wrote:
> >>> Yang Shi wrote:
> >>>> On 9/27/17 9:36 PM, Tetsuo Handa wrote:
> >>>>> On 2017/09/28 6:46, Yang Shi wrote:
> >>>>>> Changelog v7 -> v8:
> >>>>>> * Adopted Michal’s suggestion to dump unreclaim slab info when 
> >>>>>> unreclaimable slabs amount > total user memory. Not only in oom panic 
> >>>>>> path.
> >>>>>
> >>>>> Holding slab_mutex inside dump_unreclaimable_slab() was refrained since 
> >>>>> V2
> >>>>> because there are
> >>>>>
> >>>>> mutex_lock(&slab_mutex);
> >>>>> kmalloc(GFP_KERNEL);
> >>>>> mutex_unlock(&slab_mutex);
> >>>>>
> >>>>> users. If we call dump_unreclaimable_slab() for non OOM panic path, 
> >>>>> aren't we
> >>>>> introducing a risk of crash (i.e. kernel panic) for regular OOM path?
> >>>>
> >>>> I don't see the difference between regular oom path and oom path other
> >>>> than calling panic() at last.
> >>>>
> >>>> And, the slab dump may be called by panic path too, it is for both
> >>>> regular and panic path.
> >>>
> >>> Calling a function that might cause kerneloops immediately before calling 
> >>> panic()
> >>> would be tolerable, for the kernel will panic after all. But calling a 
> >>> function
> >>> that might cause kerneloops when there is no plan to call panic() is a 
> >>> bug.
> >>
> >> I got your point. slab_mutex is used to protect the list of all the
> >> slabs, since we are already in oom, there should be not kmem cache
> >> destroy happen during the list traverse. And, list_for_each_entry() has
> >> been replaced to list_for_each_entry_safe() to make the traverse more
> >> robust.
> > 
> > I consider that OOM event and kmem chache destroy event can run concurrently
> > because slab_mutex is not held by OOM event (and unfortunately cannot be 
> > held
> > due to possibility of deadlock) in order to protect the list of all the 
> > slabs.
> > 
> > I don't think replacing list_for_each_entry() with 
> > list_for_each_entry_safe()
> > makes the traverse more robust, for list_for_each_entry_safe() does not 
> > defer
> > freeing of memory used by list element. Rather, replacing 
> > list_for_each_entry()
> > with list_for_each_entry_rcu() (and making relevant changes such as
> > rcu_read_lock()/rcu_read_unlock()/synchronize_rcu()) will make the traverse 
> > safe.
> 
> I'm not sure if rcu could satisfy this case. rcu just can protect  
> slab_caches_to_rcu_destroy list, which is used by SLAB_TYPESAFE_BY_RCU  
> slabs.

I'm not sure why you are talking about SLAB_TYPESAFE_BY_RCU.
What I meant is that

  Upon registration:

// do initialize/setup stuff here
synchronize_rcu(); // <= for dump_unreclaimable_slab()
list_add_rcu(&kmem_cache->list, &slab_caches);

  Upon unregistration:

list_del_rcu(&kmem_cache->list);
synchronize_rcu(); // <= for dump_unreclaimable_slab()
// do finalize/cleanup stuff here

then (if my understanding is correct)

rcu_read_lock();
list_for_each_entry_rcu(s, &slab_caches, list) {
if (!is_root_cache(s) || (s->flags & SLAB_RECLAIM_ACCOUNT))
continue;

memset(&sinfo, 0, sizeof(sinfo));
get_slabinfo(s, &sinfo);

if (sinfo.num_objs > 0)
pr_info("%-17s %10luKB %10luKB\n", cache_name(s),
(sinfo.active_objs * s->size) / 1024,
(sinfo.num_objs * s->size) / 1024);
}
rcu_read_unlock();

will make dump_unreclaimable_slab() safe.


Re: [PATCH 0/2 v8] oom: capture unreclaimable slab info in oom message

2017-09-28 Thread Tetsuo Handa
Yang Shi wrote:
> On 9/28/17 12:57 PM, Tetsuo Handa wrote:
> > Yang Shi wrote:
> >> On 9/27/17 9:36 PM, Tetsuo Handa wrote:
> >>> On 2017/09/28 6:46, Yang Shi wrote:
> >>>> Changelog v7 -> v8:
> >>>> * Adopted Michal’s suggestion to dump unreclaim slab info when 
> >>>> unreclaimable slabs amount > total user memory. Not only in oom panic 
> >>>> path.
> >>>
> >>> Holding slab_mutex inside dump_unreclaimable_slab() was refrained since V2
> >>> because there are
> >>>
> >>>   mutex_lock(&slab_mutex);
> >>>   kmalloc(GFP_KERNEL);
> >>>   mutex_unlock(&slab_mutex);
> >>>
> >>> users. If we call dump_unreclaimable_slab() for non OOM panic path, 
> >>> aren't we
> >>> introducing a risk of crash (i.e. kernel panic) for regular OOM path?
> >>
> >> I don't see the difference between regular oom path and oom path other
> >> than calling panic() at last.
> >>
> >> And, the slab dump may be called by panic path too, it is for both
> >> regular and panic path.
> > 
> > Calling a function that might cause kerneloops immediately before calling 
> > panic()
> > would be tolerable, for the kernel will panic after all. But calling a 
> > function
> > that might cause kerneloops when there is no plan to call panic() is a bug.
> 
> I got your point. slab_mutex is used to protect the list of all the  
> slabs, since we are already in oom, there should be not kmem cache  
> destroy happen during the list traverse. And, list_for_each_entry() has  
> been replaced to list_for_each_entry_safe() to make the traverse more  
> robust.

I consider that an OOM event and a kmem cache destroy event can run
concurrently, because slab_mutex (which protects the list of all slabs) is not
held by the OOM event, and unfortunately cannot be held due to the possibility
of deadlock.

I don't think replacing list_for_each_entry() with list_for_each_entry_safe()
makes the traverse more robust, for list_for_each_entry_safe() does not defer
freeing of the memory used by the list element. Rather, replacing
list_for_each_entry() with list_for_each_entry_rcu() (and making relevant
changes such as rcu_read_lock()/rcu_read_unlock()/synchronize_rcu()) will make
the traverse safe.


Re: [PATCH 0/2 v8] oom: capture unreclaimable slab info in oom message

2017-09-28 Thread Tetsuo Handa
Yang Shi wrote:
> On 9/27/17 9:36 PM, Tetsuo Handa wrote:
> > On 2017/09/28 6:46, Yang Shi wrote:
> >> Changelog v7 -> v8:
> >> * Adopted Michal’s suggestion to dump unreclaim slab info when 
> >> unreclaimable slabs amount > total user memory. Not only in oom panic path.
> > 
> > Holding slab_mutex inside dump_unreclaimable_slab() was refrained since V2
> > because there are
> > 
> > mutex_lock(&slab_mutex);
> > kmalloc(GFP_KERNEL);
> > mutex_unlock(&slab_mutex);
> > 
> > users. If we call dump_unreclaimable_slab() for non OOM panic path, aren't 
> > we
> > introducing a risk of crash (i.e. kernel panic) for regular OOM path?
> 
> I don't see the difference between regular oom path and oom path other 
> than calling panic() at last.
> 
> And, the slab dump may be called by panic path too, it is for both 
> regular and panic path.

Calling a function that might cause a kernel oops immediately before calling
panic() would be tolerable, for the kernel will panic after all. But calling a
function that might cause a kernel oops when there is no plan to call panic()
is a bug.

> 
> Thanks,
> Yang
> 
> > 
> > We can try mutex_trylock() from dump_unreclaimable_slab() at best.
> > But it is still remaining unsafe, isn't it?
> > 
> 


Re: [PATCH 0/2 v8] oom: capture unreclaimable slab info in oom message

2017-09-27 Thread Tetsuo Handa
On 2017/09/28 6:46, Yang Shi wrote:
> Changelog v7 -> v8:
> * Adopted Michal’s suggestion to dump unreclaim slab info when unreclaimable 
> slabs amount > total user memory. Not only in oom panic path.

Holding slab_mutex inside dump_unreclaimable_slab() was refrained since V2
because there are

mutex_lock(&slab_mutex);
kmalloc(GFP_KERNEL);
mutex_unlock(&slab_mutex);

users. If we call dump_unreclaimable_slab() for non OOM panic path, aren't we
introducing a risk of crash (i.e. kernel panic) for regular OOM path?

We can try mutex_trylock() from dump_unreclaimable_slab() at best.
But it still remains unsafe, doesn't it?
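
For reference, a best-effort sketch of what I mean, reusing the helpers from
the show_unreclaimable_slab() patch quoted in the "[PATCH 3/3]" message below.
This is only an illustration, not a tested patch, and it silently skips the
dump when slab_mutex is contended:

/* Assumed to live in mm/slab_common.c, where slab_mutex and slab_caches are. */
void dump_unreclaimable_slab(void)
{
	struct kmem_cache *s;
	struct slabinfo sinfo;

	/* Never sleep on slab_mutex here; give up if it is contended. */
	if (!mutex_trylock(&slab_mutex)) {
		pr_warn("excessive unreclaimable slab but cannot dump stats\n");
		return;
	}

	pr_info("Unreclaimable slabs:\n");
	list_for_each_entry(s, &slab_caches, list) {
		if (!is_root_cache(s) || (s->flags & SLAB_RECLAIM_ACCOUNT))
			continue;

		memset(&sinfo, 0, sizeof(sinfo));
		get_slabinfo(s, &sinfo);
		if (sinfo.num_objs > 0)
			pr_info("%-17s %luKB\n", cache_name(s),
				(sinfo.num_objs * s->size) / 1024);
	}
	mutex_unlock(&slab_mutex);
}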


Re: [PATCH 3/3] mm: oom: show unreclaimable slab info when kernel panic

2017-09-15 Thread Tetsuo Handa
On 2017/09/15 2:14, Yang Shi wrote:
> @@ -1274,6 +1276,29 @@ static int slab_show(struct seq_file *m, void *p)
>   return 0;
>  }
>  
> +void show_unreclaimable_slab()
> +{
> + struct kmem_cache *s = NULL;
> + struct slabinfo sinfo;
> +
> + memset(&sinfo, 0, sizeof(sinfo));
> +
> + printk("Unreclaimable slabs:\n");
> + mutex_lock(&slab_mutex);

Please avoid sleeping locks which potentially depend on memory allocation.
There are

mutex_lock(&slab_mutex);
kmalloc(GFP_KERNEL);
mutex_unlock(&slab_mutex);

users which will fail to call panic() if they hit this path.

> + list_for_each_entry(s, &slab_caches, list) {
> + if (!is_root_cache(s))
> + continue;
> +
> + get_slabinfo(s, &sinfo);
> +
> + if (!is_reclaimable(s) && sinfo.num_objs > 0)
> + printk("%-17s %luKB\n", cache_name(s), K(sinfo.num_objs 
> * s->size));
> + }
> + mutex_unlock(&slab_mutex);
> +}
> +EXPORT_SYMBOL(show_unreclaimable_slab);
> +#undef K
> +
>  #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
>  void *memcg_slab_start(struct seq_file *m, loff_t *pos)
>  {
> 


Re: [PATCH] mm: respect the __GFP_NOWARN flag when warning about stalls

2017-09-13 Thread Tetsuo Handa
Vlastimil Babka wrote:
> On 09/13/2017 03:54 PM, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> >> Let's see what others think about this.
> > 
> > Whether __GFP_NOWARN should warn about stalls is not a topic to discuss.
> 
> It is the topic of this thread, which tries to address a concrete
> problem somebody has experienced. In that context, the rest of your
> concerns seem to me not related to this problem, IMHO.

I suggested replacing warn_alloc() with a safe/useful alternative rather than
tweaking warn_alloc()'s handling of __GFP_NOWARN.

> 
> > I consider warn_alloc() for reporting stalls is broken. It fails to provide
> > backtrace of stalling location. For example, OOM lockup with oom_lock held
> > cannot be reported by warn_alloc(). It fails to provide readable output when
> > called concurrently. For example, concurrent calls can cause printk()/
> > schedule_timeout_killable() lockup with oom_lock held. printk() offloading 
> > is
> > not an option, for there will be situations where printk() offloading cannot
> > be used (e.g. queuing via printk() is faster than writing to serial consoles
> > which results in unreadable logs due to log_bug overflow).


Re: [PATCH] mm: respect the __GFP_NOWARN flag when warning about stalls

2017-09-13 Thread Tetsuo Handa
Michal Hocko wrote:
> On Mon 11-09-17 19:36:59, Mikulas Patocka wrote:
> > 
> > 
> > On Mon, 11 Sep 2017, Michal Hocko wrote:
> > 
> > > On Mon 11-09-17 02:52:53, Mikulas Patocka wrote:
> > > > I am occasionally getting these warnings in khugepaged. It is an old 
> > > > machine with 550MHz CPU and 512 MB RAM.
> > > > 
> > > > Note that khugepaged has nice value 19, so when the machine is loaded 
> > > > with 
> > > > some work, khugepaged is stalled and this stall produces warning in the 
> > > > allocator.
> > > > 
> > > > khugepaged does allocations with __GFP_NOWARN, but the flag __GFP_NOWARN
> > > > is masked off when calling warn_alloc. This patch removes the masking of
> > > > __GFP_NOWARN, so that the warning is suppressed.
> > > > 
> > > > khugepaged: page allocation stalls for 10273ms, order:10, 
> > > > mode:0x4340ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM),
> > > >  nodemask=(null)
> > > > CPU: 0 PID: 3936 Comm: khugepaged Not tainted 4.12.3 #1
> > > > Hardware name: System Manufacturer Product Name/VA-503A, BIOS 4.51 PG 
> > > > 08/02/00
> > > > Call Trace:
> > > >  ? warn_alloc+0xb9/0x140
> > > >  ? __alloc_pages_nodemask+0x724/0x880
> > > >  ? arch_irq_stat_cpu+0x1/0x40
> > > >  ? detach_if_pending+0x80/0x80
> > > >  ? khugepaged+0x10a/0x1d40
> > > >  ? pick_next_task_fair+0xd2/0x180
> > > >  ? wait_woken+0x60/0x60
> > > >  ? kthread+0xcf/0x100
> > > >  ? release_pte_page+0x40/0x40
> > > >  ? kthread_create_on_node+0x40/0x40
> > > >  ? ret_from_fork+0x19/0x30
> > > > 
> > > > Signed-off-by: Mikulas Patocka 
> > > > Cc: sta...@vger.kernel.org
> > > > Fixes: 63f53dea0c98 ("mm: warn about allocations which stall for too 
> > > > long")
> > > 
> > > This patch hasn't introduced this behavior. It deliberately skipped
> > > warning on __GFP_NOWARN. This has been introduced later by 822519634142
> > > ("mm: page_alloc: __GFP_NOWARN shouldn't suppress stall warnings"). I
> > > disagreed [1] but overall consensus was that such a warning won't be
> > > harmful. Could you be more specific why do you consider it wrong,
> > > please?
> > 
> > I consider the warning wrong, because it warns when nothing goes wrong. 
> > I've got 7 these warnings for 4 weeks of uptime. The warnings typically 
> > happen when I run some compilation.
> > 
> > A process with low priority is expected to be running slowly when there's 
> > some high-priority process, so there's no need to warn that the 
> > low-priority process runs slowly.
> 
> I would tend to agree. It is certainly a noise in the log. And a kind of
> thing I was worried about when objecting the patch previously. 
>  
> > What else can be done to avoid the warning? Skip the warning if the 
> > process has lower priority?
> 
> No, I wouldn't play with priorities. Either we agree that NOWARN
> allocations simply do _not_warn_ or we simply explain users that some of
> those warnings might not be that critical and overloaded system might
> show them.
> 
> Let's see what others think about this.

Whether __GFP_NOWARN should suppress stall warnings is not the topic I want to
discuss. I consider warn_alloc() for reporting stalls broken. It fails to
provide a backtrace of the stalling location; for example, an OOM lockup with
oom_lock held cannot be reported by warn_alloc(). It also fails to provide
readable output when called concurrently; for example, concurrent calls can
cause a printk()/schedule_timeout_killable() lockup with oom_lock held.
printk() offloading is not an option, for there will be situations where
printk() offloading cannot be used (e.g. queuing via printk() is faster than
writing to serial consoles, which results in unreadable logs due to log_buf
overflow).


[4.14-rc1 x86] WARNING: kernel stack regs at f60bbb12 in swapper:1 has bad 'bp' value 0ba00000

2017-09-08 Thread Tetsuo Handa
Hello.

I'm seeing the error below between
4898b99c261efe32 ("Merge tag 'acpi-4.13-rc7' of
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm") (git bisect
good (presumably)) and
e6f3faa734a00c60 ("locking/lockdep: Fix workqueue crossrelease annotation")
(git bisect bad) on linux.git.

# /usr/libexec/qemu-kvm -no-kvm -cpu kvm32 -smp 1 -m 2048 --no-reboot --kernel 
arch/x86/boot/bzImage --nographic --append "panic=1 console=ttyS0,115200n8"
[0.00] random: get_random_bytes called from start_kernel+0x33/0x3ab 
with crng_init=0
[0.00] Linux version 4.13.0-rc6+ (root@ccsecurity) (gcc version 4.8.5 
20150623 (Red Hat 4.8.5-16) (GCC)) #139 Fri Sep 8 21:53:41 JST 2017
[0.00] x86/fpu: x87 FPU will use FXSAVE
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009fbff] usable
[0.00] BIOS-e820: [mem 0x0009fc00-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000f-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0x7fffdfff] usable
[0.00] BIOS-e820: [mem 0x7fffe000-0x7fff] reserved
[0.00] BIOS-e820: [mem 0xfffc-0x] reserved
[0.00] Notice: NX (Execute Disable) protection missing in CPU!
[0.00] random: fast init done
[0.00] SMBIOS 2.4 present.
[0.00] DMI: Red Hat KVM, BIOS 0.5.1 01/01/2011
[0.00] tsc: Fast TSC calibration using PIT
[0.00] e820: last_pfn = 0x7fffe max_arch_pfn = 0x10
[0.00] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WC  UC- UC
[0.00] found SMP MP-table at [mem 0x000f7310-0x000f731f] mapped at 
[c00f7310]
[0.00] Scanning 1 areas for low memory corruption
[0.00] ACPI: Early table checksum verification disabled
[0.00] ACPI: RSDP 0x000F7170 14 (v00 BOCHS )
[0.00] ACPI: RSDT 0x7A9B 30 (v01 BOCHS  BXPCRSDT 
0001 BXPC 0001)
[0.00] ACPI: FACP 0x7177 74 (v01 BOCHS  BXPCFACP 
0001 BXPC 0001)
[0.00] ACPI: DSDT 0x7FFFE040 001137 (v01 BOCHS  BXPCDSDT 
0001 BXPC 0001)
[0.00] ACPI: FACS 0x7FFFE000 40
[0.00] ACPI: SSDT 0x71EB 000838 (v01 BOCHS  BXPCSSDT 
0001 BXPC 0001)
[0.00] ACPI: APIC 0x7A23 78 (v01 BOCHS  BXPCAPIC 
0001 BXPC 0001)
[0.00] 1160MB HIGHMEM available.
[0.00] 887MB LOWMEM available.
[0.00]   mapped low ram: 0 - 377fe000
[0.00]   low ram: 0 - 377fe000
[0.00] Zone ranges:
[0.00]   DMA  [mem 0x1000-0x00ff]
[0.00]   Normal   [mem 0x0100-0x377fdfff]
[0.00]   HighMem  [mem 0x377fe000-0x7fffdfff]
[0.00] Movable zone start for each node
[0.00] Early memory node ranges
[0.00]   node   0: [mem 0x1000-0x0009efff]
[0.00]   node   0: [mem 0x0010-0x7fffdfff]
[0.00] Initmem setup node 0 [mem 0x1000-0x7fffdfff]
[0.00] Using APIC driver default
[0.00] ACPI: PM-Timer IO Port: 0x608
[0.00] ACPI: LAPIC_NMI (acpi_id[0xff] dfl dfl lint[0x1])
[0.00] IOAPIC[0]: apic_id 0 already used, trying 1
[0.00] IOAPIC[0]: apic_id 1, version 17, address 0xfec0, GSI 0-23
[0.00] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[0.00] ACPI: INT_SRC_OVR (bus 0 bus_irq 5 global_irq 5 high level)
[0.00] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
[0.00] ACPI: INT_SRC_OVR (bus 0 bus_irq 10 global_irq 10 high level)
[0.00] ACPI: INT_SRC_OVR (bus 0 bus_irq 11 global_irq 11 high level)
[0.00] Using ACPI (MADT) for SMP configuration information
[0.00] PM: Registered nosave memory: [mem 0x-0x0fff]
[0.00] PM: Registered nosave memory: [mem 0x0009f000-0x0009]
[0.00] PM: Registered nosave memory: [mem 0x000a-0x000e]
[0.00] PM: Registered nosave memory: [mem 0x000f-0x000f]
[0.00] e820: [mem 0x8000-0xfffb] available for PCI devices
[0.00] Booting paravirtualized kernel on bare hardware
[0.00] clocksource: refined-jiffies: mask: 0x max_cycles: 
0x, max_idle_ns: 1910969940391419 ns
[0.00] Built 1 zonelists in Zone order, mobility grouping on.  Total 
pages: 522412
[0.00] Kernel command line: panic=1 console=ttyS0,115200n8
[0.00] PID hash table entries: 4096 (order: 2, 16384 bytes)
[0.00] Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
[0.00] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
[0.00] Initializing CPU#0
[0.00] Initializing HighMem for node 0 (000377fe:0007fffe)
[0.00] Initializing Movable for node 0 (000

Re: printk: what is going on with additional newlines?

2017-09-05 Thread Tetsuo Handa
Petr Mladek wrote:
> Some of these problems would be solved by a custom buffer.
> But you are right. There are less guarantees that it would
> get flushed or that it can be found in case of troubles.
> Now, I am not sure that it is a good idea to use it even
> for a single continuous line.
> 
> I wonder if all this is worth the effort, complexity, and risk.
> We are talking about cosmetic problems after all.
> 
> Well, what do you think about the extra printed information?
> For example:
> 
>message
> 
> It looks straightforward to me. These information
> might be helpful on its own. So, it might be a
> win-win solution.

Yes, if buffering of multiple lines will not be implemented, I do want a
printk context identifier field for each line. I think the context identifier
part will be something like TASK#pid (if outside interrupt) or
CPU#cpunum/#irqlevel (if inside interrupt).
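
As a rough illustration of the prefix I have in mind (the helper name and the
irq level encoding here are made up; only the TASK#/CPU# idea matters):

/* Hypothetical: render a per-line printk context identifier. */
static int snprint_printk_context(char *buf, size_t size)
{
	if (in_task())
		return snprintf(buf, size, "TASK#%d: ", current->pid);
	/* raw_smp_processor_id() avoids preemption warnings; the irq level
	 * is a made-up encoding: 1=softirq, 2=hardirq, 3=NMI. */
	return snprintf(buf, size, "CPU#%d/%u: ", raw_smp_processor_id(),
			in_nmi() ? 3 : in_irq() ? 2 : 1);
}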


Re: printk: what is going on with additional newlines?

2017-09-01 Thread Tetsuo Handa
Linus Torvalds wrote:
> On Tue, Aug 29, 2017 at 1:41 PM, Tetsuo Handa
>  wrote:
> >>
> >> A private buffer has none of those issues.
> >
> > Yes, I posted "[PATCH] printk: Add best-effort printk() buffering." at
> > http://lkml.kernel.org/r/1493560477-3016-1-git-send-email-penguin-ker...@i-love.sakura.ne.jp
> >  .
> 
> No, this is exactly what I *don't* want, because it takes over printk() 
> itself.
> 
> And that's problematic, because nesting happens for various reasons.
> 
> For example, you try to handle that nesting with printk_context(), and
> nothing when an interrupt happens.
> 
> But that is fundamentally broken.
> 
> Just to give an example: what if an interrupt happens, it does this
> buffering thing, then it gets interrupted by *another* interrupt, and
> now the printk from that other interrupt gets incorrectly nested
> together with the first one, because your "printk_context()" gives
> them the same context?

My assumption was that

  (1) task context can be preempted by soft IRQ context, hard IRQ context and 
NMI context.
  (2) soft IRQ context can be preempted by hard IRQ context and NMI context.
  (3) hard IRQ context can be preempted by NMI context.
  (4) An kernel-oops event can interrupt task context, soft IRQ context, hard 
IRQ context
  and NMI context, but the interrupted context can not continue execution of
  vprintk_default() after returning from the kernel-oops event even if the
  kernel-oops event occurred in schedulable context and panic_on_oops == 0.

and thus my "printk_context()" gives them different context.

But my assumption was wrong that

  soft IRQ context can be preempted by different soft IRQ context
  (e.g. SoftIRQ1 can be preempted by SoftIRQ2 while running
  handler for SoftIRQ1, and SoftIRQ2 can be preempted by SoftIRQ3
  while running handler for SoftIRQ2, and so on)

  hard IRQ context can be preempted by different hard IRQ context
  (e.g. HardIRQ1 can be preempted by HardIRQ2 while running
  handler for HardIRQ1, and HardIRQ2 can be preempted by HardIRQ3
  while running handler for HardIRQ2, and so on)

? Then, we need to recognize how many IRQs are nested.

I just tried to distinguish context using one "unsigned long" value
by embedding IRQ status into lower bits of "struct task_struct *".
I can change to distinguish context using multiple "unsigned long" values.
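
For clarity, this is roughly what I meant by embedding the IRQ status into the
lower bits (simplified; the exact encoding in my patch may differ, and as
noted above it cannot distinguish two nested hard IRQs):

static unsigned long printk_context(void)
{
	/* task_struct is at least 4-byte aligned, so the low two bits of
	 * the pointer are free to encode the interrupt nesting class. */
	unsigned long base = (unsigned long) current;

	if (in_nmi())
		return base | 3;
	if (in_irq())
		return base | 2;
	if (in_serving_softirq())
		return base | 1;
	return base;
}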

> 
> And it really doesn't have to even be interrupts. Look at what happens
> if you take a page fault in kernel space. Same exact deal. Both are
> sleeping contexts.

Is merging messages from outside a page fault and inside a page fault really
so serious? That would happen only if a memory access which might cause a page
fault occurs between get_printk_buffer() and put_printk_buffer(), and I think
such users are rare.

> 
> So I really think that the only thing that knows what the "context" is
> is the person who does the printing. So if you want to create a
> printing buffer, it should be explicit. You allocate the buffer ahead
> of time (perhaps on the stack, possibly using actual allocations), and
> you use that explicit context.

If my assumption was wrong, isn't it dangerous from a stack usage point of
view to call kmalloc() (or allocate from stack memory) for prbuf_init() at
each nesting level, because it is theoretically possible that a different IRQ
jumps in while kmalloc() is in progress (or while the stack memory is in use)?

> 
> Yes, it means that you don't do "printk()". You do an actual
> "buf_print()" or similar.
> 
>  Linus
> 

My worry is that so many functions would need to receive a
"struct seq_buf *" argument (from the tail of __dump_stack() to the head of
out_of_memory(), including e.g. cpuset_print_current_mems_allowed()) that the
patches for passing that argument become very large and difficult to keep in
sync. I tried passing such an argument to the relevant functions before I
proposed the "[PATCH] printk: Add best-effort printk() buffering." patch, but
I came to the conclusion that passing it around is too complicated and too
much bloat compared to the merit.

If we teach the printk subsystem that "I want to use buffering" via
get_printk_buffer(), we don't need to scatter a "struct seq_buf *" argument
throughout the kernel.
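
A usage sketch of that API, with the signatures assumed from my RFC
description (get_printk_buffer()/put_printk_buffer() simply bracket the
region; this is not the final interface):

static void report_oom_example(void)
{
	get_printk_buffer();	/* start best-effort buffering for this context */
	pr_info("Out of memory example header\n");
	show_mem(0, NULL);	/* printk()s deep in the call chain still land
				 * in the same buffer, no argument needed */
	put_printk_buffer();	/* flush the buffered lines as one block */
}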

Using kmalloc() for prbuf_init() also introduces problems such as

  (a) we need to care about safe GFP flags (i.e. GFP_ATOMIC or
  GFP_KERNEL or something else which cannot be managed by
  current_gfp_context()) based on locking context

  (b) allocations can fail, and printing allocation failure messages
  when printing original messages is disturbing

  (c) allocation stall/failure messages are printed under

Re: [PATCH 1/2] mm,page_alloc: don't call __node_reclaim() without scoped allocation constraints.

2017-09-01 Thread Tetsuo Handa
Michal Hocko wrote:
> On Fri 01-09-17 21:40:07, Tetsuo Handa wrote:
> > We are doing the first allocation attempt before calling
> > current_gfp_context(). But since slab shrinker functions might depend on
> > __GFP_FS and/or __GFP_IO masking, calling slab shrinker functions from
> > node_reclaim() from get_page_from_freelist() without calling
> > current_gfp_context() has possibility of deadlock. Therefore, make sure
> > that the first memory allocation attempt does not call slab shrinker
> > functions.
> 
> But we do filter gfp_mask at __node_reclaim layer. Not really ideal from
> the readability point of view and maybe it could be cleaned up there
> shouldn't be any bug AFAICS. On the other hand we can save few cycles on
> the hot path that way and there are people who care about every cycle
> there and node reclaim is absolutely the last thing they care about.

Ah, indeed. We later do

struct scan_control sc = {
.gfp_mask = current_gfp_context(gfp_mask),
}

in __node_reclaim(). OK, there will be no problem.


[PATCH 1/2] mm,page_alloc: don't call __node_reclaim() without scoped allocation constraints.

2017-09-01 Thread Tetsuo Handa
We are doing the first allocation attempt before calling
current_gfp_context(). But since slab shrinker functions might depend on
__GFP_FS and/or __GFP_IO masking, calling slab shrinker functions from
node_reclaim() from get_page_from_freelist() without calling
current_gfp_context() opens up a possibility of deadlock. Therefore, make sure
that the first memory allocation attempt does not call slab shrinker
functions.

Well, do we want to call node_reclaim() on the first allocation attempt?

If yes, I guess this patch will not be acceptable. But what are the correct
flags to pass to the first allocation attempt, given that we currently ignore
gfp_allowed_mask masking for the first allocation attempt?

Maybe we can tolerate not calling node_reclaim() on the first allocation
attempt, for commit 31a6c1909f51dbe9 ("mm, page_alloc: set alloc_flags
only once in slowpath") says that the fastpath is trying to avoid the
cost of setting up alloc_flags precisely, which sounds to me like falling
back to the slowpath when node_reclaim() is needed is acceptable?

Signed-off-by: Tetsuo Handa 
Cc: Michal Hocko 
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Cc: David Rientjes 
---
 mm/page_alloc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6dbc49e..20af138 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4189,7 +4189,8 @@ struct page *
finalise_ac(gfp_mask, order, &ac);
 
/* First allocation attempt */
-   page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
+   page = get_page_from_freelist(alloc_mask & ~__GFP_DIRECT_RECLAIM,
+ order, alloc_flags, &ac);
if (likely(page))
goto out;
 
-- 
1.8.3.1



[RFC PATCH 2/2] mm,sched: include memalloc info when printing debug dump of a task.

2017-09-01 Thread Tetsuo Handa
When analyzing memory allocation stalls, the ability to cleanly take snapshots
showing how far progress has been made is important, but warn_alloc() is too
problematic for obtaining useful information. I have been proposing a memory
allocation watchdog kernel thread [1], but nobody seems to be interested in
the ability to take snapshots cleanly.

  [1] 
http://lkml.kernel.org/r/1495331504-12480-1-git-send-email-penguin-ker...@i-love.sakura.ne.jp
 .

This patch is a subset of the watchdog: it introduces state tracking of
__GFP_DIRECT_RECLAIM memory allocations and prints a MemAlloc: line to e.g.
SysRq-t output (like the example shown below).

--
[  222.194538] a.out   R  running task0  1976   1019 0x0080
[  222.197091] MemAlloc: a.out(1976) flags=0x400840 switches=286 seq=142 
gfp=0x1c2004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE) 
order=0 delay=175646
[  222.202617] Call Trace:
[  222.203987]  __schedule+0x23f/0x5d0
[  222.205624]  _cond_resched+0x2d/0x40
[  222.207271]  shrink_page_list+0x61/0xb70
[  222.208960]  shrink_inactive_list+0x24c/0x510
[  222.210794]  shrink_node_memcg+0x360/0x780
[  222.212556]  ? shrink_slab.part.44+0x239/0x2c0
[  222.214409]  shrink_node+0xdc/0x310
[  222.215994]  ? shrink_node+0xdc/0x310
[  222.217617]  do_try_to_free_pages+0xea/0x360
[  222.219416]  try_to_free_pages+0xbd/0xf0
[  222.221120]  __alloc_pages_slowpath+0x464/0xe50
[  222.222992]  ? io_schedule_timeout+0x19/0x40
[  222.224790]  __alloc_pages_nodemask+0x253/0x2c0
[  222.226677]  alloc_pages_current+0x65/0xd0
[  222.228427]  __page_cache_alloc+0x81/0x90
[  222.230301]  pagecache_get_page+0xa6/0x210
[  222.232388]  grab_cache_page_write_begin+0x1e/0x40
[  222.234333]  iomap_write_begin.constprop.17+0x56/0x110
[  222.236375]  iomap_write_actor+0x8f/0x160
[  222.238115]  ? iomap_write_begin.constprop.17+0x110/0x110
[  222.240219]  iomap_apply+0x9a/0x110
[  222.241813]  ? iomap_write_begin.constprop.17+0x110/0x110
[  222.243925]  iomap_file_buffered_write+0x69/0x90
[  222.245828]  ? iomap_write_begin.constprop.17+0x110/0x110
[  222.247968]  xfs_file_buffered_aio_write+0xb7/0x200 [xfs]
[  222.250108]  xfs_file_write_iter+0x8d/0x130 [xfs]
[  222.252045]  __vfs_write+0xef/0x150
[  222.253635]  vfs_write+0xb0/0x190
[  222.255184]  SyS_write+0x50/0xc0
[  222.256689]  do_syscall_64+0x62/0x170
[  222.258312]  entry_SYSCALL64_slow_path+0x25/0x25
--

--
[  406.416809] a.out   R  running task0  1976   1019 0x0080
[  406.416811] MemAlloc: a.out(1976) flags=0x400840 switches=440 seq=142 
gfp=0x1c2004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE) 
order=0 delay=359866
[  406.416811] Call Trace:
[  406.416813]  __schedule+0x23f/0x5d0
[  406.416814]  _cond_resched+0x2d/0x40
[  406.416816]  wait_iff_congested+0x73/0x100
[  406.416817]  ? wait_woken+0x80/0x80
[  406.416819]  shrink_inactive_list+0x36a/0x510
[  406.416820]  shrink_node_memcg+0x360/0x780
[  406.416822]  ? shrink_slab.part.44+0x239/0x2c0
[  406.416824]  shrink_node+0xdc/0x310
[  406.416825]  ? shrink_node+0xdc/0x310
[  406.416827]  do_try_to_free_pages+0xea/0x360
[  406.416828]  try_to_free_pages+0xbd/0xf0
[  406.416829]  __alloc_pages_slowpath+0x464/0xe50
[  406.416831]  ? io_schedule_timeout+0x19/0x40
[  406.416832]  __alloc_pages_nodemask+0x253/0x2c0
[  406.416834]  alloc_pages_current+0x65/0xd0
[  406.416835]  __page_cache_alloc+0x81/0x90
[  406.416837]  pagecache_get_page+0xa6/0x210
[  406.416838]  grab_cache_page_write_begin+0x1e/0x40
[  406.416839]  iomap_write_begin.constprop.17+0x56/0x110
[  406.416841]  iomap_write_actor+0x8f/0x160
[  406.416842]  ? iomap_write_begin.constprop.17+0x110/0x110
[  406.416844]  iomap_apply+0x9a/0x110
[  406.416845]  ? iomap_write_begin.constprop.17+0x110/0x110
[  406.416846]  iomap_file_buffered_write+0x69/0x90
[  406.416848]  ? iomap_write_begin.constprop.17+0x110/0x110
[  406.416857]  xfs_file_buffered_aio_write+0xb7/0x200 [xfs]
[  406.416866]  xfs_file_write_iter+0x8d/0x130 [xfs]
[  406.416867]  __vfs_write+0xef/0x150
[  406.416869]  vfs_write+0xb0/0x190
[  406.416870]  SyS_write+0x50/0xc0
[  406.416871]  do_syscall_64+0x62/0x170
[  406.416872]  entry_SYSCALL64_slow_path+0x25/0x25
--

But this patch can do more than that. It gives administrators the ability to
take administrator-controlled actions based on some threshold, by polling like
the khungtaskd kernel thread does (using e.g. a SystemTap script), for this
patch records the timestamp of the beginning of each memory allocation without
risk of overflowing array capacity and/or skipping probes.

This patch traces the page fault handler and the mempool allocator in addition
to the slowpath of the page allocator, because the page fault handler performs
sleeping operations other than calling the page allocator, the mempool
allocator fails to track accumulated delay due to __GFP_NORETRY, and the
fastpath of the page allocator does not use __GFP_DIRECT_RECLAIM.
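
For reference, the fields printed in the MemAlloc: line map onto per-task
tracking state roughly like the sketch below; the field names are inferred
from the output (seq=, gfp=, order=, delay=), and the actual patch may lay
this out differently:

struct memalloc_info {
	unsigned int sequence;	/* seq=  : increases per tracked allocation */
	unsigned long start;	/* jiffies at allocation start, for delay= */
	unsigned int valid;	/* inside a tracked __GFP_DIRECT_RECLAIM region */
	gfp_t gfp;		/* gfp=  : the allocation's gfp flags */
	unsigned int order;	/* order=: the allocation order */
};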

Signed-off-by: Tetsuo Handa 
Cc

Re: printk: what is going on with additional newlines?

2017-08-29 Thread Tetsuo Handa
Linus Torvalds wrote:
> On Tue, Aug 29, 2017 at 10:00 AM, Linus Torvalds
>  wrote:
> >
> > I refuse to help those things. We mis-designed things
> 
> Actually, let me rephrase that:
> 
> It might actually be a good idea to help those things, by making
> helper functions available that do the marshalling.
> 
> So not calling "printk()" directly, but having a set of simple
> "buffer_print()" functions where each user has its own buffer, and
> then the "buffer_print()" functions will help people do nicely output
> data.
> 
> So if the issue is that people want to print (for example) hex dumps
> one character at a time, but don't want to have each character show up
> on a line of their own, I think we might well add a few functions to
> help dop that.
> 
> But they wouldn't be "printk". They would be the buffering functions
> that then call printk when tyhey have buffered a line.
> 
> That avoids the whole nasty issue with printk - printk wants to show
> stuff early (because _maybe_ it's critical) and printk wants to make
> log records with timestamps and loglevels. And printk has serious
> locking issues that are really nasty and fundamental.
> 
> A private buffer has none of those issues.

Yes, I posted "[PATCH] printk: Add best-effort printk() buffering." at
http://lkml.kernel.org/r/1493560477-3016-1-git-send-email-penguin-ker...@i-love.sakura.ne.jp
 .

> 
>  Linus
> 


Re: [PATCH] mm, memory_hotplug: do not back off draining pcp free pages from kworker context

2017-08-29 Thread Tetsuo Handa
On 2017/08/29 7:33, Andrew Morton wrote:
> On Mon, 28 Aug 2017 11:33:41 +0200 Michal Hocko  wrote:
> 
>> drain_all_pages backs off when called from a kworker context since
>> 0ccce3b924212 ("mm, page_alloc: drain per-cpu pages from workqueue
>> context") because the original IPI based pcp draining has been replaced
>> by a WQ based one and the check wanted to prevent from recursion and
>> inter workers dependencies. This has made some sense at the time
>> because the system WQ has been used and one worker holding the lock
>> could be blocked while waiting for new workers to emerge which can be a
>> problem under OOM conditions.
>>
>> Since then ce612879ddc7 ("mm: move pcp and lru-pcp draining into single
>> wq") has moved draining to a dedicated (mm_percpu_wq) WQ with a rescuer
>> so we shouldn't depend on any other WQ activity to make a forward
>> progress so calling drain_all_pages from a worker context is safe as
>> long as this doesn't happen from mm_percpu_wq itself which is not the
>> case because all workers are required to _not_ depend on any MM locks.
>>
>> Why is this a problem in the first place? ACPI driven memory hot-remove
>> (acpi_device_hotplug) is executed from the worker context. We end
>> up calling __offline_pages to free all the pages and that requires
>> both lru_add_drain_all_cpuslocked and drain_all_pages to do their job
>> otherwise we can have dangling pages on pcp lists and fail the offline
>> operation (__test_page_isolated_in_pageblock would see a page with 0
>> ref. count but without PageBuddy set).
>>
>> Fix the issue by removing the worker check in drain_all_pages.
>> lru_add_drain_all_cpuslocked doesn't have this restriction so it works
>> as expected.
>>
>> Fixes: 0ccce3b924212 ("mm, page_alloc: drain per-cpu pages from workqueue 
>> context")
>> Signed-off-by: Michal Hocko 
> 
> No cc:stable?
> 

Michal, are you sure that this patch does not cause deadlock?

As shown in "[PATCH] mm: Use WQ_HIGHPRI for mm_percpu_wq." thread, currently 
work
items on mm_percpu_wq seem to be blocked by other work items not on 
mm_percpu_wq.



Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer

2017-08-11 Thread Tetsuo Handa
Michal Hocko wrote:
> On Fri 11-08-17 16:54:36, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Fri 11-08-17 11:28:52, Tetsuo Handa wrote:
> > > > Will you explain the mechanism why random values are written instead of 
> > > > zeros
> > > > so that this patch can actually fix the race problem?
> > > 
> > > I am not sure what you mean here. Were you able to see a write with an
> > > unexpected content?
> > 
> > Yes. See 
> > http://lkml.kernel.org/r/201708072228.faj09347.toovoffqjsh...@i-love.sakura.ne.jp
> >  .
> 
> Ahh, I've missed that random part of your output. That is really strange
> because AFAICS the oom reaper shouldn't really interact here. We are
> only unmapping anonymous memory and even if a refault slips through we
> should always get zeros.
> 
> Your test case doesn't mmap MAP_PRIVATE of a file so we shouldn't even
> get any uninitialized data from a file by missing CoWed content. The
> only possible explanations would be that a page fault returned a
> non-zero data which would be a bug on its own or that a file write
> extend the file without actually writing to it which smells like a fs
> bug to me.

As I wrote at 
http://lkml.kernel.org/r/201708112053.fig52141.thjsoqflofm...@i-love.sakura.ne.jp
 ,
I don't think it is a fs bug.

> 
> Anyway I wasn't able to reproduce this and I was running your usecase
> in the loop for quite some time (with xfs storage). How reproducible
> is this? If you can reproduce easily can you simply comment out
> unmap_page_range in __oom_reap_task_mm and see if that makes any change
> just to be sure that the oom reaper can be ruled out?

The frequency of writing non-zero values is lower than the frequency of
writing zero values. But if I comment out unmap_page_range() in
__oom_reap_task_mm(), I can't even reproduce the writing of zero values. As
far as I tested, writing non-zero values occurs only if the OOM reaper is
involved.


Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer

2017-08-11 Thread Tetsuo Handa
Andrea Arcangeli wrote:
> On Fri, Aug 11, 2017 at 12:22:56PM +0200, Andrea Arcangeli wrote:
> > disk block? This would happen on ext4 as well if mounted with -o
> > journal=data instead of -o journal=ordered in fact, perhaps you simply
> 
> Oops above I meant journal=writeback, journal=data is even stronger
> than journal=ordered of course.
> 
> And I shall clarify further that old disk content can only showup
> legitimately on journal=writeback after a hard reboot or crash or in
> general an unclean unmount. Even if there's no journaling at all
> (i.e. ext2/vfat) old disk content cannot be shown at any given time no
> matter what if there's no unclean unmount that requires a journal
> reply.

I'm using XFS on a small non-NUMA system (4 CPUs / 4096MB RAM).

  /dev/sda1 / xfs rw,relatime,attr2,inode64,noquota 0 0

As far as I tested, non-zero non-0xff values did not show up with the 4.6.7
kernel (i.e. all non-0xff bytes were zero), while non-zero non-0xff values do
show up with the 4.13.0-rc4-next-20170811 kernel.

> 
> This theory of a completely unrelated fs bug showing you disk content
> as result of the OOM reaper induced SIGBUS interrupting a
> copy_from_user at its very start, is purely motivated by the fact like
> Michal I didn't see much explanation on the VM side that could cause
> those not-zero not-0xff values showing up in the buffer of the write
> syscall. You can try to change fs and see if it happens again to rule
> it out. If it always happens regardless of the filesystem used, then
> it's likely not a fs bug of course. You've got an entire and aligned
> 4k fs block showing up that data.
> 

What is strange is that, as far as I tested, the pattern of non-zero non-0xff
bytes seems to always be the same. Such a thing is unlikely to happen if old
content on the disk is showing up by chance. Maybe the content written is not
random but a specific 4096 bytes of the memory image of an executable file.

$ cat checker.c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
	char buffer2[64] = { };
	int ret = 0;
	int i;
	for (i = 0; i < 1024; i++) {
		int flag = 0;
		int fd;
		unsigned int byte[256];
		int j;
		snprintf(buffer2, sizeof(buffer2), "/tmp/file.%u", i);
		fd = open(buffer2, O_RDONLY);
		if (fd == EOF)
			continue;
		memset(byte, 0, sizeof(byte));
		while (1) {
			static unsigned char buffer[1048576];
			int len = read(fd, (char *) buffer, sizeof(buffer));
			if (len <= 0)
				break;
			for (j = 0; j < len; j++)
				if (buffer[j] != 0xFF)
					byte[buffer[j]]++;
		}
		close(fd);
		for (j = 0; j < 255; j++)
			if (byte[j]) {
				printf("ERROR: %u %u in %s\n", byte[j], j, buffer2);
				flag = 1;
			}
		if (flag == 0)
			unlink(buffer2);
		else
			ret = 1;
	}
	return ret;
}
$ uname -r
4.13.0-rc4-next-20170811
$ while ./checker; do echo start; ./a.out ; echo end; done
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
ERROR: 4096 0 in /tmp/file.4
$ /bin/rm /tmp/file.4
$ while ./checker; do echo start; ./a.out ; echo end; done
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
ERROR: 4096 0 in /tmp/file.6
$ /bin/rm /tmp/file.6
$ while ./checker; do echo start; ./a.out ; echo end; done
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
ERROR: 4096 0 in /tmp/file.0
$ /bin/rm /tmp/file.0
$ while ./checker; do echo start; ./a.out ; echo end; done
start
Killed
end
start
Killed
end
start
Killed
end
start
Killed
end
ERROR: 2549 0 in /tmp/file.4
ERROR: 40 1 in /tmp/file.4
ERROR: 53 2 in /tmp/file.4
ERROR: 29 3 in /tmp/file.4
ERROR: 27 4 in /tmp/file.4
ERROR: 5 5 in /tmp/file.4
ERROR: 14 6 in /tmp/file.4
ERROR: 8 7 in /tmp/file.4
ERROR: 16 8 in /tmp/file.4
ERROR: 4 9 in /tmp/file.4
ERROR: 12 10 in /tmp/file.4
ERROR: 4 11 in /tmp/file.4
ERROR: 2 12 in /tmp/file.4
ERROR: 10 13 in /tmp/file.4
ERROR: 13 14 in /tmp/file.4
ERROR: 4 15 in /tmp/file.4
ERROR: 26 16 in /tmp/file.4
ERROR: 5 17 in /tmp/file.4
ERROR: 23 18 in /tmp/file.4
ERROR: 4 19 in /tmp/file.4
ERROR: 8 20 in /tmp/file.4
ERROR: 2 21 in /tmp/file.4
ERROR: 1 22 in /tmp/file.4
ERROR: 2 23 in /tmp/file.4
ERROR: 17 24 in /tmp/file.4
ERROR: 5 25 in /tmp/file.4
ERROR: 2 26 in /tmp/file.4
ERROR: 1 27 in /tmp/file.4
ERROR: 3 28 in /tmp/file.4
ERROR: 17 32 in /tmp/file.4
ERROR: 1 35 in 

Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer

2017-08-11 Thread Tetsuo Handa
Michal Hocko wrote:
> On Fri 11-08-17 11:28:52, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > +/*
> > > + * Checks whether a page fault on the given mm is still reliable.
> > > + * This is no longer true if the oom reaper started to reap the
> > > + * address space which is reflected by MMF_UNSTABLE flag set in
> > > + * the mm. At that moment any !shared mapping would lose the content
> > > + * and could cause a memory corruption (zero pages instead of the
> > > + * original content).
> > > + *
> > > + * User should call this before establishing a page table entry for
> > > + * a !shared mapping and under the proper page table lock.
> > > + *
> > > + * Return 0 when the PF is safe VM_FAULT_SIGBUS otherwise.
> > > + */
> > > +static inline int check_stable_address_space(struct mm_struct *mm)
> > > +{
> > > + if (unlikely(test_bit(MMF_UNSTABLE, &mm->flags)))
> > > + return VM_FAULT_SIGBUS;
> > > + return 0;
> > > +}
> > > +
> > 
> > Will you explain the mechanism why random values are written instead of 
> > zeros
> > so that this patch can actually fix the race problem?
> 
> I am not sure what you mean here. Were you able to see a write with an
> unexpected content?

Yes. See 
http://lkml.kernel.org/r/201708072228.faj09347.toovoffqjsh...@i-love.sakura.ne.jp
 .


Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer

2017-08-10 Thread Tetsuo Handa
Michal Hocko wrote:
> +/*
> + * Checks whether a page fault on the given mm is still reliable.
> + * This is no longer true if the oom reaper started to reap the
> + * address space which is reflected by MMF_UNSTABLE flag set in
> + * the mm. At that moment any !shared mapping would lose the content
> + * and could cause a memory corruption (zero pages instead of the
> + * original content).
> + *
> + * User should call this before establishing a page table entry for
> + * a !shared mapping and under the proper page table lock.
> + *
> + * Return 0 when the PF is safe VM_FAULT_SIGBUS otherwise.
> + */
> +static inline int check_stable_address_space(struct mm_struct *mm)
> +{
> + if (unlikely(test_bit(MMF_UNSTABLE, &mm->flags)))
> + return VM_FAULT_SIGBUS;
> + return 0;
> +}
> +

Will you explain the mechanism by which random values are written instead of
zeros, so that this patch can actually fix the race problem? I consider that
writing random values (though it seems to be a portion of the process image)
instead of zeros to a file might cause a security problem, and the patch that
fixes it should be backportable to stable kernels.


Re: Re: [PATCH] oom_reaper: close race without using oom_lock

2017-08-10 Thread Tetsuo Handa
Michal Hocko wrote:
> On Thu 10-08-17 21:10:30, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Tue 08-08-17 11:14:50, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > On Sat 05-08-17 10:02:55, Tetsuo Handa wrote:
> > > > > > Michal Hocko wrote:
> > > > > > > On Wed 26-07-17 20:33:21, Tetsuo Handa wrote:
> > > > > > > > My question is, how can users know it if somebody was 
> > > > > > > > OOM-killed needlessly
> > > > > > > > by allowing MMF_OOM_SKIP to race.
> > > > > > > 
> > > > > > > Is it really important to know that the race is due to 
> > > > > > > MMF_OOM_SKIP?
> > > > > > 
> > > > > > Yes, it is really important. Needlessly selecting even one OOM 
> > > > > > victim is
> > > > > > a pain which is difficult to explain to and persuade some of 
> > > > > > customers.
> > > > > 
> > > > > How is this any different from a race with a task exiting an releasing
> > > > > some memory after we have crossed the point of no return and will kill
> > > > > something?
> > > > 
> > > > I'm not complaining about an exiting task releasing some memory after 
> > > > we have
> > > > crossed the point of no return.
> > > > 
> > > > What I'm saying is that we can postpone "the point of no return" if we 
> > > > ignore
> > > > MMF_OOM_SKIP for once (both this "oom_reaper: close race without using 
> > > > oom_lock"
> > > > thread and "mm, oom: task_will_free_mem(current) should ignore 
> > > > MMF_OOM_SKIP for
> > > > once." thread). These are race conditions we can avoid without crystal 
> > > > ball.
> > > 
> > > If those races are really that common than we can handle them even
> > > without "try once more" tricks. Really this is just an ugly hack. If you
> > > really care then make sure that we always try to allocate from memory
> > > reserves before going down the oom path. In other words, try to find a
> > > robust solution rather than tweaks around a problem.
> > 
> > Since your "mm, oom: allow oom reaper to race with exit_mmap" patch removes
> > oom_lock serialization from the OOM reaper, possibility of calling 
> > out_of_memory()
> > due to successful mutex_trylock(&oom_lock) would increase when the OOM 
> > reaper set
> > MMF_OOM_SKIP quickly.
> > 
> > What if task_is_oom_victim(current) became true and MMF_OOM_SKIP was set
> > on current->mm between after __gfp_pfmemalloc_flags() returned 0 and before
> > out_of_memory() is called (due to successful mutex_trylock(&oom_lock)) ?
> > 
> > Excuse me? Are you suggesting to try memory reserves before
> > task_is_oom_victim(current) becomes true?
> 
> No what I've tried to say is that if this really is a real problem,
> which I am not sure about, then the proper way to handle that is to
> attempt to allocate from memory reserves for an oom victim. I would be
> even willing to take the oom_lock back into the oom reaper path if the
> former turnes out to be awkward to implement. But all this assumes this
> is a _real_ problem.

Aren't we back to square one? My question is, how can users know whether
somebody was OOM-killed needlessly because MMF_OOM_SKIP was allowed to race?

You don't want to call get_page_from_freelist() from out_of_memory(), do you?
But without passing a flag "whether get_page_from_freelist() with memory
reserves was already attempted if the current thread is an OOM victim" to
task_will_free_mem() in out_of_memory(), and a flag "whether
get_page_from_freelist() without memory reserves was already attempted if the
current thread is not an OOM victim" to the test_bit(MMF_OOM_SKIP) check in
oom_evaluate_task(), we won't be able to know whether somebody was OOM-killed
needlessly because MMF_OOM_SKIP was allowed to race.

Will you accept passing such flags (something like the incomplete patch shown
below)?

--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -35,6 +35,8 @@ struct oom_control {
 */
const int order;
 
+   const bool reserves_tried;
+
/* Used by oom implementation, do not set */
unsigned long totalpages;
struct task_struct *chosen;
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -303,8 +303,10 @@ static int oom_evaluate_task(struct task_struct *task, 
void *arg)
 * any memory is quite low.
 */
if (!is_sysrq_oo

Re: Re: [PATCH] oom_reaper: close race without using oom_lock

2017-08-10 Thread Tetsuo Handa
Michal Hocko wrote:
> On Tue 08-08-17 11:14:50, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Sat 05-08-17 10:02:55, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > On Wed 26-07-17 20:33:21, Tetsuo Handa wrote:
> > > > > > My question is, how can users know it if somebody was OOM-killed 
> > > > > > needlessly
> > > > > > by allowing MMF_OOM_SKIP to race.
> > > > > 
> > > > > Is it really important to know that the race is due to MMF_OOM_SKIP?
> > > > 
> > > > Yes, it is really important. Needlessly selecting even one OOM victim is
> > > > a pain which is difficult to explain to and persuade some of customers.
> > > 
> > > How is this any different from a race with a task exiting an releasing
> > > some memory after we have crossed the point of no return and will kill
> > > something?
> > 
> > I'm not complaining about an exiting task releasing some memory after we 
> > have
> > crossed the point of no return.
> > 
> > What I'm saying is that we can postpone "the point of no return" if we 
> > ignore
> > MMF_OOM_SKIP for once (both this "oom_reaper: close race without using 
> > oom_lock"
> > thread and "mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP 
> > for
> > once." thread). These are race conditions we can avoid without crystal ball.
> 
> If those races are really that common than we can handle them even
> without "try once more" tricks. Really this is just an ugly hack. If you
> really care then make sure that we always try to allocate from memory
> reserves before going down the oom path. In other words, try to find a
> robust solution rather than tweaks around a problem.

Since your "mm, oom: allow oom reaper to race with exit_mmap" patch removes
oom_lock serialization from the OOM reaper, possibility of calling 
out_of_memory()
due to successful mutex_trylock(&oom_lock) would increase when the OOM reaper 
set
MMF_OOM_SKIP quickly.

What if task_is_oom_victim(current) became true and MMF_OOM_SKIP was set
on current->mm between after __gfp_pfmemalloc_flags() returned 0 and before
out_of_memory() is called (due to successful mutex_trylock(&oom_lock)) ?

Excuse me? Are you suggesting to try memory reserves before
task_is_oom_victim(current) becomes true?

> 
> [...]
> > > Yes that is possible. Once you are in the shrinker land then you have to
> > > count with everything. And if you want to imply that
> > > get_page_from_freelist inside __alloc_pages_may_oom may lockup while
> > > holding the oom_lock then you might be right but I haven't checked that
> > > too deeply. It might be very well possible that the node reclaim bails
> > > out early when we are under OOM.
> > 
> > Yes, I worry that get_page_from_freelist() with oom_lock held might lockup.
> > 
> > If we are about to invoke the OOM killer for the first time, it is likely 
> > that
> > __node_reclaim() finds nothing to reclaim and will bail out immediately. 
> > But if
> > we are about to invoke the OOM killer again, it is possible that small 
> > amount of
> > memory was reclaimed by the OOM killer/reaper, and all reclaimed memory was 
> > assigned
> > to things which __node_reclaim() will find and try to reclaim, and any 
> > thread which
> > took oom_lock will call __node_reclaim() and __node_reclaim() find something
> > reclaimable if __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory allocation is 
> > involved.
> > 
> > We should consider such situation volatile (i.e. should not make assumption 
> > that
> > get_page_from_freelist() with oom_lock held shall bail out immediately) if 
> > shrinkers
> > which (directly or indirectly) involve __GFP_DIRECT_RECLAIM && 
> > !__GFP_NORETRY memory
> > allocation are permitted.
> 
> Well, I think you are so focused on details that you most probably miss
> a large picture here. Just think about the purpose of the node reclaim.
> It is there to _prefer_ local allocations than go to a distant NUMA
> node. So rather than speculating about details maybe it makes sense to
> consider whether it actually makes any sense to even try to node reclaim
> when we are OOM. In other words why to do an additional reclaim when we
> just found out that all reclaim attempts have failed...

Below is what I would propose if there is a possibility of lockup.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index be5bd60..718b2e7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3271,9 +

Re: [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer

2017-08-08 Thread Tetsuo Handa
Andrea Arcangeli wrote:
> Overall OOM killing to me was reliable also before the oom reaper was
> introduced.

I don't think so. We spent a lot of time in order to remove possible locations
which can lead to failing to invoke the OOM killer when out_of_memory() is 
called.

> 
> I just did a search in bz for RHEL7 and there's a single bugreport
> related to OOM issues but it's hanging in a non-ext4 filesystem, and
> not nested in alloc_pages (but in wait_for_completion) and it's not
> reproducible with ext4. And it's happening only in an artificial
> specific "eatmemory" stress test from QA, there seems to be zero
> customer related bugreports about OOM hangs.

Since RHEL7 changed the default filesystem from ext4 to xfs, OOM related
problems became much more likely to occur, for xfs involves many kernel
threads where TIF_MEMDIE based access to memory reserves cannot work among the
relevant threads.

Judging from my experience at a support center, it is too difficult for
customers to report OOM hangs. It requires customers to stand by in front of
the console twenty-four seven so that we can get SysRq-t etc. whenever an OOM
related problem is suspected. We can't ask customers for such effort. The
absence of reports does not mean that OOM hangs are not occurring outside of
artificial memory stress tests.

> 
> A couple of years ago I could trivially trigger OOM deadlocks on
> various ext4 paths that loops or use GFP_NOFAIL, but that was just a
> matter of letting GFP_NOIO/NOFS/NOFAIL kind of allocation go through
> memory reserves below the low watermark.
> 
> It is also fine to kill a few more processes in fact. It's not the end
> of the world if two tasks are killed because the first one couldn't
> reach exit_mmap without oom reaper assistance. The fs kind of OOM
> hangs in kernel threads are major issues if the whole filesystem in
> the journal or something tends to prevent a multitude of tasks to
> handle SIGKILL, so it has to be handled with reserves and it looked
> like it was working fine already.
> 
> The main point of the oom reaper nowadays is to free memory fast
> enough so a second task isn't killed as a false positive, but it's not
> like anybody will notice much of a difference if a second task is
> killed, it wasn't commonly happening either.

The OOM reaper does not need to free memory particularly fast, for the OOM
killer does not select a second task to kill until either the OOM reaper or
__mmput() sets MMF_OOM_SKIP.

I think that the main point of the OOM reaper nowadays is
"how can we allow the OOM reaper to take mmap_sem for read (because
khugepaged might take mmap_sem of the OOM victim for write)"

--
[  493.787997] Out of memory: Kill process 3163 (a.out) score 739 or sacrifice 
child
[  493.791708] Killed process 3163 (a.out) total-vm:4268108kB, 
anon-rss:2754236kB, file-rss:0kB, shmem-rss:0kB
[  494.838382] oom_reaper: unable to reap pid:3163 (a.out)
[  494.847768] 
[  494.847768] Showing all locks held in the system:
[  494.861357] 1 lock held by oom_reaper/59:
[  494.865903]  #0:  (tasklist_lock){.+.+..}, at: [] 
debug_show_all_locks+0x3d/0x1a0
[  494.872934] 1 lock held by khugepaged/63:
[  494.877426]  #0:  (&mm->mmap_sem){++}, at: [] 
khugepaged+0x99d/0x1af0
[  494.884165] 3 locks held by kswapd0/75:
[  494.888628]  #0:  (shrinker_rwsem){..}, at: [] 
shrink_slab.part.44+0x48/0x2b0
[  494.894125]  #1:  (&type->s_umount_key#30){++}, at: [] 
trylock_super+0x16/0x50
[  494.898328]  #2:  (&pag->pag_ici_reclaim_lock){+.+.-.}, at: 
[] xfs_reclaim_inodes_ag+0x3ad/0x4d0 [xfs]
[  494.902703] 3 locks held by kworker/u128:31/387:
[  494.905404]  #0:  ("writeback"){.+.+.+}, at: [] 
process_one_work+0x1fc/0x480
[  494.909237]  #1:  ((&(&wb->dwork)->work)){+.+.+.}, at: [] 
process_one_work+0x1fc/0x480
[  494.913205]  #2:  (&type->s_umount_key#30){++}, at: [] 
trylock_super+0x16/0x50
[  494.916954] 1 lock held by xfsaild/sda1/422:
[  494.919288]  #0:  (&xfs_nondir_ilock_class){--}, at: 
[] xfs_ilock_nowait+0x148/0x240 [xfs]
[  494.923470] 1 lock held by systemd-journal/491:
[  494.926102]  #0:  (&(&ip->i_mmaplock)->mr_lock){++}, at: 
[] xfs_ilock+0x11a/0x1b0 [xfs]
[  494.929942] 1 lock held by gmain/745:
[  494.932368]  #0:  (&(&ip->i_mmaplock)->mr_lock){++}, at: 
[] xfs_ilock+0x11a/0x1b0 [xfs]
[  494.936505] 1 lock held by tuned/1009:
[  494.938856]  #0:  (&(&ip->i_mmaplock)->mr_lock){++}, at: 
[] xfs_ilock+0x11a/0x1b0 [xfs]
[  494.942824] 2 locks held by agetty/982:
[  494.944900]  #0:  (&tty->ldisc_sem){.+}, at: [] 
ldsem_down_read+0x1f/0x30
[  494.948244]  #1:  (&ldata->atomic_read_lock){+.+...}, at: 
[] n_tty_read+0xbf/0x8e0
[  494.952118] 1 lock held by sendmail/984:
[  494.954408]  #0:  (&(&ip->i_mmaplock)->mr_lock){++}, at: 
[] xfs_ilock+0x11a/0x1b0 [xfs]
[  494.958370] 5 locks held by a.out/3163:
[  494.960544]  #0:  (&mm->mmap_sem){++}, at: [] 
__do_page_fault+0x154/0x4c0
[  494.964191]  #1:  (shrinker_rwsem){..}, at: [] 
shrink_slab.part.44+0x4
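
The dump shows why this matters: the reaper only ever trylocks mmap_sem. A hedged
sketch of that retry logic (simplified from my reading of mm/oom_kill.c of that
era, not a verbatim copy):

--
static void oom_reap_task_sketch(struct task_struct *tsk, struct mm_struct *mm)
{
	int attempts = 0;

	while (attempts++ < 10) {	/* MAX_OOM_REAP_RETRIES-like bound */
		if (down_read_trylock(&mm->mmap_sem)) {
			/* unmap the reapable anonymous ranges here */
			up_read(&mm->mmap_sem);
			break;
		}
		/* a writer (e.g. khugepaged above) holds mmap_sem; back off */
		schedule_timeout_idle(HZ / 10);
	}
	/* give up either way; this is what allows selecting the next victim */
	set_bit(MMF_OOM_SKIP, &mm->flags);
}
--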

Re: [PATCH 0/2] mm, oom: fix oom_reaper fallouts

2017-08-07 Thread Tetsuo Handa
Michal Hocko wrote:
> On Mon 07-08-17 22:28:27, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > Hi,
> > > there are two issues this patch series attempts to fix. First one is
> > > something that has been broken since MMF_UNSTABLE flag introduction
> > > and I guess we should backport it stable trees (patch 1). The other
> > > issue has been brought up by Wenwei Tao and Tetsuo Handa has created
> > > a test case to trigger it very reliably. I am not yet sure this is a
> > > stable material because the test case is rather artificial. If there is
> > > a demand for the stable backport I will prepare it, of course, though.
> > > 
> > > I hope I've done the second patch correctly but I would definitely
> > > appreciate some more eyes on it. Hence CCing Andrea and Kirill. My
> > > previous attempt with some more context was posted here
> > > http://lkml.kernel.org/r/20170803135902.31977-1-mho...@kernel.org
> > > 
> > > My testing didn't show anything unusual with these two applied on top of
> > > the mmotm tree.
> > 
> > I really don't like your likely/unlikely speculation.
> 
> Have you seen any non artificial workload triggering this?

It will take 5 to 10 years from now to know whether a non-artificial
workload triggers this. (I mean, that is when customers start using RHEL8.)

>Look, I am
> not going to argue about how likely this is or not. I've said I am
> willing to do backports if there is a demand but please do realize that
> this is not a trivial change to backport pre 4.9 kernels would require
> MMF_UNSTABLE to be backported as well. This all can be discussed
> after the merge so can we focus on the review now rather than any
> distractions?

3f70dc38cec2 was not working as expected. Nobody tested that OOM situation.
Therefore, I think we can revert 3f70dc38cec2 and make it possible to apply
MMF_UNSTABLE uniformly to all 4.6+ kernels.

> 
> Also please note that while writing zeros is certainly bad any integrity
> assumptions are basically off when an application gets killed
> unexpectedly while performing an IO.

I consider unexpectedly saving a process image (instead of zeros) to a file
to be similar to the fs.suid_dumpable problem (i.e. it could cause a security
problem). I do expect this patch to be backported to RHEL8 (I don't know which
kernel version RHEL8 will choose, but I guess it will be between 4.6 and 4.13).


Re: [PATCH 0/2] mm, oom: fix oom_reaper fallouts

2017-08-07 Thread Tetsuo Handa
Michal Hocko wrote:
> Hi,
> there are two issues this patch series attempts to fix. First one is
> something that has been broken since MMF_UNSTABLE flag introduction
> and I guess we should backport it stable trees (patch 1). The other
> issue has been brought up by Wenwei Tao and Tetsuo Handa has created
> a test case to trigger it very reliably. I am not yet sure this is a
> stable material because the test case is rather artificial. If there is
> a demand for the stable backport I will prepare it, of course, though.
> 
> I hope I've done the second patch correctly but I would definitely
> appreciate some more eyes on it. Hence CCing Andrea and Kirill. My
> previous attempt with some more context was posted here
> http://lkml.kernel.org/r/20170803135902.31977-1-mho...@kernel.org
> 
> My testing didn't show anything unusual with these two applied on top of
> the mmotm tree.

I really don't like your likely/unlikely speculation.
I can trigger this problem with 8 threads using 4.13.0-rc2-next-20170728.
The written data can be random values (it seems to be a portion of the a.out
memory image). I guess that unexpected information leakage is possible.

--
$ cat 0804.c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/stat.h>

#define NUMTHREADS 8
#define STACKSIZE 8192

static int pipe_fd[2] = { EOF, EOF };
static int file_writer(void *i)
{
static char buffer[1048576];
int fd;
char buffer2[64] = { };
snprintf(buffer2, sizeof(buffer2), "/tmp/file.%lu", (unsigned long) i);
fd = open(buffer2, O_WRONLY | O_CREAT | O_APPEND, 0600);
memset(buffer, 0xFF, sizeof(buffer));
read(pipe_fd[0], buffer, 1);
while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer));
return 0;
}

int main(int argc, char *argv[])
{
char *buf = NULL;
unsigned long size;
unsigned long i;
char *stack;
if (pipe(pipe_fd))
return 1;
stack = malloc(STACKSIZE * NUMTHREADS);
for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
char *cp = realloc(buf, size);
if (!cp) {
size >>= 1;
break;
}
buf = cp;
}
for (i = 0; i < NUMTHREADS; i++)
if (clone(file_writer, stack + (i + 1) * STACKSIZE,
  CLONE_THREAD | CLONE_SIGHAND | CLONE_VM | CLONE_FS |
  CLONE_FILES, (void *) i) == -1)
break;
close(pipe_fd[1]);
/* Will cause OOM due to overcommit; if not use SysRq-f */
for (i = 0; i < size; i += 4096)
buf[i] = 0;
kill(-1, SIGKILL);
return 0;
}
$ gcc -Wall -O3 0804.c
$ while :; do ./a.out; cat /tmp/file.* | od -b; /bin/rm /tmp/file.*; done
Killed
000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
641524
Killed
000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
446100
Killed
000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
376250
Killed
000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
434736
Killed
000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
645130
Killed
000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
356401 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
*
356402 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
505540
Killed
000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
544662
Killed
000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
620347
Killed
000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
412756
Killed
000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
522417
Killed
000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
561245
Killed
000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
476642 055 061 061 051 000 000 056 163 171 155 164 141 142 000 056 163
4766420020 164 162 164 141 142 000 056 163 150 163 164 162 164 141 142 000
4766420040 056 151 156 164 145 162 160 000 056 156 157 164 145 056 101 102
4766420060 111 055 164 141 147 000 056 156 157 164 145 056 147 156 165 056
4766420100 142 165 151 154 144 055 151 144 000 056 147 156 165 056 150 141
4766420120 163 150 000 056 144 171 156 163 171 155 000 056 144 171 156 163
4766420140 164 162 000 056 147 156 165 056 166 145 162 163 151 157 156 000
4766420160 056 147 156 165 056 166 145 162 163 151 157 156 137 162 000 056
4766420200 162 145 154 141 056 144 171 156 000 056 162 145 154 141 056 160
4766420220 154 164 000 056 151 156 151 164 000 056 164 145 170 164 000 056
476642024

Re: [PATCH] oom_reaper: close race without using oom_lock

2017-08-04 Thread Tetsuo Handa
Michal Hocko wrote:
> On Wed 26-07-17 20:33:21, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Sun 23-07-17 09:41:50, Tetsuo Handa wrote:
> > > > So, how can we verify the above race a real problem?
> > > 
> > > Try to simulate a _real_ workload and see whether we kill more tasks
> > > than necessary. 
> > 
> > Whether it is a _real_ workload or not cannot become an answer.
> > 
> > If somebody is trying to allocate hundreds/thousands of pages after memory 
> > of
> > an OOM victim was reaped, avoiding this race window makes no sense; next OOM
> > victim will be selected anyway. But if somebody is trying to allocate only 
> > one
> > page and then is planning to release a lot of memory, avoiding this race 
> > window
> > can save somebody from being OOM-killed needlessly. This race window 
> > depends on
> > what the threads are about to do, not whether the workload is natural or
> > artificial.
> 
> And with a desparate lack of crystal ball we cannot do much about that
> really.
> 
> > My question is, how can users know it if somebody was OOM-killed needlessly
> > by allowing MMF_OOM_SKIP to race.
> 
> Is it really important to know that the race is due to MMF_OOM_SKIP?

Yes, it is really important. Needlessly selecting even one OOM victim is
a pain which is difficult to explain to some customers and to persuade them
to accept.

> Isn't it sufficient to see that we kill too many tasks and then debug it
> further once something hits that?

It is not sufficient.

> 
> [...]
> > Is it guaranteed that __node_reclaim() never (even indirectly) waits for
> > __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory allocation?
> 
> this is a direct reclaim which can go down to slab shrinkers with all
> the usual fun...

Excuse me, but does that mean "Yes, it is" ?

As far as I checked, most shrinkers use nothing that schedules other than
cond_resched(). But some shrinkers use lock_page()/down_write() etc., and I
worry that such shrinkers might end up waiting for a __GFP_DIRECT_RECLAIM &&
!__GFP_NORETRY memory allocation (i.e. "No, it isn't").
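
To make the worry concrete, a problematic shrinker could look like the following
(purely illustrative; the lock and callback names are hypothetical, not taken
from any real shrinker):

--
static DEFINE_MUTEX(demo_cache_lock);	/* hypothetical cache lock */

static unsigned long demo_cache_scan(struct shrinker *s,
				     struct shrink_control *sc)
{
	/*
	 * If another task blocked in a __GFP_DIRECT_RECLAIM &&
	 * !__GFP_NORETRY allocation holds demo_cache_lock, this shrinker
	 * (and hence direct reclaim) ends up waiting on that allocation.
	 */
	mutex_lock(&demo_cache_lock);
	/* ... free some cached objects here ... */
	mutex_unlock(&demo_cache_lock);
	return 0;	/* number of objects freed */
}
--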


Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer

2017-08-04 Thread Tetsuo Handa
Michal Hocko wrote:
> And that's why we still see the corruption. That, however, means that
> the MMF_UNSTABLE implementation has to be more complex and we have to
> hook into all anonymous memory fault paths which I hoped I could avoid
> previously.

I don't understand mm internals including pte/ptl etc. , but I guess that
the direction is correct. Since the OOM reaper basically does

  Set MMF_UNSTABLE flag on mm_struct.
  For each reapable page in mm_struct {
Take ptl lock.
Remove pte.
Release ptl lock.
  }

the page fault handler will need to check MMF_UNSTABLE with lock held.

  For each faulted page in mm_struct {
Take ptl lock.
Add pte only if MMF_UNSTABLE flag is not set.
Release ptl lock.
  }
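
Expressed as kernel C, the fault-side step might look roughly like this (a hedged
sketch only; the helper name is hypothetical and the real fix would have to be
wired into every anonymous fault path):

--
/* Hypothetical helper, not actual mm/memory.c code. */
static int install_pte_unless_unstable(struct vm_fault *vmf, pte_t entry)
{
	struct mm_struct *mm = vmf->vma->vm_mm;
	spinlock_t *ptl = pte_lockptr(mm, vmf->pmd);

	spin_lock(ptl);
	if (test_bit(MMF_UNSTABLE, &mm->flags)) {
		/* the OOM reaper has already swept this mm; refuse the refault */
		spin_unlock(ptl);
		return VM_FAULT_SIGBUS;
	}
	set_pte_at(mm, vmf->address, vmf->pte, entry);
	spin_unlock(ptl);
	return 0;
}
--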


Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer

2017-08-04 Thread Tetsuo Handa
Michal Hocko wrote:
> On Fri 04-08-17 17:25:46, Tetsuo Handa wrote:
> > Well, while lockdep warning is gone, this problem is remaining.
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index edabf6f..1e06c29 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3931,15 +3931,14 @@ int handle_mm_fault(struct vm_area_struct *vma, 
> > unsigned long address,
> > /*
> >  * This mm has been already reaped by the oom reaper and so the
> >  * refault cannot be trusted in general. Anonymous refaults would
> > -* lose data and give a zero page instead e.g. This is especially
> > -* problem for use_mm() because regular tasks will just die and
> > -* the corrupted data will not be visible anywhere while kthread
> > -* will outlive the oom victim and potentially propagate the date
> > -* further.
> > +* lose data and give a zero page instead e.g.
> >  */
> > -   if (unlikely((current->flags & PF_KTHREAD) && !(ret & 
> > VM_FAULT_ERROR)
> > -   && test_bit(MMF_UNSTABLE, 
> > &vma->vm_mm->flags)))
> > +   if (unlikely(!(ret & VM_FAULT_ERROR)
> > +&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
> > +   if (ret & VM_FAULT_RETRY)
> > +   down_read(&vma->vm_mm->mmap_sem);
> > ret = VM_FAULT_SIGBUS;
> > +   }
> > 
> > return ret;
> >  }
> 
> I have re-read your email again and I guess I misread previously. Are
> you saying that the data corruption happens with the both patches
> applied?

Yes. Data corruption still happens.


Re: suspicious __GFP_NOMEMALLOC in selinux

2017-08-03 Thread Tetsuo Handa
Michal Hocko wrote:
> On Thu 03-08-17 19:02:57, Tetsuo Handa wrote:
> > On 2017/08/03 17:11, Michal Hocko wrote:
> > > [CC Mel]
> > > 
> > > On Wed 02-08-17 17:45:56, Paul Moore wrote:
> > >> On Wed, Aug 2, 2017 at 6:50 AM, Michal Hocko  wrote:
> > >>> Hi,
> > >>> while doing something completely unrelated to selinux I've noticed a
> > >>> really strange __GFP_NOMEMALLOC usage pattern in selinux, especially
> > >>> GFP_ATOMIC | __GFP_NOMEMALLOC doesn't make much sense to me. GFP_ATOMIC
> > >>> on its own allows to access memory reserves while the later flag tells
> > >>> we cannot use memory reserves at all. The primary usecase for
> > >>> __GFP_NOMEMALLOC is to override a global PF_MEMALLOC should there be a
> > >>> need.
> > >>>
> > >>> It all leads to fa1aa143ac4a ("selinux: extended permissions for
> > >>> ioctls") which doesn't explain this aspect so let me ask. Why is the
> > >>> flag used at all? Moreover shouldn't GFP_ATOMIC be actually GFP_NOWAIT.
> > >>> What makes this path important to access memory reserves?
> > >>
> > >> [NOTE: added the SELinux list to the CC line, please include that list
> > >> when asking SELinux questions]
> > > 
> > > Sorry about that. Will keep it in mind for next posts
> > >  
> > >> The GFP_ATOMIC|__GFP_NOMEMALLOC use in SELinux appears to be limited
> > >> to security/selinux/avc.c, and digging a bit, I'm guessing commit
> > >> fa1aa143ac4a copied the combination from 6290c2c43973 ("selinux: tag
> > >> avc cache alloc as non-critical") and the avc_alloc_node() function.
> > > 
> > > Thanks for the pointer. That makes much more sense now. Back in 2012 we
> > > really didn't have a good way to distinguish non sleeping and atomic
> > > with reserves allocations.
> > >  
> > >> I can't say that I'm an expert at the vm subsystem and the variety of
> > >> different GFP_* flags, but your suggestion of moving to GFP_NOWAIT in
> > >> security/selinux/avc.c seems reasonable and in keeping with the idea
> > >> behind commit 6290c2c43973.
> > > 
> > > What do you think about the following? I haven't tested it but it should
> > > be rather straightforward.
> > 
> > Why not at least __GFP_NOWARN ?
> 
> This would require an additional justification.

If allocation failure is not a problem, printing allocation failure messages
is nothing but noise.

> 
> > And why not also __GFP_NOMEMALLOC ?
> 
> What would be the purpose of __GFP_NOMEMALLOC? In other words which
> context would set PF_NOMEMALLOC so that the flag would override it?
> 

When the allocating thread is selected as an OOM victim, it gets TIF_MEMDIE.
Since that function might be called from !in_interrupt() context, it is
possible that gfp_pfmemalloc_allowed() returns true due to TIF_MEMDIE, and
the OOM victim will dip into memory reserves even when allocation failure
is not a problem.

Thus, I think that the majority of plain GFP_NOWAIT users want to use
GFP_NOWAIT | __GFP_NOWARN | __GFP_NOMEMALLOC.
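
As a concrete illustration (an assumed call site, not a quote from selinux or
any other subsystem), such an opportunistic allocation would look like:

	/*
	 * No sleeping, no allocation failure warning, and no dipping into
	 * memory reserves even if current happens to be an OOM victim.
	 */
	ptr = kmalloc(size, GFP_NOWAIT | __GFP_NOWARN | __GFP_NOMEMALLOC);
	if (!ptr)
		return NULL;	/* the caller tolerates the failure */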


Re: suspicious __GFP_NOMEMALLOC in selinux

2017-08-03 Thread Tetsuo Handa
On 2017/08/03 17:11, Michal Hocko wrote:
> [CC Mel]
> 
> On Wed 02-08-17 17:45:56, Paul Moore wrote:
>> On Wed, Aug 2, 2017 at 6:50 AM, Michal Hocko  wrote:
>>> Hi,
>>> while doing something completely unrelated to selinux I've noticed a
>>> really strange __GFP_NOMEMALLOC usage pattern in selinux, especially
>>> GFP_ATOMIC | __GFP_NOMEMALLOC doesn't make much sense to me. GFP_ATOMIC
>>> on its own allows to access memory reserves while the later flag tells
>>> we cannot use memory reserves at all. The primary usecase for
>>> __GFP_NOMEMALLOC is to override a global PF_MEMALLOC should there be a
>>> need.
>>>
>>> It all leads to fa1aa143ac4a ("selinux: extended permissions for
>>> ioctls") which doesn't explain this aspect so let me ask. Why is the
>>> flag used at all? Moreover shouldn't GFP_ATOMIC be actually GFP_NOWAIT.
>>> What makes this path important to access memory reserves?
>>
>> [NOTE: added the SELinux list to the CC line, please include that list
>> when asking SELinux questions]
> 
> Sorry about that. Will keep it in mind for next posts
>  
>> The GFP_ATOMIC|__GFP_NOMEMALLOC use in SELinux appears to be limited
>> to security/selinux/avc.c, and digging a bit, I'm guessing commit
>> fa1aa143ac4a copied the combination from 6290c2c43973 ("selinux: tag
>> avc cache alloc as non-critical") and the avc_alloc_node() function.
> 
> Thanks for the pointer. That makes much more sense now. Back in 2012 we
> really didn't have a good way to distinguish non sleeping and atomic
> with reserves allocations.
>  
>> I can't say that I'm an expert at the vm subsystem and the variety of
>> different GFP_* flags, but your suggestion of moving to GFP_NOWAIT in
>> security/selinux/avc.c seems reasonable and in keeping with the idea
>> behind commit 6290c2c43973.
> 
> What do you think about the following? I haven't tested it but it should
> be rather straightforward.

Why not at least __GFP_NOWARN ? And why not also __GFP_NOMEMALLOC ?
http://lkml.kernel.org/r/201706302210.gca05089.mffotqvjsol...@i-love.sakura.ne.jp



Re: [PATCH 1/2] mm, oom: do not rely on TIF_MEMDIE for memory reserves access

2017-08-03 Thread Tetsuo Handa
Michal Hocko wrote:
> Look, I really appreciate your sentiment for for nommu platform but with
> an absolute lack of _any_ oom reports on that platform that I am aware
> of nor any reports about lockups during oom I am less than thrilled to
> add a code to fix a problem which even might not exist. Nommu is usually
> very special with a very specific workload running (e.g. no overcommit)
> so I strongly suspect that any OOM theories are highly academic.

If you believe that there is really no oom report, get rid of the OOM
killer completely.

> 
> All I do care about is to not regress nommu as much as possible. So can
> we get back to the proposed patch and updates I have done to address
> your review feedback please?

No, unless we get rid of the OOM killer when CONFIG_MMU=n.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 170db4d..e931969 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3312,7 +3312,8 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, 
const char *fmt, ...)
goto out;
 
/* Exhausted what can be done so it's blamo time */
-   if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
+   if ((IS_ENABLED(CONFIG_MMU) && out_of_memory(&oc)) ||
+   WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
*did_some_progress = 1;
 
/*


Re: [PATCH 1/2] mm, oom: do not rely on TIF_MEMDIE for memory reserves access

2017-08-02 Thread Tetsuo Handa
Michal Hocko wrote:
> On Wed 02-08-17 00:30:33, Tetsuo Handa wrote:
> > > @@ -3603,6 +3612,22 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> > >   return alloc_flags;
> > >  }
> > >  
> > > +static bool oom_reserves_allowed(struct task_struct *tsk)
> > > +{
> > > + if (!tsk_is_oom_victim(tsk))
> > > + return false;
> > > +
> > > + /*
> > > +  * !MMU doesn't have oom reaper so we shouldn't risk the memory reserves
> > > +  * depletion and shouldn't give access to memory reserves passed the
> > > +  * exit_mm
> > > +  */
> > > + if (!IS_ENABLED(CONFIG_MMU) && !tsk->mm)
> > > + return false;
> > 
> > Branching based on CONFIG_MMU is ugly. I suggest timeout based next OOM
> > victim selection if CONFIG_MMU=n.
> 
> I suggest we do not argue about nommu without actually optimizing for or
> fixing nommu which we are not here. I am even not sure memory reserves
> can ever be depleted for that config.

I don't think memory reserves can be depleted in a CONFIG_MMU=n environment.
But the reason the OOM reaper was introduced is not limited to handling
depletion of memory reserves. The OOM reaper was introduced because
OOM victims might get stuck indirectly waiting for other threads doing
memory allocation. You said

  > Yes, exit_aio is the only blocking call I know of currently. But I would
  > like this to be as robust as possible and so I do not want to rely on
  > the current implementation. This can change in future and I can
  > guarantee that nobody will think about the oom path when adding
  > something to the final __mmput path.

at http://lkml.kernel.org/r/20170726054533.ga...@dhcp22.suse.cz , but
how can you guarantee that nobody will think about the oom path when adding
something to the final __mmput() path, without also considering the
possibility of getting stuck waiting for memory allocation in a
CONFIG_MMU=n environment? As long as the possibility of getting stuck
remains, you should not assume that something you don't want will never
happen. It's time to make CONFIG_MMU=n kernels treatable in the same way
as CONFIG_MMU=y kernels.

If it is technically impossible (or not worthwhile) to implement
the OOM reaper for CONFIG_MMU=n kernels, I'm fine with a timeout based
approach like the one shown below. Then, we no longer need to use branching
based on CONFIG_MMU.

 include/linux/mm_types.h |  3 +++
 mm/oom_kill.c| 20 +++-
 2 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7f384bb..374a2ae 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -504,6 +504,9 @@ struct mm_struct {
atomic_long_t hugetlb_usage;
 #endif
struct work_struct async_put_work;
+#ifndef CONFIG_MMU
+   unsigned long oom_victim_wait_timer;
+#endif
 } __randomize_layout;
 
 extern struct mm_struct init_mm;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9e8b4f0..dd6239d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -53,6 +53,17 @@
 
 DEFINE_MUTEX(oom_lock);
 
+static bool should_ignore_this_mm(struct mm_struct *mm)
+{
+#ifndef CONFIG_MMU
+   if (!mm->oom_victim_wait_timer)
+   mm->oom_victim_wait_timer = jiffies;
+   else if (time_after(jiffies, mm->oom_victim_wait_timer + HZ))
+   return true;
+#endif
+   return test_bit(MMF_OOM_SKIP, &mm->flags);
+};
+
 #ifdef CONFIG_NUMA
 /**
  * has_intersects_mems_allowed() - check task eligiblity for kill
@@ -188,9 +199,8 @@ unsigned long oom_badness(struct task_struct *p, struct 
mem_cgroup *memcg,
 * the middle of vfork
 */
adj = (long)p->signal->oom_score_adj;
-   if (adj == OOM_SCORE_ADJ_MIN ||
-   test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
-   in_vfork(p)) {
+   if (adj == OOM_SCORE_ADJ_MIN || should_ignore_this_mm(p->mm) ||
+   in_vfork(p)) {
task_unlock(p);
return 0;
}
@@ -303,7 +313,7 @@ static int oom_evaluate_task(struct task_struct *task, void 
*arg)
 * any memory is quite low.
 */
if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
-   if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
+   if (should_ignore_this_mm(task->signal->oom_mm))
goto next;
goto abort;
}
@@ -783,7 +793,7 @@ static bool task_will_free_mem(struct task_struct *task)
 * This task has already been drained by the oom reaper so there are
 * only small chances it will free some more
 */
-   if (test_bit(MMF_OOM_SKIP, &mm->flags))
+   if (should_ignore_this_mm(mm))
return false;
 
if (atomic_read(&mm->mm_users) <= 1)
-- 
1.8.3.1


Re: [PATCH 1/2] mm, oom: do not rely on TIF_MEMDIE for memory reserves access

2017-08-01 Thread Tetsuo Handa
Michal Hocko wrote:
> CONFIG_MMU=n doesn't have oom reaper so let's stick to the original
> ALLOC_NO_WATERMARKS approach but be careful because they still might
> deplete all the memory reserves so keep the semantic as close to the
> original implementation as possible and give them access to memory
> reserves only up to exit_mm (when tsk->mm is cleared) rather than while
> tsk_is_oom_victim which is until signal struct is gone.

Currently memory allocations from __mmput() can use memory reserves but
this patch changes __mmput() not to use memory reserves. You say "keep
the semantic as close to the original implementation as possible" but
this change is not guaranteed to be safe.

> @@ -2943,10 +2943,19 @@ bool __zone_watermark_ok(struct zone *z, unsigned int 
> order, unsigned long mark,
>* the high-atomic reserves. This will over-estimate the size of the
>* atomic reserve but it avoids a search.
>*/
> - if (likely(!alloc_harder))
> + if (likely(!alloc_harder)) {
>   free_pages -= z->nr_reserved_highatomic;
> - else
> - min -= min / 4;
> + } else {
> + /*
> +  * OOM victims can try even harder than normal ALLOC_HARDER
> +  * users
> +  */
> + if (alloc_flags & ALLOC_OOM)

ALLOC_OOM is ALLOC_NO_WATERMARKS if CONFIG_MMU=n.
I wonder whether this test makes sense for ALLOC_NO_WATERMARKS.

> + min -= min / 2;
> + else
> + min -= min / 4;
> + }
> +
>  
>  #ifdef CONFIG_CMA
>   /* If allocation can't use CMA areas don't use free CMA pages */
> @@ -3603,6 +3612,22 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>   return alloc_flags;
>  }
>  
> +static bool oom_reserves_allowed(struct task_struct *tsk)
> +{
> + if (!tsk_is_oom_victim(tsk))
> + return false;
> +
> + /*
> +  * !MMU doesn't have oom reaper so we shouldn't risk the memory reserves
> +  * depletion and shouldn't give access to memory reserves passed the
> +  * exit_mm
> +  */
> + if (!IS_ENABLED(CONFIG_MMU) && !tsk->mm)
> + return false;

Branching based on CONFIG_MMU is ugly. I suggest timeout-based selection of
the next OOM victim if CONFIG_MMU=n. Then, we no longer need to worry about
depletion of memory reserves, and we can treat both configurations equally.

> +
> + return true;
> +}
> +
>  bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  {
>   if (unlikely(gfp_mask & __GFP_NOMEMALLOC))

> @@ -3770,6 +3795,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int 
> order,
>   unsigned long alloc_start = jiffies;
>   unsigned int stall_timeout = 10 * HZ;
>   unsigned int cpuset_mems_cookie;
> + bool reserves;
>  
>   /*
>* In the slowpath, we sanity check order to avoid ever trying to
> @@ -3875,15 +3901,24 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int 
> order,
>   if (gfp_mask & __GFP_KSWAPD_RECLAIM)
>   wake_all_kswapds(order, ac);
>  
> - if (gfp_pfmemalloc_allowed(gfp_mask))
> - alloc_flags = ALLOC_NO_WATERMARKS;
> + /*
> +  * Distinguish requests which really need access to whole memory
> +  * reserves from oom victims which can live with their own reserve
> +  */
> + reserves = gfp_pfmemalloc_allowed(gfp_mask);
> + if (reserves) {
> + if (tsk_is_oom_victim(current))
> + alloc_flags = ALLOC_OOM;

If reserves == true due to reasons other than tsk_is_oom_victim(current) == true
(e.g. __GFP_MEMALLOC), why dare to reduce it?

> + else
> + alloc_flags = ALLOC_NO_WATERMARKS;
> + }

If CONFIG_MMU=n, doing this test is silly.

if (tsk_is_oom_victim(current))
alloc_flags = ALLOC_NO_WATERMARKS;
else
alloc_flags = ALLOC_NO_WATERMARKS;

>  
>   /*
>* Reset the zonelist iterators if memory policies can be ignored.
>* These allocations are high priority and system rather than user
>* orientated.
>*/
> - if (!(alloc_flags & ALLOC_CPUSET) || (alloc_flags & 
> ALLOC_NO_WATERMARKS)) {
> + if (!(alloc_flags & ALLOC_CPUSET) || reserves) {
>   ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
>   ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
>   ac->high_zoneidx, ac->nodemask);
> @@ -3960,7 +3995,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int 
> order,
>   goto got_pg;
>  
>   /* Avoid allocations with no watermarks from looping endlessly */
> - if (test_thread_flag(TIF_MEMDIE) &&
> + if (tsk_is_oom_victim(current) &&
>   (alloc_flags == ALLOC_NO_WATERMARKS ||
>(gfp_mask & __GFP_NOMEMALLOC)))
>   goto nopage;

And you are silently changing to "!costly __GFP_DIRECT_RECLAIM allocations never fail
(even selected for OOM victims)" (i.e. updating the too small to fail memory allocation
rule) by doing alloc_flags

Re: Possible race condition in oom-killer

2017-08-01 Thread Tetsuo Handa
Michal Hocko wrote:
>   Once we merge [1] then the oom victim wouldn't
> need to get TIF_MEMDIE to access memory reserves.
> 
> [1] http://lkml.kernel.org/r/20170727090357.3205-2-mho...@kernel.org

False. We are not setting oom_mm to all thread groups (!CLONE_THREAD) sharing
that mm (CLONE_VM). Thus, one thread from each thread group sharing that mm
will have to call out_of_memory() in order to set oom_mm, and they will find
task_will_free_mem() returning false due to MMF_OOM_SKIP already being set,
and will therefore go on to select the next OOM victim.


Re: Possible race condition in oom-killer

2017-08-01 Thread Tetsuo Handa
Michal Hocko wrote:
> > >   Is
> > > something other than the LTP test affected to give this more priority?
> > > Do we have other usecases where something mlocks the whole memory?
> > 
> > This panic was caused by 50 threads sharing MMF_OOM_SKIP mm exceeding
> > number of OOM killable processes. Whether memory is locked or not isn't
> > important.
> 
> You are wrong here I believe. The whole problem is that the OOM victim
> is consuming basically all the memory (that is what the test case
> actually does IIRC) and that memory is mlocked. oom_reaper is much
> faster to evaluate the mm of the victim and bail out sooner than the
> exit path actually manages to tear down the address space. And so we
> have to find other oom victims until we simply kill everything and
> panic.

Again, whether memory is locked or not isn't important. I can easily
reproduce unnecessary OOM victim selection as a local unprivileged user
using the program below.

--
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>

#define NUMTHREADS 128
#define MMAPSIZE 128 * 1048576
#define STACKSIZE 4096
static int pipe_fd[2] = { EOF, EOF };
static int memory_eater(void *unused)
{
int fd = open("/dev/zero", O_RDONLY);
char *buf = mmap(NULL, MMAPSIZE, PROT_WRITE | PROT_READ,
 MAP_ANONYMOUS | MAP_SHARED, EOF, 0);
read(pipe_fd[0], buf, 1);
read(fd, buf, MMAPSIZE);
pause();
return 0;
}
int main(int argc, char *argv[])
{
int i;
char *stack;
if (fork() || fork() || setsid() == EOF || pipe(pipe_fd))
_exit(0);
stack = mmap(NULL, STACKSIZE * NUMTHREADS, PROT_WRITE | PROT_READ,
 MAP_ANONYMOUS | MAP_SHARED, EOF, 0);
for (i = 0; i < NUMTHREADS; i++)
if (clone(memory_eater, stack + (i + 1) * STACKSIZE,
  CLONE_THREAD | CLONE_SIGHAND | CLONE_VM | CLONE_FS |
  CLONE_FILES, NULL) == -1)
break;
sleep(1);
close(pipe_fd[1]);
pause();
return 0;
}
--

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170801-2.txt.xz :
--
[  237.792768] [ pid ]   uid  tgid total_vm  rss nr_ptes nr_pmds swapents 
oom_score_adj name
[  237.795575] [  451] 0   451 9206  639  21   30   
  0 systemd-journal
[  237.798515] [  478] 0   47811138  740  25   30   
  -1000 systemd-udevd
[  237.801430] [  488] 0   48813856  100  26   30   
  -1000 auditd
[  237.804212] [  592]81   592 6135  119  18   30   
   -900 dbus-daemon
[  237.807166] [  668] 0   668 1094   23   8   30   
  0 rngd
[  237.809927] [  671]70   671 7029   75  19   40   
  0 avahi-daemon
[  237.812809] [  672] 0   67253144  402  57   40   
  0 abrtd
[  237.815611] [  675] 0   67526372  246  54   30   
  -1000 sshd
[  237.818358] [  679] 0   67952573  341  54   30   
  0 abrt-watch-log
[  237.821274] [  680] 0   680 6050   79  17   30   
  0 systemd-logind
[  237.824279] [  683] 0   683 4831   82  16   30   
  0 irqbalance
[  237.827119] [  698] 0   69856014  630  40   40   
  0 rsyslogd
[  237.829929] [  715]70   715 6997   59  18   40   
  0 avahi-daemon
[  237.832799] [  832] 0   83265453  228  44   30   
  0 vmtoolsd
[  237.835605] [  852] 0   85257168  353  58   30   
  0 vmtoolsd
[  237.838409] [  909] 0   90931558  155  20   30   
  0 crond
[  237.841160] [  986] 0   98684330  393 114   40   
  0 nmbd
[  237.843878] [ 1041] 0  104123728  168  51   30   
  0 login
[  237.846623] [ 2019] 0  201922261  252  43   30   
  0 master
[  237.849307] [ 2034] 0  203427511   33  13   30   
  0 agetty
[  237.851977] [ 2100]89  210022287  250  45   30   
  0 pickup
[  237.854607] [ 2101]89  210122304  251  45   30   
  0 qmgr
[  237.857179] [ 2597] 0  2597   102073  568 150   30   
  0 smbd
[  237.859773] [ 3905]  1000  390528885  133  15   40   
  0 bash
[  237.862337] [ 3952] 0  395227511   32  10   30   
  0 agetty
[  237.864905] [ 4772] 0  4772   102073  56

Re: Possible race condition in oom-killer

2017-07-28 Thread Tetsuo Handa
Michal Hocko wrote:
> On Fri 28-07-17 22:55:51, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Fri 28-07-17 22:15:01, Tetsuo Handa wrote:
> > > > task_will_free_mem(current) in out_of_memory() returning false due to
> > > > MMF_OOM_SKIP already set allowed each thread sharing that mm to select 
> > > > a new
> > > > OOM victim. If task_will_free_mem(current) in out_of_memory() did not 
> > > > return
> > > > false, threads sharing MMF_OOM_SKIP mm would not have selected new 
> > > > victims
> > > > to the level where all OOM killable processes are killed and calls 
> > > > panic().
> > > 
> > > I am not sure I understand. Do you mean this?
> > 
> > Yes.
> > 
> > > ---
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index 9e8b4f030c1c..671e4a4107d0 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -779,13 +779,6 @@ static bool task_will_free_mem(struct task_struct 
> > > *task)
> > >   if (!__task_will_free_mem(task))
> > >   return false;
> > >  
> > > - /*
> > > -  * This task has already been drained by the oom reaper so there are
> > > -  * only small chances it will free some more
> > > -  */
> > > - if (test_bit(MMF_OOM_SKIP, &mm->flags))
> > > - return false;
> > > -
> > >   if (atomic_read(&mm->mm_users) <= 1)
> > >   return true;
> > >  
> > > If yes I would have to think about this some more because that might
> > > have weird side effects (e.g. oom_victims counting after threads passed
> > > exit_oom_victim).
> > 
> > But this check should not be removed unconditionally. We should still return
> > false if returning true was not sufficient to solve the OOM situation, for
> > we need to select next OOM victim in that case.
> > 

I think that the patch below can manage this race condition.

---
 include/linux/sched.h |  1 +
 mm/oom_kill.c | 21 ++---
 2 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0db4870..3fccf72 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -652,6 +652,7 @@ struct task_struct {
/* disallow userland-initiated cgroup migration */
unsignedno_cgroup_migration:1;
 #endif
+   unsignedoom_kill_free_check_raced:1;
 
unsigned long   atomic_flags; /* Flags requiring atomic 
access. */
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9e8b4f0..a093193 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -779,13 +779,6 @@ static bool task_will_free_mem(struct task_struct *task)
if (!__task_will_free_mem(task))
return false;
 
-   /*
-* This task has already been drained by the oom reaper so there are
-* only small chances it will free some more
-*/
-   if (test_bit(MMF_OOM_SKIP, &mm->flags))
-   return false;
-
if (atomic_read(&mm->mm_users) <= 1)
return true;
 
@@ -806,6 +799,20 @@ static bool task_will_free_mem(struct task_struct *task)
}
rcu_read_unlock();
 
+   /*
+* It is possible that current thread fails to try allocation from
+* memory reserves if the OOM reaper set MMF_OOM_SKIP on this mm before
+* current thread calls out_of_memory() in order to get TIF_MEMDIE.
+* In that case, allow current thread to try TIF_MEMDIE allocation
+* before start selecting next OOM victims.
+*/
+   if (ret && test_bit(MMF_OOM_SKIP, &mm->flags)) {
+   if (task == current && !task->oom_kill_free_check_raced)
+   task->oom_kill_free_check_raced = true;
+   else
+   ret = false;
+   }
+
return ret;
 }
 
-- 
1.8.3.1

What is "oom_victims counting after threads passed exit_oom_victim" ?


Re: Possible race condition in oom-killer

2017-07-28 Thread Tetsuo Handa
Michal Hocko wrote:
> On Fri 28-07-17 22:15:01, Tetsuo Handa wrote:
> > task_will_free_mem(current) in out_of_memory() returning false due to
> > MMF_OOM_SKIP already set allowed each thread sharing that mm to select a new
> > OOM victim. If task_will_free_mem(current) in out_of_memory() did not return
> > false, threads sharing MMF_OOM_SKIP mm would not have selected new victims
> > to the level where all OOM killable processes are killed and calls panic().
> 
> I am not sure I understand. Do you mean this?

Yes.

> ---
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 9e8b4f030c1c..671e4a4107d0 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -779,13 +779,6 @@ static bool task_will_free_mem(struct task_struct *task)
>   if (!__task_will_free_mem(task))
>   return false;
>  
> - /*
> -  * This task has already been drained by the oom reaper so there are
> -  * only small chances it will free some more
> -  */
> - if (test_bit(MMF_OOM_SKIP, &mm->flags))
> - return false;
> -
>   if (atomic_read(&mm->mm_users) <= 1)
>   return true;
>  
> If yes I would have to think about this some more because that might
> have weird side effects (e.g. oom_victims counting after threads passed
> exit_oom_victim).

But this check should not be removed unconditionally. We should still return
false if returning true was not sufficient to solve the OOM situation, for
we need to select next OOM victim in that case.

> 
> Anyway the proper fix for this is to allow reaping mlocked pages.

A different approach is to set TIF_MEMDIE on all threads sharing the same mm,
so that threads sharing an MMF_OOM_SKIP mm do not need to call
out_of_memory() in order to get TIF_MEMDIE.
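
A conceptual sketch of that idea (not a tested patch; locking details are omitted):

--
static void mark_all_sharers(struct mm_struct *mm)
{
	struct task_struct *p, *t;

	rcu_read_lock();
	for_each_process_thread(p, t)
		if (t->mm == mm)
			/* give every sharer reserve access up front */
			set_tsk_thread_flag(t, TIF_MEMDIE);
	rcu_read_unlock();
}
--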

Yet another approach is to use __GFP_KILLABLE (we could start with it on a
best-effort basis).

>   Is
> something other than the LTP test affected to give this more priority?
> Do we have other usecases where something mlocks the whole memory?

This panic was caused by the 50 threads sharing the MMF_OOM_SKIP mm
outnumbering the OOM-killable processes, so that each of them went on to
select a new OOM victim. Whether the memory is locked or not isn't
important. If a multi-threaded process which consumes little memory were
selected as an OOM victim (and reaped by the OOM reaper so that MMF_OOM_SKIP
was set immediately), it would still be possible to needlessly select
further OOM victims.


Re: Possible race condition in oom-killer

2017-07-28 Thread Tetsuo Handa
Michal Hocko wrote:
> > 4578 is consuming memory as mlocked pages. But the OOM reaper cannot reclaim
> > mlocked pages (i.e. can_madv_dontneed_vma() returns false due to 
> > VM_LOCKED), can it?
> 
> You are absolutely right. I am pretty sure I've checked mlocked counter
> as the first thing but that must be from one of the earlier oom reports.
> My fault I haven't checked it in the critical one
> 
> [  365.267347] oom_reaper: reaped process 4578 (oom02), now 
> anon-rss:131559616kB, file-rss:0kB, shmem-rss:0kB
> [  365.282658] oom_reaper: reaped process 4583 (oom02), now 
> anon-rss:131561664kB, file-rss:0kB, shmem-rss:0kB
> 
> and the above screemed about the fact I was just completely blind.
> 
> mlock pages handling is on my todo list for quite some time already but
> I didn't get around it to implement that. mlock code is very tricky.

task_will_free_mem(current) in out_of_memory() returning false due to
MMF_OOM_SKIP already set allowed each thread sharing that mm to select a new
OOM victim. If task_will_free_mem(current) in out_of_memory() did not return
false, threads sharing MMF_OOM_SKIP mm would not have selected new victims
to the level where all OOM killable processes are killed and calls panic().


Re: Possible race condition in oom-killer

2017-07-28 Thread Tetsuo Handa
(Oops. Forgot to add CC.)

On 2017/07/28 21:32, Michal Hocko wrote:
> [CC linux-mm]
>
> On Fri 28-07-17 17:22:25, Manish Jaggi wrote:
>> was: Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
>>
>> Hi Michal,
>> On 7/27/2017 2:54 PM, Michal Hocko wrote:
>>> On Thu 27-07-17 13:59:09, Manish Jaggi wrote:
>>> [...]
 With 4.11.6 I was getting random kernel panics (Out of memory - No process 
 left to kill),
  when running LTP oom01 /oom02 ltp tests on our arm64 hardware with ~256G 
 memory and high core count.
 The issue experienced was as follows
that either test (oom01/oom02) selected a pid as victim and waited for 
 the pid to be killed.
that pid was marked as killed but somewhere there is a race and the 
 process didnt get killed.
and the oom01/oom02 test started killing further processes, till it 
 panics.
 IIUC this issue is quite similar to your patch description. But applying 
 your patch I still see the issue.
 If it is not related to this patch, can you please suggest by looking at 
 the log, what could be preventing
 the killing of victim.

 Log (https://pastebin.com/hg5iXRj2)

 As a subtest of oom02 starts, it prints out the victim - In this case 4578

 oom02   0  TINFO  :  start OOM testing for mlocked pages.
 oom02   0  TINFO  :  expected victim is 4578.

 When oom02 thread invokes oom-killer, it did select 4578  for killing...
>>> I will definitely have a look. Can you report it in a separate email
>>> thread please? Are you able to reproduce with the current Linus or
>>> linux-next trees?
>> Yes this issue is visible with linux-next.
>
> Could you provide the full kernel log from this run please? I do not
> expect there to be much difference but just to be sure that the code I
> am looking at matches logs.

4578 is consuming memory as mlocked pages. But the OOM reaper cannot reclaim
mlocked pages (i.e. can_madv_dontneed_vma() returns false due to VM_LOCKED), 
can it?

oom02   0  TINFO  :  start OOM testing for mlocked pages.
oom02   0  TINFO  :  expected victim is 4578.
[  365.267347] oom_reaper: reaped process 4578 (oom02), now 
anon-rss:131559616kB, file-rss:0kB, shmem-rss:0kB

As a result, MMF_OOM_SKIP is set without reclaiming much memory.
Thus, it is natural that subsequent OOM victims are selected immediately, because
almost all memory is still in use. Since 4578 is multi-threaded (isn't it?),
it will take time to reach the final __mmput() because mm->mm_users is large.
Since there are many threads, it is possible that all OOM-killable processes are
killed before the final __mmput() of 4578 (which releases the mlocked pages) is called.
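
For reference, the check mentioned above is roughly the following (paraphrased
from mm/madvise.c of that era, so treat it as a sketch rather than a verbatim
quote):

--
/*
 * VM_LOCKED makes this return false, so the reaper skips exactly the
 * mlocked VMAs that hold nearly all of the victim's memory.
 */
static bool can_madv_dontneed_vma(struct vm_area_struct *vma)
{
	return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
}
--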

>
> [...]
 [  365.283361] oom02:4586 invoked oom-killer: 
 gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=1,  
 order=0, oom_score_adj=0
>>> Yes because
>>> [  365.283499] Node 1 Normal free:19500kB min:33804kB low:165916kB 
>>> high:298028kB active_anon:13312kB inactive_anon:172kB active_file:0kB 
>>> inactive_file:1044kB unevictable:131560064kB writepending:0kB 
>>> present:134213632kB managed:132113248kB mlocked:131560064kB 
>>> slab_reclaimable:5748kB slab_unreclaimable:17808kB kernel_stack:2720kB 
>>> pagetables:254636kB bounce:0kB free_pcp:10476kB local_pcp:144kB free_cma:0kB
>>>
>>> Although we have killed and reaped oom02 process Node1 is still below
>>> min watermark and that is why we have hit the oom killer again. It
>>> is not immediatelly clear to me why, that would require a deeper
>>> inspection.
>> I have a doubt here
>> my understanding of oom test: oom() function basically forks itself and
>> starts n threads each thread has a loop which allocates and touches memory
>> thus will trigger oom-killer and will kill the process. the parent process
>> is on a wait() and will print pass/fail.
>>
>> So IIUC when 4578 is reaped all the child threads should be terminated,
>> which happens in pass case (line 152)
>> But even after being killed and reaped,  the oom killer is invoked again
>> which doesn't seem right.
>
> As I've said the OOM killer hits because the memory from Node 1 didn't
> get freed for some reasov or got immediatally populated.

Because the mlocked pages belong to a multi-threaded process, it will take
time to reclaim them.

>
>> Could it be that the process is just marked hidden from oom including its
>> threads, thus oom-killer continues.
>
> The whole process should be killed and the OOM reaper should only mark
> the victim oom invisible _after_ the address space has been reaped (and
> memory freed). You said the patch from
> http://lkml.kernel.org/r/20170724072332.31903-1-mho...@kernel.org didn't
> help so it shouldn't be a race with the last __mmput.
>
> Thanks!
>



Re: [PATCH 2/2] mm: replace TIF_MEMDIE checks by tsk_is_oom_victim

2017-07-27 Thread Tetsuo Handa
Tetsuo Handa wrote:
> Michal Hocko wrote:
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 544d47e5cbbd..86a48affb938 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1896,7 +1896,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t 
> > gfp_mask,
> >  * bypass the last charges so that they can exit quickly and
> >  * free their memory.
> >  */
> > -   if (unlikely(test_thread_flag(TIF_MEMDIE) ||
> > +   if (unlikely(tsk_is_oom_victim(current) ||
> >  fatal_signal_pending(current) ||
> >  current->flags & PF_EXITING))
> > goto force;
> 
> Did we check http://lkml.kernel.org/r/20160909140508.go4...@dhcp22.suse.cz ?
> 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index c9f3569a76c7..65cc2f9aaa05 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -483,7 +483,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, 
> > struct mm_struct *mm)
> >  *  [...]
> >  *  out_of_memory
> >  *select_bad_process
> > -*  # no TIF_MEMDIE task selects new 
> > victim
> > +*  # no TIF_MEMDIE, selects new victim
> >  *  unmap_page_range # frees some memory
> >  */
> > mutex_lock(&oom_lock);
> 
> This comment is wrong. No MMF_OOM_SKIP mm selects new victim.
> 
Oops. "MMF_OOM_SKIP mm selects new victim." according to
http://lkml.kernel.org/r/201706271952.feb21375.sfjfhoqlotv...@i-love.sakura.ne.jp
 .

