subject:"\"\\\[mm 4.15\\\-rc8\\\] Random oopses under memory pressure.\""

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-22 Thread Linus Torvalds

On Mon, Jan 22, 2018 at 5:26 AM, Rasmus Villemoes
 wrote:
> On 2018-01-19 19:42, Linus Torvalds wrote:
>>
>> I actually asked (long long ago) for an optinal compiler warning for
>> "pointer subtraction with non-power-of-2 sizes". Not because of it
>> being undefined, but simply because it's expensive. The
>> divide->multiply thing doesn't always work,
>
> Huh? If (compile-time constant, positive) d=m*2^k with m odd, and x is
> known to be a multiple of d, x/d can always be computed as (x>>k)*m' ==
> (x*m')>>k, with m' being the mod 2^N multiplicative inverse of m, right?

Hmm. I have definitely seen gcc generate the difference with a divide.

But it may be less of a "can't do it with a multiply" and more of a
"not all versions of gcc have the multiplication optimization".

Linus

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-22 Thread Rasmus Villemoes

On 2018-01-19 19:42, Linus Torvalds wrote:
> 
> I actually asked (long long ago) for an optinal compiler warning for
> "pointer subtraction with non-power-of-2 sizes". Not because of it
> being undefined, but simply because it's expensive. The
> divide->multiply thing doesn't always work,

Huh? If (compile-time constant, positive) d=m*2^k with m odd, and x is
known to be a multiple of d, x/d can always be computed as (x>>k)*m' ==
(x*m')>>k, with m' being the mod 2^N multiplicative inverse of m, right?
This works whether x has signed or unsigned type and whether it indeed
happens to be negative, AFAICT.

Sure, the multiplication by m' may not exactly be cheap, but one only
needs the low N bits of the result.

Rasmus

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-20 Thread Luc Van Oostenryck

On Sat, Jan 20, 2018 at 05:24:32AM +, Al Viro wrote:
> On Sat, Jan 20, 2018 at 02:02:37AM +, Al Viro wrote:
> 
> > Note that those sizes are rather sensitive to lockdep, spinlock debugging, 
> > etc.
> 
> That they certainly are: on one of the testing .config I'm using it gave this:
>1104 sizeof struct page = 56

Yes, I get this already with defconfig.
It's a problem with sparse which ignore the alignment attribute
(in fact all 'trailing' attributes in type declarations).

I'm looking to fix it.

-- Luc Van Oostenryck

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-19 Thread Al Viro

On Sat, Jan 20, 2018 at 02:02:37AM +, Al Viro wrote:

> Note that those sizes are rather sensitive to lockdep, spinlock debugging, 
> etc.

That they certainly are: on one of the testing .config I'm using it gave this:
   1104 sizeof struct page = 56
 81 sizeof struct cpufreq_frequency_table = 12
 32 sizeof struct Indirect = 24
  7 sizeof struct zone = 1400
  7 sizeof struct hstate = 152
  6 sizeof struct lock_class = 336
  6 sizeof struct hpet_dev = 152
  6 sizeof struct ext4_extent = 12
  4 sizeof struct ext4_extent_idx = 12
  3 sizeof struct mbox_chan = 456
  2 sizeof struct strip_zone = 24
  2 sizeof struct kobj_attribute = 48
  2 sizeof struct kernel_param = 40
  2 sizeof struct exception_table_entry = 12
  1 sizeof struct vif_device = 104
  1 sizeof struct unixware_slice = 12
  1 sizeof struct svc_pool = 152
  1 sizeof struct srcu_node = 152
  1 sizeof struct r5worker_group = 56
  1 sizeof struct pebs_record_core = 144
  1 sizeof struct netdev_queue = 384
  1 sizeof struct mirror = 40
  1 sizeof struct mif_device = 56
  1 sizeof struct e1000_tx_ring = 48
  1 sizeof struct dx_frame = 24
  1 sizeof struct bts_record = 24
  1 sizeof struct ata_device = 2560
  1 sizeof struct acpi_processor_cx = 52

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-19 Thread Al Viro

On Fri, Jan 19, 2018 at 02:53:25PM -0800, Linus Torvalds wrote:

> It would probably be good to add the size too, just to explain why
> it's potentially expensive.
> 
> That said, apparently we do have hundreds of them, with just
> cpufreq_frequency_table having a ton. Maybe some are hidden in macros
> and removing one removes a lot.

cpufreq_table_find_index_...(), mostly.

> The real problem is that sometimes the subtraction is simply the right
> thing to do, and there's no sane way to say "yeah, this is one of
> those cases you shouldn't warn about".

FWIW, the sizes of the most common ones are
 91 sizeof struct cpufreq_frequency_table = 12
Almost all of those come from
cpufreq_for_each_valid_entry(pos, table)
if ()
return pos - table;
and I wonder if we would be better off with something like
#define cpufreq_for_each_valid_entry(pos, table, idx)   
\
for (pos = table, idx = 0; pos->frequency != CPUFREQ_TABLE_END; pos++, 
idx++)   \
if (pos->frequency == CPUFREQ_ENTRY_INVALID)\
continue;   \
else
so that those loops would become
cpufreq_for_each_valid_entry(pos, table, idx)
if ()
return idx;
 36 sizeof struct Indirect = 24
 21 sizeof struct ips_scb = 216
 18 sizeof struct runlist_element = 24
 13 sizeof struct zone = 1728
Some are from
#define zone_idx(zone)  ((zone) - (zone)->zone_pgdat->node_zones)
but there's
static inline int zone_id(const struct zone *zone)
{ 
struct pglist_data *pgdat = zone->zone_pgdat;

return zone - pgdat->node_zones;
}
and a couple of places where we have
for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
Those bloody well ought to be
for (zone = node_zones, end = node_zones + MAX_NR_ZONES; zone < end; 
zone++) {
 13 sizeof struct vring = 40
 11 sizeof struct usbhsh_device = 24
 10 sizeof struct xpc_partition = 888
  9 sizeof struct skge_element = 40
  9 sizeof struct lock_class = 400
  9 sizeof struct hstate = 29872
That little horror comes from get_hstate_idx() and hstate_index().  Code 
generated
for division is
movabsq $-5542915600080909725, %rax
sarq$4, %rdi
imulq   %rdi, %rax
  7 sizeof struct nvme_rdma_queue = 312
  7 sizeof struct iso_context = 208
  6 sizeof struct i915_power_well = 48
  6 sizeof struct hpet_dev = 168
  6 sizeof struct ext4_extent = 12
  6 sizeof struct esas2r_target = 120
  5 sizeof struct iio_chan_spec = 152
  5 sizeof struct hwspinlock = 96
  4 sizeof struct myri10ge_slice_state = 704
  4 sizeof struct ext4_extent_idx = 12
Another interesting-looking one is struct vhost_net_virtqueue (18080 bytes)

Note that those sizes are rather sensitive to lockdep, spinlock debugging, etc.

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-19 Thread Linus Torvalds

On Fri, Jan 19, 2018 at 2:12 PM, Al Viro  wrote:
> On Fri, Jan 19, 2018 at 10:42:18AM -0800, Linus Torvalds wrote:
>>
>> We *should* be careful about it. I guess sparse could be made to warn,
>> but I'm afraid that we have so many of these things that a warning
>> isn't reasonable.
>
> You mean like -Wptr-subtraction-blows?

Heh. Apparently I already did that trivial warning back in 2005. I'd
forgotten about it.

> FWIW, allmodconfig on amd64 with C=2 CF=-Wptr-subtraction-blows is not too 
> large
>
> IOW it's not terribly noisy.  Might be an interesting idea to teach sparse to
> print the type in question...  Aha - with
>
> --- a/evaluate.c
> +++ b/evaluate.c
> @@ -848,7 +848,8 @@ static struct symbol *evaluate_ptr_sub(struct expression 
> *expr)
>
> if (value & (value-1)) {
> if (Wptr_subtraction_blows)
> -   warning(expr->pos, "potentially expensive 
> pointer subtraction");
> +   warning(expr->pos, "[%s] potentially 
> expensive pointer subtraction",
> +   show_typename(lbase));
> }
>
> sub->op = '-';
>
> we get things like
> drivers/gpu/drm/i915/i915_gem_execbuffer.c:435:17: warning: [struct 
> drm_i915_gem_exec_object2] potentially expensive pointer subtraction

It would probably be good to add the size too, just to explain why
it's potentially expensive.

That said, apparently we do have hundreds of them, with just
cpufreq_frequency_table having a ton. Maybe some are hidden in macros
and removing one removes a lot.

The real problem is that sometimes the subtraction is simply the right
thing to do, and there's no sane way to say "yeah, this is one of
those cases you shouldn't warn about".

 Linus

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-19 Thread Al Viro

On Fri, Jan 19, 2018 at 10:42:18AM -0800, Linus Torvalds wrote:
> On Fri, Jan 19, 2018 at 4:55 AM, Matthew Wilcox  wrote:
> >
> > So really we should be casting 'b' and 'a' to uintptr_t to be fully
> > compliant with the spec.
> 
> That's an unnecessary technicality.
> 
> Any compiler that doesn't get pointer inequality testing right is not
> worth even worrying about. We wouldn't want to use such a compiler,
> because it's intentionally generating garbage just to f*ck with us.
> Why would you go along with that?
> 
> So the only real issue is that pointer subtraction case.
> 
> I actually asked (long long ago) for an optinal compiler warning for
> "pointer subtraction with non-power-of-2 sizes". Not because of it
> being undefined, but simply because it's expensive. The
> divide->multiply thing doesn't always work, and a real divide is
> really quite expensive on many architectures.
> 
> We *should* be careful about it. I guess sparse could be made to warn,
> but I'm afraid that we have so many of these things that a warning
> isn't reasonable.

You mean like -Wptr-subtraction-blows?

FWIW, allmodconfig on amd64 with C=2 CF=-Wptr-subtraction-blows is not too large
The tail (alphabetically sorted) is
lib/dynamic_debug.c:1013:9: warning: potentially expensive pointer subtraction
lib/extable.c:70:28: warning: potentially expensive pointer subtraction
mm/memory_hotplug.c:1530:13: warning: potentially expensive pointer subtraction
mm/memory_hotplug.c:734:13: warning: potentially expensive pointer subtraction
mm/memory_hotplug.c:831:41: warning: potentially expensive pointer subtraction
mm/page_owner.c:607:38: warning: potentially expensive pointer subtraction
mm/vmstat.c:1334:38: warning: potentially expensive pointer subtraction
net/core/net-sysfs.c:1040:19: warning: potentially expensive pointer subtraction
net/ipv4/ipmr.c:3026:32: warning: potentially expensive pointer subtraction
net/ipv6/ip6mr.c:458:32: warning: potentially expensive pointer subtraction
net/mac80211/tx.c:1307:41: warning: potentially expensive pointer subtraction
net/mac80211/tx.c:1351:44: warning: potentially expensive pointer subtraction
net/rds/ib_recv.c:345:49: warning: potentially expensive pointer subtraction
net/rds/ib_recv.c:861:38: warning: potentially expensive pointer subtraction
net/sched/cls_tcindex.c:622:43: warning: potentially expensive pointer 
subtraction
net/sched/sch_cbs.c:302:35: warning: potentially expensive pointer subtraction
net/sched/sch_hhf.c:367:23: warning: potentially expensive pointer subtraction
net/sched/sch_hhf.c:434:38: warning: potentially expensive pointer subtraction
net/sunrpc/svc_xprt.c:1377:43: warning: potentially expensive pointer 
subtraction
sound/pci/hda/hda_generic.c:301:20: warning: potentially expensive pointer 
subtraction

IOW it's not terribly noisy.  Might be an interesting idea to teach sparse to
print the type in question...  Aha - with

--- a/evaluate.c
+++ b/evaluate.c
@@ -848,7 +848,8 @@ static struct symbol *evaluate_ptr_sub(struct expression 
*expr)
 
if (value & (value-1)) {
if (Wptr_subtraction_blows)
-   warning(expr->pos, "potentially expensive 
pointer subtraction");
+   warning(expr->pos, "[%s] potentially expensive 
pointer subtraction",
+   show_typename(lbase));
}
 
sub->op = '-';

we get things like
drivers/gpu/drm/i915/i915_gem_execbuffer.c:435:17: warning: [struct 
drm_i915_gem_exec_object2] potentially expensive pointer subtraction

OK, the top sources of that warning are:
 91 struct cpufreq_frequency_table
 36 struct Indirect (actually, that conflates 
ext2/ext4/minix/sysv)
 21 struct ips_scb
 18 struct runlist_element
 13 struct zone
 13 struct vring
 11 struct usbhsh_device
 10 struct xpc_partition
  9 struct skge_element 
  9 struct lock_class
  9 struct hstate
  7 struct nvme_rdma_queue
  7 struct iso_context  
  6 struct i915_power_well
  6 struct hpet_dev 
  6 struct ext4_extent
  6 struct esas2r_target
  5 struct iio_chan_spec
  5 struct hwspinlock
  4 struct myri10ge_slice_state
  4 struct ext4_extent_idx

everything beyond that is 3 instances or less...

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-19 Thread Linus Torvalds

On Fri, Jan 19, 2018 at 4:55 AM, Matthew Wilcox  wrote:
>
> So really we should be casting 'b' and 'a' to uintptr_t to be fully
> compliant with the spec.

That's an unnecessary technicality.

Any compiler that doesn't get pointer inequality testing right is not
worth even worrying about. We wouldn't want to use such a compiler,
because it's intentionally generating garbage just to f*ck with us.
Why would you go along with that?

So the only real issue is that pointer subtraction case.

I actually asked (long long ago) for an optinal compiler warning for
"pointer subtraction with non-power-of-2 sizes". Not because of it
being undefined, but simply because it's expensive. The
divide->multiply thing doesn't always work, and a real divide is
really quite expensive on many architectures.

We *should* be careful about it. I guess sparse could be made to warn,
but I'm afraid that we have so many of these things that a warning
isn't reasonable.

And most of the time, a pointer difference really is inside the same
array. The operation doesn't tend to make sense otherwise.

   Linus

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-19 Thread Matthew Wilcox

On Fri, Jan 19, 2018 at 02:49:55AM +0300, Kirill A. Shutemov wrote:
> > So that's why you can't do pointer diffs between two arrays. Not
> > because you can't subtract the two pointers, but because the
> > *division* part of the C pointer diff rules leads to issues.
> 
> Thanks a lot for the explanation!
> 
> I wounder if this may be a problem in other places?
> 
> For instance, perf uses address of a mutex to determinate the lock
> ordering. See mutex_lock_double(). The mutex is embedded into struct
> perf_event_context, which is allocated with kzalloc() so I don't see how
> we can presume that alignment is consistent between them.
> 
> I don't think it's the only example in kernel. Are we just lucky?

If you're just *comparing* the addresses of two objects, GCC doesn't
care what the size of the object is.  ie there's a difference between
'if (b < a)' and 'if ((a - b) < n)'.

But yes, if you go by the strict wording of the standard:

  When two pointers are compared, the result depends on the relative
  locations in the address space of the objects pointed to. [...] In
  all other cases, the behavior is undefined

http://www.open-std.org/jtc1/sc22/WG14/www/docs/n1256.pdf

So really we should be casting 'b' and 'a' to uintptr_t to be fully
compliant with the spec.

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-19 Thread Kirill A. Shutemov

On Fri, Jan 19, 2018 at 12:07:47PM +, Michal Hocko wrote:
> > >From 861f68c555b87fd6c0ccc3428ace91b7e185b73a Mon Sep 17 00:00:00 2001
> > From: "Kirill A. Shutemov" 
> > Date: Thu, 18 Jan 2018 18:24:07 +0300
> > Subject: [PATCH] mm, page_vma_mapped: Drop faulty pointer arithmetics in
> >  check_pte()
> > 
> > Tetsuo reported random crashes under memory pressure on 32-bit x86
> > system and tracked down to change that introduced
> > page_vma_mapped_walk().
> > 
> > The root cause of the issue is the faulty pointer math in check_pte().
> > As ->pte may point to an arbitrary page we have to check that they are
> > belong to the section before doing math. Otherwise it may lead to weird
> > results.
> > 
> > It wasn't noticed until now as mem_map[] is virtually contiguous on flatmem 
> > or
> > vmemmap sparsemem. Pointer arithmetic just works against all 'struct page'
> > pointers. But with classic sparsemem, it doesn't.
> 
> it doesn't because each section memap is allocated separately and so
> consecutive pfns crossing two sections might have struct pages at
> completely unrelated addresses.

Okay, I'll amend it.

> > Let's restructure code a bit and replace pointer arithmetic with
> > operations on pfns.
> > 
> > Signed-off-by: Kirill A. Shutemov 
> > Reported-by: Tetsuo Handa 
> > Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
> > Cc: sta...@vger.kernel.org
> > Signed-off-by: Kirill A. Shutemov 
> 
> The patch makes sense but there is one more thing to fix here.
> 
> [...]
> >  static bool check_pte(struct page_vma_mapped_walk *pvmw)
> >  {
> > +   unsigned long pfn;
> > +
> > if (pvmw->flags & PVMW_MIGRATION) {
> >  #ifdef CONFIG_MIGRATION
> > swp_entry_t entry;
> > @@ -41,37 +61,34 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
> >  
> > if (!is_migration_entry(entry))
> > return false;
> > -   if (migration_entry_to_page(entry) - pvmw->page >=
> > -   hpage_nr_pages(pvmw->page)) {
> > -   return false;
> > -   }
> > -   if (migration_entry_to_page(entry) < pvmw->page)
> > -   return false;
> > +
> > +   pfn = migration_entry_to_pfn(entry);
> >  #else
> > WARN_ON_ONCE(1);
> >  #endif
> > -   } else {
> 
> now you allow to pass through with uninitialized pfn. We used to return
> true in that case so we should probably keep it in this WARN_ON_ONCE
> case. Please note that I haven't studied this particular case and the
> ifdef is definitely not an act of art but that is a separate topic.

Good catch. Thanks.

I think returning true here is wrong as we don't validate in any way what
is mapped there. I'll put "return false;".

And I take a look if we can drop the #ifdef.

-- 
 Kirill A. Shutemov

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-19 Thread Michal Hocko

On Fri 19-01-18 14:49:17, Kirill A. Shutemov wrote:
> On Fri, Jan 19, 2018 at 11:33:42AM +0100, Michal Hocko wrote:
> > On Fri 19-01-18 13:02:59, Kirill A. Shutemov wrote:
> > > On Thu, Jan 18, 2018 at 06:22:13PM +0100, Michal Hocko wrote:
> > > > On Thu 18-01-18 18:40:26, Kirill A. Shutemov wrote:
> > > > [...]
> > > > > + /*
> > > > > +  * Make sure that pages are in the same section before doing 
> > > > > pointer
> > > > > +  * arithmetics.
> > > > > +  */
> > > > > + if (page_to_section(pvmw->page) != page_to_section(page))
> > > > > + return false;
> > > > 
> > > > OK, THPs shouldn't cross memory sections AFAIK. My brain is meltdown
> > > > these days so this might be a completely stupid question. But why don't
> > > > you simply compare pfns? This would be just simpler, no?
> > > 
> > > In original code, we already had pvmw->page around and I thought it would
> > > be easier to get page for the pte intead of looking for pfn for both
> > > sides.
> > > 
> > > We these changes it's no longer the case.
> > > 
> > > Do you care enough to send a patch? :)
> > 
> > Well, memory sections are sparsemem concept IIRC. Unless I've missed
> > something page_to_section is quarded by SECTION_IN_PAGE_FLAGS and that
> > is conditional to CONFIG_SPARSEMEM. THP is a generic code so using it
> > there is wrong unless I miss some subtle detail here.
> > 
> > Comparing pfn should be generic enough.
> 
> Good point.
> 
> What about something like this?
> 
> >From 861f68c555b87fd6c0ccc3428ace91b7e185b73a Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" 
> Date: Thu, 18 Jan 2018 18:24:07 +0300
> Subject: [PATCH] mm, page_vma_mapped: Drop faulty pointer arithmetics in
>  check_pte()
> 
> Tetsuo reported random crashes under memory pressure on 32-bit x86
> system and tracked down to change that introduced
> page_vma_mapped_walk().
> 
> The root cause of the issue is the faulty pointer math in check_pte().
> As ->pte may point to an arbitrary page we have to check that they are
> belong to the section before doing math. Otherwise it may lead to weird
> results.
> 
> It wasn't noticed until now as mem_map[] is virtually contiguous on flatmem or
> vmemmap sparsemem. Pointer arithmetic just works against all 'struct page'
> pointers. But with classic sparsemem, it doesn't.

it doesn't because each section memap is allocated separately and so
consecutive pfns crossing two sections might have struct pages at
completely unrelated addresses.

> Let's restructure code a bit and replace pointer arithmetic with
> operations on pfns.
> 
> Signed-off-by: Kirill A. Shutemov 
> Reported-by: Tetsuo Handa 
> Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Kirill A. Shutemov 

The patch makes sense but there is one more thing to fix here.

[...]
>  static bool check_pte(struct page_vma_mapped_walk *pvmw)
>  {
> + unsigned long pfn;
> +
>   if (pvmw->flags & PVMW_MIGRATION) {
>  #ifdef CONFIG_MIGRATION
>   swp_entry_t entry;
> @@ -41,37 +61,34 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
>  
>   if (!is_migration_entry(entry))
>   return false;
> - if (migration_entry_to_page(entry) - pvmw->page >=
> - hpage_nr_pages(pvmw->page)) {
> - return false;
> - }
> - if (migration_entry_to_page(entry) < pvmw->page)
> - return false;
> +
> + pfn = migration_entry_to_pfn(entry);
>  #else
>   WARN_ON_ONCE(1);
>  #endif
> - } else {

now you allow to pass through with uninitialized pfn. We used to return
true in that case so we should probably keep it in this WARN_ON_ONCE
case. Please note that I haven't studied this particular case and the
ifdef is definitely not an act of art but that is a separate topic.

> - if (is_swap_pte(*pvmw->pte)) {
> - swp_entry_t entry;
> + } else if (is_swap_pte(*pvmw->pte)) {
> + swp_entry_t entry;
>  
> - entry = pte_to_swp_entry(*pvmw->pte);
> - if (is_device_private_entry(entry) &&
> - device_private_entry_to_page(entry) == pvmw->page)
> - return true;
> - }
> + /* Handle un-addressable ZONE_DEVICE memory */
> + entry = pte_to_swp_entry(*pvmw->pte);
> + if (!is_device_private_entry(entry))
> + return false;
>  
> + pfn = device_private_entry_to_pfn(entry);
> + } else {
>   if (!pte_present(*pvmw->pte))
>   return false;
>  
> - /* THP can be referenced by any subpage */
> - if (pte_page(*pvmw->pte) - pvmw->page >=
> - hpage_nr_pages(pvmw->page)) {
> - return false;
> - }
> - if (pte_page

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-19 Thread Kirill A. Shutemov

On Fri, Jan 19, 2018 at 11:33:42AM +0100, Michal Hocko wrote:
> On Fri 19-01-18 13:02:59, Kirill A. Shutemov wrote:
> > On Thu, Jan 18, 2018 at 06:22:13PM +0100, Michal Hocko wrote:
> > > On Thu 18-01-18 18:40:26, Kirill A. Shutemov wrote:
> > > [...]
> > > > +   /*
> > > > +* Make sure that pages are in the same section before doing 
> > > > pointer
> > > > +* arithmetics.
> > > > +*/
> > > > +   if (page_to_section(pvmw->page) != page_to_section(page))
> > > > +   return false;
> > > 
> > > OK, THPs shouldn't cross memory sections AFAIK. My brain is meltdown
> > > these days so this might be a completely stupid question. But why don't
> > > you simply compare pfns? This would be just simpler, no?
> > 
> > In original code, we already had pvmw->page around and I thought it would
> > be easier to get page for the pte intead of looking for pfn for both
> > sides.
> > 
> > We these changes it's no longer the case.
> > 
> > Do you care enough to send a patch? :)
> 
> Well, memory sections are sparsemem concept IIRC. Unless I've missed
> something page_to_section is quarded by SECTION_IN_PAGE_FLAGS and that
> is conditional to CONFIG_SPARSEMEM. THP is a generic code so using it
> there is wrong unless I miss some subtle detail here.
> 
> Comparing pfn should be generic enough.

Good point.

What about something like this?

>From 861f68c555b87fd6c0ccc3428ace91b7e185b73a Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" 
Date: Thu, 18 Jan 2018 18:24:07 +0300
Subject: [PATCH] mm, page_vma_mapped: Drop faulty pointer arithmetics in
 check_pte()

Tetsuo reported random crashes under memory pressure on 32-bit x86
system and tracked down to change that introduced
page_vma_mapped_walk().

The root cause of the issue is the faulty pointer math in check_pte().
As ->pte may point to an arbitrary page we have to check that they are
belong to the section before doing math. Otherwise it may lead to weird
results.

It wasn't noticed until now as mem_map[] is virtually contiguous on flatmem or
vmemmap sparsemem. Pointer arithmetic just works against all 'struct page'
pointers. But with classic sparsemem, it doesn't.

Let's restructure code a bit and replace pointer arithmetic with
operations on pfns.

Signed-off-by: Kirill A. Shutemov 
Reported-by: Tetsuo Handa 
Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
Cc: sta...@vger.kernel.org
Signed-off-by: Kirill A. Shutemov 
---
 include/linux/swapops.h | 21 ++
 mm/page_vma_mapped.c| 59 +++--
 2 files changed, 59 insertions(+), 21 deletions(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 9c5a2628d6ce..1d3877c39a00 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -124,6 +124,11 @@ static inline bool 
is_write_device_private_entry(swp_entry_t entry)
return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
 }
 
+static inline unsigned long device_private_entry_to_pfn(swp_entry_t entry)
+{
+   return swp_offset(entry);
+}
+
 static inline struct page *device_private_entry_to_page(swp_entry_t entry)
 {
return pfn_to_page(swp_offset(entry));
@@ -154,6 +159,11 @@ static inline bool 
is_write_device_private_entry(swp_entry_t entry)
return false;
 }
 
+static inline unsigned long device_private_entry_to_pfn(swp_entry_t entry)
+{
+   return 0;
+}
+
 static inline struct page *device_private_entry_to_page(swp_entry_t entry)
 {
return NULL;
@@ -189,6 +199,11 @@ static inline int is_write_migration_entry(swp_entry_t 
entry)
return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE);
 }
 
+static inline unsigned long migration_entry_to_pfn(swp_entry_t entry)
+{
+   return swp_offset(entry);
+}
+
 static inline struct page *migration_entry_to_page(swp_entry_t entry)
 {
struct page *p = pfn_to_page(swp_offset(entry));
@@ -218,6 +233,12 @@ static inline int is_migration_entry(swp_entry_t swp)
 {
return 0;
 }
+
+static inline unsigned long migration_entry_to_pfn(swp_entry_t entry)
+{
+   return 0;
+}
+
 static inline struct page *migration_entry_to_page(swp_entry_t entry)
 {
return NULL;
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index d22b84310f6d..072ae9bc5fee 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -30,8 +30,28 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw)
return true;
 }
 
+/**
+ * check_pte - check if @pvmw->page is mapped at the @pvmw->pte
+ *
+ * page_vma_mapped_walk() found a place where @pvmw->page is *potentially*
+ * mapped. check_pte() has to validate this.
+ *
+ * @pvmw->pte may point to empty PTE, swap PTE or PTE pointing to arbitrary
+ * page.
+ *
+ * If PVMW_MIGRATION flag is set, returns true if @pvmw->pte contains migration
+ * entry that points to @pvmw->page or any subpage in case of THP.
+ *
+ * If PVMW_MIGRATION flag is not set, returns true if @pvmw->pte points to
+ * @pvmw->page

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-19 Thread Michal Hocko

On Fri 19-01-18 13:02:59, Kirill A. Shutemov wrote:
> On Thu, Jan 18, 2018 at 06:22:13PM +0100, Michal Hocko wrote:
> > On Thu 18-01-18 18:40:26, Kirill A. Shutemov wrote:
> > [...]
> > > + /*
> > > +  * Make sure that pages are in the same section before doing pointer
> > > +  * arithmetics.
> > > +  */
> > > + if (page_to_section(pvmw->page) != page_to_section(page))
> > > + return false;
> > 
> > OK, THPs shouldn't cross memory sections AFAIK. My brain is meltdown
> > these days so this might be a completely stupid question. But why don't
> > you simply compare pfns? This would be just simpler, no?
> 
> In original code, we already had pvmw->page around and I thought it would
> be easier to get page for the pte intead of looking for pfn for both
> sides.
> 
> We these changes it's no longer the case.
> 
> Do you care enough to send a patch? :)

Well, memory sections are sparsemem concept IIRC. Unless I've missed
something page_to_section is quarded by SECTION_IN_PAGE_FLAGS and that
is conditional to CONFIG_SPARSEMEM. THP is a generic code so using it
there is wrong unless I miss some subtle detail here.

Comparing pfn should be generic enough.
-- 
Michal Hocko
SUSE Labs

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-19 Thread Kirill A. Shutemov

On Thu, Jan 18, 2018 at 06:22:13PM +0100, Michal Hocko wrote:
> On Thu 18-01-18 18:40:26, Kirill A. Shutemov wrote:
> [...]
> > +   /*
> > +* Make sure that pages are in the same section before doing pointer
> > +* arithmetics.
> > +*/
> > +   if (page_to_section(pvmw->page) != page_to_section(page))
> > +   return false;
> 
> OK, THPs shouldn't cross memory sections AFAIK. My brain is meltdown
> these days so this might be a completely stupid question. But why don't
> you simply compare pfns? This would be just simpler, no?

In original code, we already had pvmw->page around and I thought it would
be easier to get page for the pte intead of looking for pfn for both
sides.

We these changes it's no longer the case.

Do you care enough to send a patch? :)

-- 
 Kirill A. Shutemov

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-18 Thread Tetsuo Handa

Kirill A. Shutemov wrote:
> Something like this?
> 
> 
> From 251e124630da82482e8b320c73162ce89af04d5d Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" 
> Date: Thu, 18 Jan 2018 18:24:07 +0300
> Subject: [PATCH] mm, page_vma_mapped: Fix pointer arithmetics in check_pte()
> 
> Tetsuo reported random crashes under memory pressure on 32-bit x86
> system and tracked down to change that introduced
> page_vma_mapped_walk().
> 
> The root cause of the issue is the faulty pointer math in check_pte().
> As ->pte may point to an arbitrary page we have to check that they are
> belong to the section before doing math. Otherwise it may lead to weird
> results.
> 
> It wasn't noticed until now as mem_map[] is virtually contiguous on flatmem or
> vmemmap sparsemem. Pointer arithmetic just works against all 'struct page'
> pointers. But with classic sparsemem, it doesn't.
> 
> Let's restructure code a bit and add necessary check.
> 
> Signed-off-by: Kirill A. Shutemov 
> Reported-by: Tetsuo Handa 
> Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
> Cc: sta...@vger.kernel.org

This patch solves the problem. Thank you.

Tested-by: Tetsuo Handa 

> ---
>  mm/page_vma_mapped.c | 66 
> +++-
>  1 file changed, 45 insertions(+), 21 deletions(-)
> 
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index d22b84310f6d..de195dcdfbd8 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -30,8 +30,28 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw)
>   return true;
>  }
>  
> +/**
> + * check_pte - check if @pvmw->page is mapped at the @pvmw->pte
> + *
> + * page_vma_mapped_walk() found a place where @pvmw->page is *potentially*
> + * mapped. check_pte() has to validate this.
> + *
> + * @pvmw->pte may point to empty PTE, swap PTE or PTE pointing to arbitrary
> + * page.
> + *
> + * If PVMW_MIGRATION flag is set, returns true if @pvmw->pte contains 
> migration
> + * entry that points to @pvmw->page or any subpage in case of THP.
> + *
> + * If PVMW_MIGRATION flag is not set, returns true if @pvmw->pte points to
> + * @pvmw->page or any subpage in case of THP.
> + *
> + * Otherwise, return false.
> + *
> + */
>  static bool check_pte(struct page_vma_mapped_walk *pvmw)
>  {
> + struct page *page;
> +
>   if (pvmw->flags & PVMW_MIGRATION) {
>  #ifdef CONFIG_MIGRATION
>   swp_entry_t entry;
> @@ -41,37 +61,41 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
>  
>   if (!is_migration_entry(entry))
>   return false;
> - if (migration_entry_to_page(entry) - pvmw->page >=
> - hpage_nr_pages(pvmw->page)) {
> - return false;
> - }
> - if (migration_entry_to_page(entry) < pvmw->page)
> - return false;
> +
> + page = migration_entry_to_page(entry);
>  #else
>   WARN_ON_ONCE(1);
>  #endif
> - } else {
> - if (is_swap_pte(*pvmw->pte)) {
> - swp_entry_t entry;
> + } else if (is_swap_pte(*pvmw->pte)) {
> + swp_entry_t entry;
>  
> - entry = pte_to_swp_entry(*pvmw->pte);
> - if (is_device_private_entry(entry) &&
> - device_private_entry_to_page(entry) == pvmw->page)
> - return true;
> - }
> + /* Handle un-addressable ZONE_DEVICE memory */
> + entry = pte_to_swp_entry(*pvmw->pte);
> + if (!is_device_private_entry(entry))
> + return false;
>  
> + page = device_private_entry_to_page(entry);
> + } else {
>   if (!pte_present(*pvmw->pte))
>   return false;
>  
> - /* THP can be referenced by any subpage */
> - if (pte_page(*pvmw->pte) - pvmw->page >=
> - hpage_nr_pages(pvmw->page)) {
> - return false;
> - }
> - if (pte_page(*pvmw->pte) < pvmw->page)
> - return false;
> + page = pte_page(*pvmw->pte);
>   }
>  
> + /*
> +  * Make sure that pages are in the same section before doing pointer
> +  * arithmetics.
> +  */
> + if (page_to_section(pvmw->page) != page_to_section(page))
> + return false;
> +
> + if (page < pvmw->page)
> + return false;
> +
> + /* THP can be referenced by any subpage */
> + if (page - pvmw->page >= hpage_nr_pages(pvmw->page))
> + return false;
> +
>   return true;
>  }

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-18 Thread Kirill A. Shutemov

On Thu, Jan 18, 2018 at 09:26:25AM -0800, Linus Torvalds wrote:
> On Thu, Jan 18, 2018 at 8:56 AM, Kirill A. Shutemov
>  wrote:
> >
> > I can't say I fully grasp how 'diff' got this value and how it leads to both
> > checks being false.
> 
> I think the problem is that page difference when they are in different 
> sections.
> 
> When you do
> 
>  pte_page(*pvmw->pte) - pvmw->page
> 
> then the compiler takes the pointer difference, and then divides by
> the size of "struct page" to get an index.
> 
> But - and this is important - it does so knowing that the division it
> does will have no modulus: the two 'struct page *' pointers are really
> in the same array, and they really are 'n*sizeof(struct page)' apart
> for some 'n'.
> 
> That means that the compiler can optimize the division. In fact, for
> this case, gcc will generate
> 
> subl%ebx, %eax
> sarl$3, %eax
> imull   $-858993459, %eax, %eax
> 
> because 'struct page' is 40 bytes in size, and that magic sequence
> happens to divide by 40 (first divide by 8, then that magical "imull"
> will divide by 5 *IFF* the thing is evenly divisible by 5 (and not too
> big - but the shift guarantees that).
> 
> Basically, it's a magic trick, because real divides are very
> expensive, but you can fake them more quickly if you can limit the
> input domain.
> 
> But what does it mean if the two "struct page *" are not in the same
> array, and the two arrays were allocated not aligned exactly 40 bytes
> away, but some random number of pages away?
> 
> You get *COMPLETE*GARBAGE* when you do the above optimized divide.
> Suddenly the divide had a modulus (because the base of the two arrays
> weren't 40-byte aligned), and the "trick" doesn't work.
> 
> So that's why you can't do pointer diffs between two arrays. Not
> because you can't subtract the two pointers, but because the
> *division* part of the C pointer diff rules leads to issues.

Thanks a lot for the explanation!

I wounder if this may be a problem in other places?

For instance, perf uses address of a mutex to determinate the lock
ordering. See mutex_lock_double(). The mutex is embedded into struct
perf_event_context, which is allocated with kzalloc() so I don't see how
we can presume that alignment is consistent between them.

I don't think it's the only example in kernel. Are we just lucky?

-- 
 Kirill A. Shutemov

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-18 Thread Linus Torvalds

On Thu, Jan 18, 2018 at 9:26 AM, Luck, Tony  wrote:
>> Both are real page. But why do you expect pages to be 64-byte alinged?
>> Both are aligned to 64-bit as they suppose to be IIUC.
>
> On a 64-bit kernel sizeof struct page == 64 (after much work by people to
> trim out excess stuff).  So I thought we made sure to align the base address
> of blocks of "struct page" so that every one neatly fits into one cache line.

The bug happens on 32-bit, and a 'struct page' is not 64-byte aligned
there at all.

See my other email about the explanation of why "page1 - page2"
doesn't work when they aren't mutually aligned to the actual size of
the 'struct page'.

 Linus

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-18 Thread Linus Torvalds

On Thu, Jan 18, 2018 at 8:56 AM, Kirill A. Shutemov
 wrote:
>
> I can't say I fully grasp how 'diff' got this value and how it leads to both
> checks being false.

I think the problem is that page difference when they are in different sections.

When you do

 pte_page(*pvmw->pte) - pvmw->page

then the compiler takes the pointer difference, and then divides by
the size of "struct page" to get an index.

But - and this is important - it does so knowing that the division it
does will have no modulus: the two 'struct page *' pointers are really
in the same array, and they really are 'n*sizeof(struct page)' apart
for some 'n'.

That means that the compiler can optimize the division. In fact, for
this case, gcc will generate

subl%ebx, %eax
sarl$3, %eax
imull   $-858993459, %eax, %eax

because 'struct page' is 40 bytes in size, and that magic sequence
happens to divide by 40 (first divide by 8, then that magical "imull"
will divide by 5 *IFF* the thing is evenly divisible by 5 (and not too
big - but the shift guarantees that).

Basically, it's a magic trick, because real divides are very
expensive, but you can fake them more quickly if you can limit the
input domain.

But what does it mean if the two "struct page *" are not in the same
array, and the two arrays were allocated not aligned exactly 40 bytes
away, but some random number of pages away?

You get *COMPLETE*GARBAGE* when you do the above optimized divide.
Suddenly the divide had a modulus (because the base of the two arrays
weren't 40-byte aligned), and the "trick" doesn't work.

So that's why you can't do pointer diffs between two arrays. Not
because you can't subtract the two pointers, but because the
*division* part of the C pointer diff rules leads to issues.

Linus

RE: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-18 Thread Luck, Tony

> Both are real page. But why do you expect pages to be 64-byte alinged?
> Both are aligned to 64-bit as they suppose to be IIUC.

On a 64-bit kernel sizeof struct page == 64 (after much work by people to
trim out excess stuff).  So I thought we made sure to align the base address
of blocks of "struct page" so that every one neatly fits into one cache line.

-Tony

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-18 Thread Michal Hocko

On Thu 18-01-18 18:40:26, Kirill A. Shutemov wrote:
[...]
> + /*
> +  * Make sure that pages are in the same section before doing pointer
> +  * arithmetics.
> +  */
> + if (page_to_section(pvmw->page) != page_to_section(page))
> + return false;

OK, THPs shouldn't cross memory sections AFAIK. My brain is meltdown
these days so this might be a completely stupid question. But why don't
you simply compare pfns? This would be just simpler, no?

> +
> + if (page < pvmw->page)
> + return false;
> +
> + /* THP can be referenced by any subpage */
> + if (page - pvmw->page >= hpage_nr_pages(pvmw->page))
> + return false;
> +
>   return true;
>  }
>  
> -- 
>  Kirill A. Shutemov

-- 
Michal Hocko
SUSE Labs

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-18 Thread Linus Torvalds

On Thu, Jan 18, 2018 at 6:38 AM, Dave Hansen
 wrote:
> On 01/18/2018 05:12 AM, Kirill A. Shutemov wrote:
>> - if (pte_page(*pvmw->pte) - pvmw->page >=
>> - hpage_nr_pages(pvmw->page)) {
>
> Is ->pte guaranteed to map a page which is within the same section as
> pvmw->page?  Otherwise, with sparsemem (non-vmemmap), the pointer
> arithmetic won't work.

Lovely.

Finally a reason for this bug that actually seems to make sense.

Thanks guys.

Tetsuo - does Kirill's latest patch fix this for you? The one with

Subject: [PATCH] mm, page_vma_mapped: Fix pointer arithmetics in check_pte()

in the body of the email? I'm really hoping it does, since this seems
to make sense.

 Linus

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-18 Thread Kirill A. Shutemov

On Thu, Jan 18, 2018 at 03:58:30PM +0100, Andrea Arcangeli wrote:
> On Thu, Jan 18, 2018 at 06:45:00AM -0800, Dave Hansen wrote:
> > On 01/18/2018 04:25 AM, Kirill A. Shutemov wrote:
> > > [   10.084024] diff: -858690919
> > > [   10.084258] hpage_nr_pages: 1
> > > [   10.084386] check1: 0
> > > [   10.084478] check2: 0
> > ...
> > > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > > index d22b84310f6d..57b4397f1ea5 100644
> > > --- a/mm/page_vma_mapped.c
> > > +++ b/mm/page_vma_mapped.c
> > > @@ -70,6 +70,14 @@ static bool check_pte(struct page_vma_mapped_walk 
> > > *pvmw)
> > >   }
> > >   if (pte_page(*pvmw->pte) < pvmw->page)
> > >   return false;
> > > +
> > > + if (pte_page(*pvmw->pte) - pvmw->page) {
> > > + printk("diff: %d\n", pte_page(*pvmw->pte) - pvmw->page);
> > > + printk("hpage_nr_pages: %d\n", 
> > > hpage_nr_pages(pvmw->page));
> > > + printk("check1: %d\n", pte_page(*pvmw->pte) - 
> > > pvmw->page < 0);
> > > + printk("check2: %d\n", pte_page(*pvmw->pte) - 
> > > pvmw->page >= hpage_nr_pages(pvmw->page));
> > > + BUG();
> > > + }
> > 
> > This says that pte_page(*pvmw->pte) and pvmw->page are roughly 4GB away
> > from each other (858690919*4=0xccba559c0).  That's not the compiler
> > being wonky, it just means that the virtual addresses of the memory
> > sections are that far apart.
> > 
> > This won't happen when you have vmemmap or flatmem because the mem_map[]
> > is virtually contiguous and pointer arithmetic just works against all
> > 'struct page' pointers.  But with classic sparsemem, it doesn't.
> > 
> > You need to make sure that the PFNs are in the same section before you
> > can do the math that you want to do here.
> 
> Isn't it simply that pvmw->page isn't a page or pte_page(*pvmw->pte)
> isn't a page?
> 
> The distance cannot matter, MMU isn't involved, this is pure 64bit
> aritmetics, 1giga 1 terabyte, 48bits 5level pagetables are meaningless
> in this comparison.

Note, it's 32-bit.

> 
> #include 
> 
> int main()
> {
>   volatile long i;
>   struct x { char a[40]; };
>   for (i = 0; i < 40*3; i += 40) {
>   printf("%ld\n", ((struct x *)0)-struct x *)i;
>   }
>   printf("\n");
>   for (i = 0; i < 40; i += 1) {
>   if (i==4)
>   i = 40;
>   printf("%ld\n", ((struct x *)0)-struct x *)i;
>   }
>   return 0;
> }
> 
> You need to add two debug checks on "pte_page(*pvmw->pte) % 64" and
> same for pvmw->page to find out the one of the two that isn't a page.
> 
> If both are real pages there's a bug that allocates page structs not
> naturally aligned.

Both are real page. But why do you expect pages to be 64-byte alinged?
Both are aligned to 64-bit as they suppose to be IIUC.

I can't say I fully grasp how 'diff' got this value and how it leads to both
checks being false.

[   10.209657] page:f6e4cc38 count:8 mapcount:6 mapping:f5818f94 index:0x0
[   10.209989] flags: 0x3501004d(locked|referenced|uptodate|active|mappedtodisk)
[   10.210496] raw: 3501004d f5818f94  0005 0008 0100 
0200 
[   10.210742] raw: f5c06600 
[   10.210863] page dumped because: pvmw->page
[   10.210992] page->mem_cgroup:f5c06600
[   10.211192] page:f74749d8 count:1 mapcount:1 mapping:f54612d1 index:0x0
[   10.211381] flags: 0x15040068(uptodate|lru|active|swapbacked)
[   10.211530] raw: 15040068 f54612d1   0001 f74749c4 
f75b9014 
[   10.211729] raw: f5c06600 
[   10.211806] page dumped because: pte_page(*pvmw->pte)
[   10.211920] page->mem_cgroup:f5c06600
[   10.212079] diff: -858832092
[   10.212184] hpage_nr_pages: 1
[   10.212284] check1: 0
[   10.212384] check2: 0

-- 
 Kirill A. Shutemov

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-18 Thread Kirill A. Shutemov

On Thu, Jan 18, 2018 at 06:45:00AM -0800, Dave Hansen wrote:
> On 01/18/2018 04:25 AM, Kirill A. Shutemov wrote:
> > [   10.084024] diff: -858690919
> > [   10.084258] hpage_nr_pages: 1
> > [   10.084386] check1: 0
> > [   10.084478] check2: 0
> ...
> > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > index d22b84310f6d..57b4397f1ea5 100644
> > --- a/mm/page_vma_mapped.c
> > +++ b/mm/page_vma_mapped.c
> > @@ -70,6 +70,14 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
> > }
> > if (pte_page(*pvmw->pte) < pvmw->page)
> > return false;
> > +
> > +   if (pte_page(*pvmw->pte) - pvmw->page) {
> > +   printk("diff: %d\n", pte_page(*pvmw->pte) - pvmw->page);
> > +   printk("hpage_nr_pages: %d\n", 
> > hpage_nr_pages(pvmw->page));
> > +   printk("check1: %d\n", pte_page(*pvmw->pte) - 
> > pvmw->page < 0);
> > +   printk("check2: %d\n", pte_page(*pvmw->pte) - 
> > pvmw->page >= hpage_nr_pages(pvmw->page));
> > +   BUG();
> > +   }
> 
> This says that pte_page(*pvmw->pte) and pvmw->page are roughly 4GB away
> from each other (858690919*4=0xccba559c0).  That's not the compiler
> being wonky, it just means that the virtual addresses of the memory
> sections are that far apart.
> 
> This won't happen when you have vmemmap or flatmem because the mem_map[]
> is virtually contiguous and pointer arithmetic just works against all
> 'struct page' pointers.  But with classic sparsemem, it doesn't.
> 
> You need to make sure that the PFNs are in the same section before you
> can do the math that you want to do here.

Something like this?


>From 251e124630da82482e8b320c73162ce89af04d5d Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" 
Date: Thu, 18 Jan 2018 18:24:07 +0300
Subject: [PATCH] mm, page_vma_mapped: Fix pointer arithmetics in check_pte()

Tetsuo reported random crashes under memory pressure on 32-bit x86
system and tracked down to change that introduced
page_vma_mapped_walk().

The root cause of the issue is the faulty pointer math in check_pte().
As ->pte may point to an arbitrary page we have to check that they are
belong to the section before doing math. Otherwise it may lead to weird
results.

It wasn't noticed until now as mem_map[] is virtually contiguous on flatmem or
vmemmap sparsemem. Pointer arithmetic just works against all 'struct page'
pointers. But with classic sparsemem, it doesn't.

Let's restructure code a bit and add necessary check.

Signed-off-by: Kirill A. Shutemov 
Reported-by: Tetsuo Handa 
Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
Cc: sta...@vger.kernel.org
---
 mm/page_vma_mapped.c | 66 +++-
 1 file changed, 45 insertions(+), 21 deletions(-)

diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index d22b84310f6d..de195dcdfbd8 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -30,8 +30,28 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw)
return true;
 }
 
+/**
+ * check_pte - check if @pvmw->page is mapped at the @pvmw->pte
+ *
+ * page_vma_mapped_walk() found a place where @pvmw->page is *potentially*
+ * mapped. check_pte() has to validate this.
+ *
+ * @pvmw->pte may point to empty PTE, swap PTE or PTE pointing to arbitrary
+ * page.
+ *
+ * If PVMW_MIGRATION flag is set, returns true if @pvmw->pte contains migration
+ * entry that points to @pvmw->page or any subpage in case of THP.
+ *
+ * If PVMW_MIGRATION flag is not set, returns true if @pvmw->pte points to
+ * @pvmw->page or any subpage in case of THP.
+ *
+ * Otherwise, return false.
+ *
+ */
 static bool check_pte(struct page_vma_mapped_walk *pvmw)
 {
+   struct page *page;
+
if (pvmw->flags & PVMW_MIGRATION) {
 #ifdef CONFIG_MIGRATION
swp_entry_t entry;
@@ -41,37 +61,41 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 
if (!is_migration_entry(entry))
return false;
-   if (migration_entry_to_page(entry) - pvmw->page >=
-   hpage_nr_pages(pvmw->page)) {
-   return false;
-   }
-   if (migration_entry_to_page(entry) < pvmw->page)
-   return false;
+
+   page = migration_entry_to_page(entry);
 #else
WARN_ON_ONCE(1);
 #endif
-   } else {
-   if (is_swap_pte(*pvmw->pte)) {
-   swp_entry_t entry;
+   } else if (is_swap_pte(*pvmw->pte)) {
+   swp_entry_t entry;
 
-   entry = pte_to_swp_entry(*pvmw->pte);
-   if (is_device_private_entry(entry) &&
-   device_private_entry_to_page(entry) == pvmw->page)
-   return true;
-   }
+   /* Handle un-addressable ZONE_DEVICE memory */
+   entry = pte_to_

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-18 Thread Andrea Arcangeli

On Thu, Jan 18, 2018 at 06:45:00AM -0800, Dave Hansen wrote:
> On 01/18/2018 04:25 AM, Kirill A. Shutemov wrote:
> > [   10.084024] diff: -858690919
> > [   10.084258] hpage_nr_pages: 1
> > [   10.084386] check1: 0
> > [   10.084478] check2: 0
> ...
> > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > index d22b84310f6d..57b4397f1ea5 100644
> > --- a/mm/page_vma_mapped.c
> > +++ b/mm/page_vma_mapped.c
> > @@ -70,6 +70,14 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
> > }
> > if (pte_page(*pvmw->pte) < pvmw->page)
> > return false;
> > +
> > +   if (pte_page(*pvmw->pte) - pvmw->page) {
> > +   printk("diff: %d\n", pte_page(*pvmw->pte) - pvmw->page);
> > +   printk("hpage_nr_pages: %d\n", 
> > hpage_nr_pages(pvmw->page));
> > +   printk("check1: %d\n", pte_page(*pvmw->pte) - 
> > pvmw->page < 0);
> > +   printk("check2: %d\n", pte_page(*pvmw->pte) - 
> > pvmw->page >= hpage_nr_pages(pvmw->page));
> > +   BUG();
> > +   }
> 
> This says that pte_page(*pvmw->pte) and pvmw->page are roughly 4GB away
> from each other (858690919*4=0xccba559c0).  That's not the compiler
> being wonky, it just means that the virtual addresses of the memory
> sections are that far apart.
> 
> This won't happen when you have vmemmap or flatmem because the mem_map[]
> is virtually contiguous and pointer arithmetic just works against all
> 'struct page' pointers.  But with classic sparsemem, it doesn't.
> 
> You need to make sure that the PFNs are in the same section before you
> can do the math that you want to do here.

Isn't it simply that pvmw->page isn't a page or pte_page(*pvmw->pte)
isn't a page?

The distance cannot matter, MMU isn't involved, this is pure 64bit
aritmetics, 1giga 1 terabyte, 48bits 5level pagetables are meaningless
in this comparison.

#include 

int main()
{
volatile long i;
struct x { char a[40]; };
for (i = 0; i < 40*3; i += 40) {
printf("%ld\n", ((struct x *)0)-struct x *)i;
}
printf("\n");
for (i = 0; i < 40; i += 1) {
if (i==4)
i = 40;
printf("%ld\n", ((struct x *)0)-struct x *)i;
}
return 0;
}

You need to add two debug checks on "pte_page(*pvmw->pte) % 64" and
same for pvmw->page to find out the one of the two that isn't a page.

If both are real pages there's a bug that allocates page structs not
naturally aligned.

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-18 Thread Dave Hansen

On 01/18/2018 06:45 AM, Kirill A. Shutemov wrote:
> On Thu, Jan 18, 2018 at 06:38:10AM -0800, Dave Hansen wrote:
>> On 01/18/2018 05:12 AM, Kirill A. Shutemov wrote:
>>> -   if (pte_page(*pvmw->pte) - pvmw->page >=
>>> -   hpage_nr_pages(pvmw->page)) {
>> Is ->pte guaranteed to map a page which is within the same section as
>> pvmw->page?  Otherwise, with sparsemem (non-vmemmap), the pointer
>> arithmetic won't work.
> No, it's not guaranteed. It can be arbitrary page.
> 
> The arithmetic won't work because they are different "memory objects"?

No, because sections' mem_map[]s can be allocated non-contiguously.
Section 1's might be a lower virtual address than Section 0's.

They're allocated not unlike this:

mem_section[0]->section_mem_map = kmalloc(SECTION_SIZE);
mem_section[1]->section_mem_map = kmalloc(SECTION_SIZE);
...

The first pfn in section 1 and the last pfn in section 0 are adjacent
PFNs, but their 'struct page' might have virtual addresses that are TB
apart.

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-18 Thread Kirill A. Shutemov

On Thu, Jan 18, 2018 at 06:38:10AM -0800, Dave Hansen wrote:
> On 01/18/2018 05:12 AM, Kirill A. Shutemov wrote:
> > -   if (pte_page(*pvmw->pte) - pvmw->page >=
> > -   hpage_nr_pages(pvmw->page)) {
> 
> Is ->pte guaranteed to map a page which is within the same section as
> pvmw->page?  Otherwise, with sparsemem (non-vmemmap), the pointer
> arithmetic won't work.

No, it's not guaranteed. It can be arbitrary page.

The arithmetic won't work because they are different "memory objects"?

-- 
 Kirill A. Shutemov

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-18 Thread Dave Hansen

On 01/18/2018 04:25 AM, Kirill A. Shutemov wrote:
> [   10.084024] diff: -858690919
> [   10.084258] hpage_nr_pages: 1
> [   10.084386] check1: 0
> [   10.084478] check2: 0
...
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index d22b84310f6d..57b4397f1ea5 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -70,6 +70,14 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
>   }
>   if (pte_page(*pvmw->pte) < pvmw->page)
>   return false;
> +
> + if (pte_page(*pvmw->pte) - pvmw->page) {
> + printk("diff: %d\n", pte_page(*pvmw->pte) - pvmw->page);
> + printk("hpage_nr_pages: %d\n", 
> hpage_nr_pages(pvmw->page));
> + printk("check1: %d\n", pte_page(*pvmw->pte) - 
> pvmw->page < 0);
> + printk("check2: %d\n", pte_page(*pvmw->pte) - 
> pvmw->page >= hpage_nr_pages(pvmw->page));
> + BUG();
> + }

This says that pte_page(*pvmw->pte) and pvmw->page are roughly 4GB away
from each other (858690919*4=0xccba559c0).  That's not the compiler
being wonky, it just means that the virtual addresses of the memory
sections are that far apart.

This won't happen when you have vmemmap or flatmem because the mem_map[]
is virtually contiguous and pointer arithmetic just works against all
'struct page' pointers.  But with classic sparsemem, it doesn't.

You need to make sure that the PFNs are in the same section before you
can do the math that you want to do here.

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-18 Thread Dave Hansen

On 01/18/2018 05:12 AM, Kirill A. Shutemov wrote:
> - if (pte_page(*pvmw->pte) - pvmw->page >=
> - hpage_nr_pages(pvmw->page)) {

Is ->pte guaranteed to map a page which is within the same section as
pvmw->page?  Otherwise, with sparsemem (non-vmemmap), the pointer
arithmetic won't work.

WARN_ON_ONCE(page_to_section(pvmw->page) !=
 page_to_section(pte_page(*pvmw->pte));

Will detect that.

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-18 Thread Kirill A. Shutemov

On Thu, Jan 18, 2018 at 04:12:10PM +0300, Kirill A. Shutemov wrote:
> On Thu, Jan 18, 2018 at 03:25:50PM +0300, Kirill A. Shutemov wrote:
> > On Thu, Jan 18, 2018 at 05:12:45PM +0900, Tetsuo Handa wrote:
> > > Tetsuo Handa wrote:
> > > > OK. I missed the mark. I overlooked that 4.11 already has this problem.
> > > > 
> > > > I needed to bisect between 4.10 and 4.11, and I got plausible culprit.
> > > > 
> > > > I haven't completed bisecting between b4fb8f66f1ae2e16 and 
> > > > c470abd4fde40ea6, but
> > > > b4fb8f66f1ae2e16 ("mm, page_alloc: Add missing check for memory holes") 
> > > > and
> > > > 13ad59df67f19788 ("mm, page_alloc: avoid page_to_pfn() when merging 
> > > > buddies")
> > > > are talking about memory holes, which matches the situation that I'm 
> > > > trivially
> > > > hitting the bug if CONFIG_SPARSEMEM=y .
> > > > 
> > > > Thus, I call for an attention by speculative execution. ;-)
> > > 
> > > Speculative execution failed. I was confused by jiffies precision bug.
> > > The final culprit is c7ab0d2fdc840266 ("mm: convert try_to_unmap_one() to 
> > > use page_vma_mapped_walk()").
> > 
> > I think I've tracked it down. check_pte() in mm/page_vma_mapped.c doesn't
> > work as intended.
> > 
> > I've added instrumentation below to prove it.
> > 
> > The BUG() triggers with following output:
> > 
> > [   10.084024] diff: -858690919
> > [   10.084258] hpage_nr_pages: 1
> > [   10.084386] check1: 0
> > [   10.084478] check2: 0
> > 
> > Basically, pte_page(*pvmw->pte) is below pvmw->page, but
> > (pte_page(*pvmw->pte) < pvmw->page) doesn't catch it.
> > 
> > Well, I can see how C lawyer can argue that you can only compare pointers
> > of the same memory object which is not the case here. But this is kinda
> > insane.
> > 
> > Any suggestions how to rewrite it in a way that compiler would
> > understand?
> 
> The patch below makes the crash go away for me.
> 
> But this is situation is scary. So we cannot compare arbitrary pointers in
> kernel?
> 
> Don't we rely on this for lock ordering in some cases? Like in
> mutex_lock_double()?
> 
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index d22b84310f6d..1f0f512fd127 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -51,6 +51,8 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
>   WARN_ON_ONCE(1);
>  #endif
>   } else {
> + unsigned long ptr1, ptr2;
> +
>   if (is_swap_pte(*pvmw->pte)) {
>   swp_entry_t entry;
>  
> @@ -63,12 +65,14 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
>   if (!pte_present(*pvmw->pte))
>   return false;
>  
> - /* THP can be referenced by any subpage */
> - if (pte_page(*pvmw->pte) - pvmw->page >=
> - hpage_nr_pages(pvmw->page)) {
> + ptr1 = (unsigned long)pte_page(*pvmw->pte);
> + ptr2 = (unsigned long)pvmw->page;
> +
> + if (ptr1 < ptr2)
>   return false;
> - }
> - if (pte_page(*pvmw->pte) < pvmw->page)
> +
> + /* THP can be referenced by any subpage */
> + if (ptr1 - ptr2 >= hpage_nr_pages(pvmw->page))

Arghhh.. It has to be

if (ptr1 - ptr2 >= hpage_nr_pages(pvmw->page) * 
sizeof(*pvmw->page))

-- 
 Kirill A. Shutemov

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-18 Thread Kirill A. Shutemov

On Thu, Jan 18, 2018 at 03:25:50PM +0300, Kirill A. Shutemov wrote:
> On Thu, Jan 18, 2018 at 05:12:45PM +0900, Tetsuo Handa wrote:
> > Tetsuo Handa wrote:
> > > OK. I missed the mark. I overlooked that 4.11 already has this problem.
> > > 
> > > I needed to bisect between 4.10 and 4.11, and I got plausible culprit.
> > > 
> > > I haven't completed bisecting between b4fb8f66f1ae2e16 and 
> > > c470abd4fde40ea6, but
> > > b4fb8f66f1ae2e16 ("mm, page_alloc: Add missing check for memory holes") 
> > > and
> > > 13ad59df67f19788 ("mm, page_alloc: avoid page_to_pfn() when merging 
> > > buddies")
> > > are talking about memory holes, which matches the situation that I'm 
> > > trivially
> > > hitting the bug if CONFIG_SPARSEMEM=y .
> > > 
> > > Thus, I call for an attention by speculative execution. ;-)
> > 
> > Speculative execution failed. I was confused by jiffies precision bug.
> > The final culprit is c7ab0d2fdc840266 ("mm: convert try_to_unmap_one() to 
> > use page_vma_mapped_walk()").
> 
> I think I've tracked it down. check_pte() in mm/page_vma_mapped.c doesn't
> work as intended.
> 
> I've added instrumentation below to prove it.
> 
> The BUG() triggers with following output:
> 
> [   10.084024] diff: -858690919
> [   10.084258] hpage_nr_pages: 1
> [   10.084386] check1: 0
> [   10.084478] check2: 0
> 
> Basically, pte_page(*pvmw->pte) is below pvmw->page, but
> (pte_page(*pvmw->pte) < pvmw->page) doesn't catch it.
> 
> Well, I can see how C lawyer can argue that you can only compare pointers
> of the same memory object which is not the case here. But this is kinda
> insane.
> 
> Any suggestions how to rewrite it in a way that compiler would
> understand?

The patch below makes the crash go away for me.

But this is situation is scary. So we cannot compare arbitrary pointers in
kernel?

Don't we rely on this for lock ordering in some cases? Like in
mutex_lock_double()?

diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index d22b84310f6d..1f0f512fd127 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -51,6 +51,8 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
WARN_ON_ONCE(1);
 #endif
} else {
+   unsigned long ptr1, ptr2;
+
if (is_swap_pte(*pvmw->pte)) {
swp_entry_t entry;
 
@@ -63,12 +65,14 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
if (!pte_present(*pvmw->pte))
return false;
 
-   /* THP can be referenced by any subpage */
-   if (pte_page(*pvmw->pte) - pvmw->page >=
-   hpage_nr_pages(pvmw->page)) {
+   ptr1 = (unsigned long)pte_page(*pvmw->pte);
+   ptr2 = (unsigned long)pvmw->page;
+
+   if (ptr1 < ptr2)
return false;
-   }
-   if (pte_page(*pvmw->pte) < pvmw->page)
+
+   /* THP can be referenced by any subpage */
+   if (ptr1 - ptr2 >= hpage_nr_pages(pvmw->page))
return false;
}
 
-- 
 Kirill A. Shutemov

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-18 Thread Kirill A. Shutemov

On Thu, Jan 18, 2018 at 05:12:45PM +0900, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > OK. I missed the mark. I overlooked that 4.11 already has this problem.
> > 
> > I needed to bisect between 4.10 and 4.11, and I got plausible culprit.
> > 
> > I haven't completed bisecting between b4fb8f66f1ae2e16 and 
> > c470abd4fde40ea6, but
> > b4fb8f66f1ae2e16 ("mm, page_alloc: Add missing check for memory holes") and
> > 13ad59df67f19788 ("mm, page_alloc: avoid page_to_pfn() when merging 
> > buddies")
> > are talking about memory holes, which matches the situation that I'm 
> > trivially
> > hitting the bug if CONFIG_SPARSEMEM=y .
> > 
> > Thus, I call for an attention by speculative execution. ;-)
> 
> Speculative execution failed. I was confused by jiffies precision bug.
> The final culprit is c7ab0d2fdc840266 ("mm: convert try_to_unmap_one() to use 
> page_vma_mapped_walk()").

I think I've tracked it down. check_pte() in mm/page_vma_mapped.c doesn't
work as intended.

I've added instrumentation below to prove it.

The BUG() triggers with following output:

[   10.084024] diff: -858690919
[   10.084258] hpage_nr_pages: 1
[   10.084386] check1: 0
[   10.084478] check2: 0

Basically, pte_page(*pvmw->pte) is below pvmw->page, but
(pte_page(*pvmw->pte) < pvmw->page) doesn't catch it.

Well, I can see how C lawyer can argue that you can only compare pointers
of the same memory object which is not the case here. But this is kinda
insane.

Any suggestions how to rewrite it in a way that compiler would
understand?

diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index d22b84310f6d..57b4397f1ea5 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -70,6 +70,14 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
}
if (pte_page(*pvmw->pte) < pvmw->page)
return false;
+
+   if (pte_page(*pvmw->pte) - pvmw->page) {
+   printk("diff: %d\n", pte_page(*pvmw->pte) - pvmw->page);
+   printk("hpage_nr_pages: %d\n", 
hpage_nr_pages(pvmw->page));
+   printk("check1: %d\n", pte_page(*pvmw->pte) - 
pvmw->page < 0);
+   printk("check2: %d\n", pte_page(*pvmw->pte) - 
pvmw->page >= hpage_nr_pages(pvmw->page));
+   BUG();
+   }
}
 
return true;
-- 
 Kirill A. Shutemov

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-18 Thread Tetsuo Handa

Tetsuo Handa wrote:
> OK. I missed the mark. I overlooked that 4.11 already has this problem.
> 
> I needed to bisect between 4.10 and 4.11, and I got plausible culprit.
> 
> I haven't completed bisecting between b4fb8f66f1ae2e16 and c470abd4fde40ea6, 
> but
> b4fb8f66f1ae2e16 ("mm, page_alloc: Add missing check for memory holes") and
> 13ad59df67f19788 ("mm, page_alloc: avoid page_to_pfn() when merging buddies")
> are talking about memory holes, which matches the situation that I'm trivially
> hitting the bug if CONFIG_SPARSEMEM=y .
> 
> Thus, I call for an attention by speculative execution. ;-)

Speculative execution failed. I was confused by jiffies precision bug.
The final culprit is c7ab0d2fdc840266 ("mm: convert try_to_unmap_one() to use 
page_vma_mapped_walk()").


[  103.132986] BUG: unable to handle kernel paging request at b6c00171
[  103.132996] IP: lock_page_memcg+0x3a/0x70
[  103.132997] *pde =  
[  103.132997] 
[  103.132999] Oops:  [#1] SMP DEBUG_PAGEALLOC
[  103.133000] Modules linked in: pcspkr shpchp serio_raw
[  103.133006] CPU: 3 PID: 62 Comm: kswapd0 Not tainted 
4.10.0-09628-gc7ab0d2-dirty #359
[  103.133007] Hardware name: VMware, Inc. VMware Virtual Platform/440BX 
Desktop Reference Platform, BIOS 6.00 07/02/2015
[  103.133008] task: f4559b00 task.stack: f2c74000
[  103.133011] EIP: lock_page_memcg+0x3a/0x70
[  103.133012] EFLAGS: 00010282 CPU: 3
[  103.133013] EAX: f3108060 EBX: b6c1 ECX: 01c8e5a0 EDX: 
[  103.133014] ESI: f4afe5a0 EDI: f3108060 EBP: f2c75c10 ESP: f2c75c04
[  103.133015]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[  103.133017] CR0: 80050033 CR2: b6c00171 CR3: 0170a000 CR4: 000406d0
[  103.133098] Call Trace:
[  103.133104]  page_remove_rmap+0x92/0x260
[  103.133106]  try_to_unmap_one+0x210/0x4b0
[  103.133108]  rmap_walk_file+0xf0/0x200
[  103.133111]  rmap_walk+0x32/0x60
[  103.133112]  try_to_unmap+0x95/0x120
[  103.133114]  ? page_remove_rmap+0x260/0x260
[  103.133116]  ? page_not_mapped+0x10/0x10
[  103.133118]  ? page_get_anon_vma+0x90/0x90
[  103.133120]  shrink_page_list+0x3af/0xc40
[  103.133123]  shrink_inactive_list+0x173/0x370
[  103.133125]  shrink_node_memcg+0x572/0x7d0
[  103.133128]  ? __list_lru_count_one.isra.5+0x14/0x40
[  103.133130]  shrink_node+0xb3/0x2c0
[  103.133132]  kswapd+0x27f/0x5a0
[  103.133137]  kthread+0xd1/0x100
[  103.133139]  ? mem_cgroup_shrink_node+0xa0/0xa0
[  103.133140]  ? kthread_park+0x70/0x70
[  103.133144]  ret_from_fork+0x21/0x2c
[  103.133145] Code: 20 85 db 75 26 eb 2e 66 90 8d b3 74 01 00 00 89 f0 e8 9b 
5a 37 00 3b 5f 20 74 26 89 c2 89 f0 e8 3d 57 37 00 8b 5f 20 85 db 74 0a <8b> 83 
70 01 00 00 85 c0 7f d4 5b 5e 5f 5d f3 c3 8d b6 00 00 00
[  103.133171] EIP: lock_page_memcg+0x3a/0x70 SS:ESP: 0068:f2c75c04
[  103.133172] CR2: b6c00171
[  103.133175] ---[ end trace fa59c5a5ab752d7a ]---



# bad: [86292b33d4b79ee03e2f43ea0381ef85f077c760] Merge branch 'akpm' (patches 
from Andrew)
# good: [13ad59df67f19788f6c22985b1a33e466eceb643] mm, page_alloc: avoid 
page_to_pfn() when merging buddies
git bisect start '86292b33d4b79ee03e2f43ea0381ef85f077c760' '13ad59df67f19788' 
'mm/'
# good: [0262d9c845ec349edf93f69688a5129c36cc2232] memblock: embed memblock 
type name within struct memblock_type
git bisect good 0262d9c845ec349edf93f69688a5129c36cc2232
# good: [0262d9c845ec349edf93f69688a5129c36cc2232] memblock: embed memblock 
type name within struct memblock_type
git bisect good 0262d9c845ec349edf93f69688a5129c36cc2232
# bad: [897ab3e0c49e24b62e2d54d165c7afec6bbca65b] userfaultfd: non-cooperative: 
add event for memory unmaps
git bisect bad 897ab3e0c49e24b62e2d54d165c7afec6bbca65b
# good: [c791ace1e747371658237f0d30234fef56c39669] mm: replace FAULT_FLAG_SIZE 
with parameter to huge_fault
git bisect good c791ace1e747371658237f0d30234fef56c39669
# good: [8eaedede825a02dbe2420b9c9be9b5b2d7515496] mm: fix handling PTE-mapped 
THPs in page_referenced()
git bisect good 8eaedede825a02dbe2420b9c9be9b5b2d7515496
# bad: [36eaff3364e8cd35552a77ee426a8170f4f5fde9] mm, ksm: convert 
write_protect_page() to use page_vma_mapped_walk()
git bisect bad 36eaff3364e8cd35552a77ee426a8170f4f5fde9
# good: [a8fa41ad2f6f7ca08edd1afcf8149ae5a4dcf654] mm, rmap: check all VMAs 
that PTE-mapped THP can be part of
git bisect good a8fa41ad2f6f7ca08edd1afcf8149ae5a4dcf654
# bad: [c7ab0d2fdc840266b39db94538f74207ec2afbf6] mm: convert 
try_to_unmap_one() to use page_vma_mapped_walk()
git bisect bad c7ab0d2fdc840266b39db94538f74207ec2afbf6
# good: [f27176cfc363d395eea8dc5c4a26e5d6d7d65eaf] mm: convert 
page_mkclean_one() to use page_vma_mapped_walk()
git bisect good f27176cfc363d395eea8dc5c4a26e5d6d7d65eaf
# first bad commit: [c7ab0d2fdc840266b39db94538f74207ec2afbf6] mm: convert 
try_to_unmap_one() to use page_vma_mapped_walk()

bad 4.10.0-10531-g86292b3-dirty  # 86292b33d4b79ee0 ("Merge branch 'akpm' 
(patc

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-17 Thread Linus Torvalds

On Wed, Jan 17, 2018 at 2:00 PM, Dave Hansen
 wrote:
>
> I thought that page_zone_id() stuff was there to prevent this kind of
> cross-zone stuff from happening.

Ahh, that was the part I missed. Yeah looks like that checks things
properly. Although the mask generation is *so* confusing that I
stopped following it and will just take your word for it ;)

 Linus

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-17 Thread Dave Hansen

On 01/17/2018 01:51 PM, Linus Torvalds wrote:
> In fact, it seems to be such a fundamental bug that I suspect I'm
> entirely wrong, and full of shit. So it's an interesting and not
> _obviously_ incorrect theory, but I suspect I must be missing
> something.

I'll just note that a few of the pfns I decoded were smack in the middle
of the zone, not near either the high or low end of ZONE_NORMAL where we
would expect this cross-zone stuff to happen.

But I guess we could get similar wonkiness where 'struct page' is
screwed up in so many different ways if during buddy joining you do:

list_del(&buddy->lru);

and 'buddy' is off in another zone for which you do not hold the
spinlock.  If we are somehow missing some locking, or double-allocating
a page, something like this would help:

 static inline void rmv_page_order(struct page *page)
 {
+WARN_ON_ONCE(!PageBuddy(page));
 __ClearPageBuddy(page);
 set_page_private(page, 0);
 }

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-17 Thread Dave Hansen

On 01/17/2018 01:39 PM, Linus Torvalds wrote:
> 
> So maybe something like this to test the theory?
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 76c9688b6a0a..f919a5548943 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -756,6 +756,8 @@ static inline void rmv_page_order(struct page *page)
>  static inline int page_is_buddy(struct page *page, struct page *buddy,
> unsigned int 
> order)
>  {
> +   if (WARN_ON_ONCE(page_zone(page) != page_zone(buddy)))
> +   return 0;
> if (page_is_guard(buddy) && page_order(buddy) == order) {
> if (page_zone_id(page) != page_zone_id(buddy))
> return 0;

I thought that page_zone_id() stuff was there to prevent this kind of
cross-zone stuff from happening.

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-17 Thread Linus Torvalds

On Wed, Jan 17, 2018 at 1:39 PM, Linus Torvalds
 wrote:
>
> In fact, the whole
>
>pfn_valid_within(buddy_pfn)
>
> test looks very odd. Maybe the pfn of the buddy is valid, but it's not
> in the same zone? Then we'd combine the two pages in two different
> zones into one combined page.

It might also be the same allocation zone, but if the pfn's are in
different sparsemem sections that would also be problematic.

But I hope/assume that all sparsemem sections are always aligned to
(PAGE_SIZE << MAXORDER).

In contrast, the ZONE_HIGHMEM limit really does seems to be
potentially not aligned to anything, ie

 arch/x86/include/asm/pgtable_32_types.h:
 #define MAXMEM  (VMALLOC_END - PAGE_OFFSET - __VMALLOC_RESERVE)

which I have no idea what the alignment is, but VMALLOC_END at least
does not seem to have any MAXORDER alignment.

So it really does look like the zone for two page orders that would
otherwise be buddies might actually be different.

Interesting if this really is the case. Because afaik, if that
WARN_ON_ONCE actually triggers, it does seem like this bug could go
back pretty much forever.

In fact, it seems to be such a fundamental bug that I suspect I'm
entirely wrong, and full of shit. So it's an interesting and not
_obviously_ incorrect theory, but I suspect I must be missing
something.

  Linus

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-17 Thread Linus Torvalds

On Wed, Jan 17, 2018 at 3:08 AM, Tetsuo Handa
 wrote:
>
> I needed to bisect between 4.10 and 4.11, and I got plausible culprit.
> [...]
> git bisect bad b4fb8f66f1ae2e167d06c12d018025a8d4d3ba7e
> # first bad commit: [b4fb8f66f1ae2e167d06c12d018025a8d4d3ba7e] mm, 
> page_alloc: Add missing check for memory holes

Ok, that is indeed much more likely, and very much matches the whole
"this problem only happens with sparsemem" issue.

In fact, the whole

   pfn_valid_within(buddy_pfn)

test looks very odd. Maybe the pfn of the buddy is valid, but it's not
in the same zone? Then we'd combine the two pages in two different
zones into one combined page.

Maybe that's why HIGHMEM matters? The low DMA zone is obviously
aligned in the whole PAGE_ORDER range. But the highmem zone might not
be. I used to know the highmem code, but I've happily forgotten
everything. But I think we end up deciding on some random non-aligned
number in the 900MB range as being the limit between the regular zone
and the HIGHMEM zone.

So maybe something like this to test the theory?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 76c9688b6a0a..f919a5548943 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -756,6 +756,8 @@ static inline void rmv_page_order(struct page *page)
 static inline int page_is_buddy(struct page *page, struct page *buddy,
unsigned int order)
 {
+   if (WARN_ON_ONCE(page_zone(page) != page_zone(buddy)))
+   return 0;
if (page_is_guard(buddy) && page_order(buddy) == order) {
if (page_zone_id(page) != page_zone_id(buddy))
return 0;

I don't know. Does that warning trigger for you?

The above is completely untested. It might not compile. If it compiles
it might not work. And even if it "works", it might not matter,
because perhaps the boundary between regular memory and HIGHMEM is
already sufficiently aligned.

Comments?

Linus

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-17 Thread Tetsuo Handa

Linus Torvalds wrote:
> > It turned out that CONFIG_FLATMEM was irrelevant. I just did not hit it.
> 
> So have you actually been able to see the problem with FLATMEM, or is
> this based on the bisect? Because I really think the bisect is pretty
> much guaranteed to be wrong.

Oops, this "it" is "a different bug where bootup of qemu randomly hangs at

[0.001000] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[0.003000] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
[0.003000] ...trying to set up timer (IRQ0) through the 8259A ...
[0.003000] . (found apic 0 pin 2) ...
[0.005000] ... failed.
[0.005000] ...trying to set up timer as Virtual Wire IRQ...
[   13.12] random: crng init done

". This bug occurs with both CONFIG_FLATMEM=y or CONFIG_SPARSEMEM=y .



> On Tue, Jan 16, 2018 at 9:33 AM, Tetsuo Handa
>  wrote:
> >
> > Since I got a faster reproducer, I tried full bisection between 4.11 and 
> > 4.12-rc1.
> > But I have no idea why bisection arrives at c0332694903a37cf.
> 
> I don't think your reproducer is 100% reliable.
> 
> And bisection is great because it's very aggressive and optimal when
> it comes to testing. But that also implies that if *any* of the
> good/bad choices were incorrect, then the end result is pure garbage
> and isn't even *close* to the right commit.

OK. I missed the mark. I overlooked that 4.11 already has this problem.

--
[   40.272926] BUG: unable to handle kernel paging request at f6d74b44
[   40.272934] IP: page_remove_rmap+0x7/0x2c0
[   40.272935] *pde = 3732c067 
[   40.272936] *pte = 36d74062 
[   40.272936] 
[   40.272938] Oops:  [#1] SMP DEBUG_PAGEALLOC
[   40.272939] Modules linked in: xfs libcrc32c sr_mod cdrom sd_mod ata_generic 
pata_acpi serio_raw mptspi scsi_transport_spi mptscsih e1000 mptbase ata_piix 
libata
[   40.272952] CPU: 6 PID: 719 Comm: b.out Not tainted 4.11.0 #266
[   40.272952] Hardware name: VMware, Inc. VMware Virtual Platform/440BX 
Desktop Reference Platform, BIOS 6.00 07/02/2015
[   40.272953] task: ef4c1e40 task.stack: ef578000
[   40.272955] EIP: page_remove_rmap+0x7/0x2c0
[   40.272956] EFLAGS: 00010246 CPU: 6
[   40.272957] EAX: f6d74b30 EBX: f6d74b30 ECX:  EDX: 
[   40.272958] ESI: ef7d9640 EDI: 0093 EBP: ef579a78 ESP: ef579a70
[   40.272959]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[   40.272961] CR0: 80050033 CR2: f6d74b44 CR3: 2f4a8000 CR4: 000406d0
[   40.273046] Call Trace:
[   40.273050]  try_to_unmap_one+0x206/0x4f0
[   40.273055]  rmap_walk_file+0x13c/0x270
[   40.273057]  rmap_walk+0x32/0x60
[   40.273058]  try_to_unmap+0xad/0x150
[   40.273060]  ? page_remove_rmap+0x2c0/0x2c0
[   40.273062]  ? page_not_mapped+0x10/0x10
[   40.273063]  ? page_get_anon_vma+0x90/0x90
[   40.273066]  shrink_page_list+0x37a/0xd10
[   40.273069]  shrink_inactive_list+0x173/0x370
[   40.273072]  shrink_node_memcg+0x572/0x7d0
[   40.273074]  ? shrink_slab+0x1a0/0x2e0
[   40.273077]  shrink_node+0xb3/0x2c0
[   40.273079]  do_try_to_free_pages+0xb2/0x2b0
[   40.273081]  try_to_free_pages+0x1a4/0x2f0
[   40.273085]  ? schedule_timeout+0x142/0x200
[   40.273088]  __alloc_pages_slowpath+0x360/0x7e6
[   40.273091]  __alloc_pages_nodemask+0x1a4/0x1b0
[   40.273093]  do_anonymous_page+0xcb/0x500
[   40.273120]  ? xfs_filemap_fault+0x36/0x40 [xfs]
[   40.273122]  handle_mm_fault+0x52f/0x990
[   40.273125]  __do_page_fault+0x19c/0x460
[   40.273127]  ? __do_page_fault+0x460/0x460
[   40.273129]  do_page_fault+0x1a/0x20
[   40.273131]  common_exception+0x6c/0x72
[   40.273133] EIP: 0x8048437
[   40.273133] EFLAGS: 00010202 CPU: 6
[   40.273134] EAX: 001f8000 EBX: 7ff0 ECX: 37803008 EDX: 
[   40.273135] ESI: 7ff0 EDI:  EBP: bfeffa68 ESP: bfeffa30
[   40.273137]  DS: 007b ES: 007b FS:  GS: 0033 SS: 007b
[   40.273137] Code: ff ff ba 78 50 7a c1 89 d8 e8 a6 f8 fe ff 0f 0b 83 e8 01 
e9 66 ff ff ff 8d b6 00 00 00 00 8d bf 00 00 00 00 55 89 e5 56 53 89 c3 <8b> 40 
14 a8 01 0f 85 a4 01 00 00 89 d8 f6 40 04 01 74 5e 84 d2
[   40.273161] EIP: page_remove_rmap+0x7/0x2c0 SS:ESP: 0068:ef579a70
[   40.273162] CR2: f6d74b44
[   40.273164] ---[ end trace 902077077bed43fd ]---
--

I needed to bisect between 4.10 and 4.11, and I got plausible culprit.

--
# bad: [a351e9b9fc24e982ec2f0e76379a49826036da12] Linux 4.11
# good: [c470abd4fde40ea6a0846a2beab642a578c0b8cd] Linux 4.10
# good: [69973b830859bc6529a7a0468ba0d80ee5117826] Linux 4.9
# good: [c8d2bc9bc39ebea8437fd974fdbc21847bb897a3] Linux 4.8
# good: [523d939ef98fd712632d93a5a2b588e477a7565e] Linux 4.7
# good: [2dcd0af568b0cf583645c8a317dd12e344b1c72a] Linux 4.6
# good: [b562e44f507e863c6792946e4e1b1449fbbac85d] Linux 4.5
# good: [afd2ff9b7e1b367172f18ba7f693dfb62bdcb2dc] Linux 4.4
# good: [6a13feb9c82803e2b815eca72fa7a9f5561d7861] Linux 4.3
# good: [64291f7db5bd8150a74ad2036f1037e6a0428df2] Linux 4.2
# good: [b953c0d234bc72e8489d3bf51a276c5c4ec85345] Linux 4.1
# good: [39a8804455fb23f09157341d3ba7

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-16 Thread Linus Torvalds

On Tue, Jan 16, 2018 at 9:33 AM, Tetsuo Handa
 wrote:
>
> Since I got a faster reproducer, I tried full bisection between 4.11 and 
> 4.12-rc1.
> But I have no idea why bisection arrives at c0332694903a37cf.

I don't think your reproducer is 100% reliable.

And bisection is great because it's very aggressive and optimal when
it comes to testing. But that also implies that if *any* of the
good/bad choices were incorrect, then the end result is pure garbage
and isn't even *close* to the right commit.

> It turned out that CONFIG_FLATMEM was irrelevant. I just did not hit it.

So have you actually been able to see the problem with FLATMEM, or is
this based on the bisect? Because I really think the bisect is pretty
much guaranteed to be wrong.

   Linus

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-16 Thread Linus Torvalds

On Tue, Jan 16, 2018 at 12:06 AM, Dave Hansen
 wrote:
> On 01/15/2018 06:14 PM, Linus Torvalds wrote:
>> But I'm adding Dave Hansen explicitly to the cc, in case he has any
>> ideas. Not because I blame him, but he's touched the sparsemem code
>> fairly recently, so maybe he'd have some idea on adding sanity
>> checking to the sparsemem version of pfn_to_page().
>
> I swear I haven't touched it lately!

Heh. I did

git blame -C mm/sparse.c | grep 2017

and your name shows up at the beginning a lot because of commit
c4e1be9ec113 ("mm, sparsemem: break out of loops early").

And Michal Hocko (who shows up even more) was already on the cc.

> I'm not sure I'd go after pfn_to_page().  *Maybe* if we were close to
> the places where we've done a pfn_to_page(), but I'm not seeing those.

Fair enough. I just wanted to add debugging, looked at Tetsuo's
config, and went "no way am I adding debugging to the sparsemem case
because it's so confusing".

That said, I also started looking at "kmap_to_page()". That's
something that is *really* different with HIGHMEM, and while most of
the users are in random drivers that do crazy things, I do note that
one of the users is in mm/swap.c.

That thing goes back to commit 5a178119b0fb ("mm: add support for
direct_IO to highmem pages") and was only used for swap_writepage(),
if I read this right.

That swap_writepage() use of kmap()'ed patches was removed some time
later in commit 62a8067a7f35 ("bio_vec-backed iov_iter"), but the
crazy kmap_to_page() thing remained.

I see nothing actively wrong in there, but it really feels like a
"that is all bogus" thing to me.

> Did anyone else notice the
>
> [   31.068198]  ? vmalloc_sync_all+0x150/0x150
>
> present in a bunch of the stack traces?  That should be pretty uncommon.

No, didn't notice that. And yes, vmalloc_sync_all() might be interesting.

>  Is it just part of the normal do_page_fault() stack and the stack
> dumper picks up on it?

I don't think so. It should *not* happen normally. The fact that it
shows up in the trace means it happened that time.

It doesn't seem HIGHMEM-related, though. Or maybe the highmem signal
is bogus too, and it's just that Tetsuo's reproducer needs magical
timing.

 Linus

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-16 Thread Tetsuo Handa

Linus Torvalds wrote:
> On Mon, Jan 15, 2018 at 5:15 PM, Tetsuo Handa
>  wrote:
> >
> > I can't reproduce this with CONFIG_FLATMEM=y . But I'm not sure whether
> > we are hitting a bug in CONFIG_SPARSEMEM=y code, for the bug is highly
> > timing dependent.
> 
> Hmm. Maybe. But sparsemem really also generates *much* more complex
> code particularly for the pfn_to_page() case.

Since I got a faster reproducer, I tried full bisection between 4.11 and 
4.12-rc1.
But I have no idea why bisection arrives at c0332694903a37cf.

--
gcc -Wall -O3 -m32 -o /mnt/a.out -x c - << "EOF"
#include 
#include 

int main(int argc, char *argv[])
{
if (argc != 1) {
unsigned long long size;
char *buf = NULL;
unsigned long long i;
for (size = 1048576; size < 512ULL * (1 << 30); size += 
1048576) {
char *cp = realloc(buf, size);
if (!cp) {
size -= 1048576;
break;
}
buf = cp;
}
for (i = 0; i < size; i += 4096)
buf[i] = 0;
_exit(0);
} else
while (1) {
if (fork() == 0) {
execlp(argv[0], argv[0], "", NULL);
_exit(0);
}
sleep(1);
}
return 0;
}
EOF
--

--
# bad: [2ea659a9ef488125eb46da6eb571de5eae5c43f6] Linux 4.12-rc1
# good: [a351e9b9fc24e982ec2f0e76379a49826036da12] Linux 4.11
git bisect start 'HEAD' 'v4.11'
# good: [221656e7c4ce342b99c31eca96c1cbb6d1dce45f] Merge tag 'sound-4.12-rc1' 
of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
git bisect good 221656e7c4ce342b99c31eca96c1cbb6d1dce45f
# good: [c6a677c6f37bb7abc85ba7e3465e82b9f7eb1d91] Merge tag 'staging-4.12-rc1' 
of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
git bisect good c6a677c6f37bb7abc85ba7e3465e82b9f7eb1d91
# bad: [0ff4c01b279a590a2826ade9321ad8c7ca5a1b6c] Merge tag 'armsoc-arm64' of 
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
git bisect bad 0ff4c01b279a590a2826ade9321ad8c7ca5a1b6c
# bad: [8f3207c7eab9d885cc64c778416537034a7d9c5b] Merge tag 'tty-4.12-rc1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
git bisect bad 8f3207c7eab9d885cc64c778416537034a7d9c5b
# bad: [9c6ee01ed5bb1ee489d580eaa60d7eb5a8ede336] Merge branch 'for-linus' of 
git://git.armlinux.org.uk/~rmk/linux-arm
git bisect bad 9c6ee01ed5bb1ee489d580eaa60d7eb5a8ede336
# bad: [fe7a719b30dfdb4d55680461954b99b257ebe671] Merge branch 'for-next' of 
git://git.samba.org/sfrench/cifs-2.6
git bisect bad fe7a719b30dfdb4d55680461954b99b257ebe671
# good: [d557d1b58b3546bab2c5bc2d624c5709840e6b10] refcount: change 
EXPORT_SYMBOL markings
git bisect good d557d1b58b3546bab2c5bc2d624c5709840e6b10
# good: [8f720d9f892e0e223dab8400f03130bc208c72e7] xfs: publish UUID in struct 
super_block
git bisect good 8f720d9f892e0e223dab8400f03130bc208c72e7
# bad: [d173a25165c124442182f6b21d0c2ec381a0eebe] blk-mq: move debugfs 
declarations to a separate header file
git bisect bad d173a25165c124442182f6b21d0c2ec381a0eebe
# bad: [2719aa217e0d025dbfce74ac777815776ccec072] blk-mq: don't use sync 
workqueue flushing from drivers
git bisect bad 2719aa217e0d025dbfce74ac777815776ccec072
# bad: [9f2779bff2f178496fb00b89797734ee245d2c93] blk-mq-sched: remove hack 
that bypasses scheduler for reserved requests
git bisect bad 9f2779bff2f178496fb00b89797734ee245d2c93
# bad: [8afdd94c74e416de74a8ee61d79e4bf93466420b] mtip32xx: kill atomic 
argument to mtip_quiesce_io()
git bisect bad 8afdd94c74e416de74a8ee61d79e4bf93466420b
# bad: [0f6422a2c57c6afcf66ead903dc3fa6641184aa4] mtip32xx: get rid of 'atomic' 
argument to mtip_exec_internal_command()
git bisect bad 0f6422a2c57c6afcf66ead903dc3fa6641184aa4
# bad: [c0332694903a37cf8ecdc9102d5c9e09cf8643d0] block: Remove 
elevator_change()
git bisect bad c0332694903a37cf8ecdc9102d5c9e09cf8643d0
# first bad commit: [c0332694903a37cf8ecdc9102d5c9e09cf8643d0] block: Remove 
elevator_change()
--

I tried different routes from mm/sparse.c , but I feel I can't find the culprit.

--
# bad: [7660a6fddcbae344de8583aa4092071312f110c3] mm: allow slab_nomerge to be 
set at build time
# good: [60a7a88dbb9fc9adcca78a10a3ecf36966b5a45c] mm/sparse: refine 
usemap_size() a little
git bisect start '7660a6fddcbae344de8583aa4092071312f110c3' 
'60a7a88dbb9fc9adcca78a10a3ecf36966b5a45c'
# bad: [9786e34e0a6055dbd1b46e16dfa791ac2b3da289] Merge tag 
'for-linus-20170510' of git://git.infradead.org/linux-mtd
git bisect bad 9786e34e0a6055dbd1b46e16dfa791ac2b3da289
# good: [1062ae4982cabbf60f89b4e069fbb7def7edc8f7] Merge tag 
'drm-forgot-about-tegra-for-v4.12-rc1' of 
git://people.freedesktop.org/~airlied/linux
git bisect good 1062ae4982cabbf60f89b4e069fbb7def7edc8f

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-16 Thread Ingo Molnar

* Dave Hansen  wrote:

> Did anyone else notice the
> 
>   [   31.068198]  ? vmalloc_sync_all+0x150/0x150
> 
> present in a bunch of the stack traces?  That should be pretty uncommon.

I thikn that's pretty unusual:

>  Is it just part of the normal do_page_fault() stack and the stack
> dumper picks up on it?

No, it should only be called by register_die_notifier(), which is not part of 
the 
regular stack dump, AFAICS.

Thanks,

Ingo

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-16 Thread Dave Hansen

On 01/15/2018 06:14 PM, Linus Torvalds wrote:
> But I'm adding Dave Hansen explicitly to the cc, in case he has any
> ideas. Not because I blame him, but he's touched the sparsemem code
> fairly recently, so maybe he'd have some idea on adding sanity
> checking to the sparsemem version of pfn_to_page().

I swear I haven't touched it lately!

I'm not sure I'd go after pfn_to_page().  *Maybe* if we were close to
the places where we've done a pfn_to_page(), but I'm not seeing those.
These, for instance (from the January 5th post) have sane (~500MB) PFNs
and all BUG_ON() because of seeing the page being locked at free:

[  192.152510] BUG: Bad page state in process a.out  pfn:18566
[   77.872133] BUG: Bad page state in process a.out  pfn:1873a
[  188.992549] BUG: Bad page state in process a.out  pfn:197ea

and the page in all those cases came off a list, not out of a pte or
something that would need pfn_to_page().  The page fault path leading up
to the "EIP is at page_cache_tree_insert+0xbe/0xc0" probably doesn't
have a pfn_to_page() anywhere in there at all.

Did anyone else notice the

[   31.068198]  ? vmalloc_sync_all+0x150/0x150

present in a bunch of the stack traces?  That should be pretty uncommon.
 Is it just part of the normal do_page_fault() stack and the stack
dumper picks up on it?

A few things from earlier in this thread:

> [   44.103192] page:5a5a0697 count:-1055023618 mapcount:-1055030029 
> mapping:26f4be11 index:0xc11d7c83
> [   44.103196] flags: 
> 0xc10528fe(waiters|error|referenced|uptodate|dirty|lru|active|reserved|private_2|mappedtodisk|swapbacked)
> [   44.103200] raw: c10528fe c114fff7 c11d7c83 c11d84f2 c11d9dfe c11daa34 
> c11daaa0 c13e65df
> [   44.103201] raw: c13e4a1c c13e4c62
> [   44.103202] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) <= 0)
> [   44.103203] page->mem_cgroup:35401b27

Isn't that 'page:' a non-aligned address in userspace?  It's also weird
that you start dumping out kernel-looking addresses that came from
userspace addresses.  Which VM_SPLIT option are you running with, btw?

I'm still pretty stumped, though.

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-15 Thread Linus Torvalds

On Mon, Jan 15, 2018 at 5:15 PM, Tetsuo Handa
 wrote:
>
> I can't reproduce this with CONFIG_FLATMEM=y . But I'm not sure whether
> we are hitting a bug in CONFIG_SPARSEMEM=y code, for the bug is highly
> timing dependent.

Hmm. Maybe. But sparsemem really also generates *much* more complex
code particularly for the pfn_to_page() case.

It also has much less testing. For example, on x86-64 we do use
sparsemem, but we use the VMEMMAP version of sparsemem: the version
that does *not* play really odd and complex games with that whole
pfn_to_page().

I've always felt like sparsemem was really damn complicated.  The
whole "section_mem_map" encoding is really subtle and odd.

And considering that we're getting what appears to be a invalid page,
in one of the more complicated sequences that very much does that
whole pfn_to_page(), I really wonder.

I wonder if somebody could add some VM_BUG_ON() checks to the
non-vmemmap case of sparsemem in include/asm-generic/memory_model.h.

Because this:

  #define __pfn_to_page(pfn)  \
  ({  unsigned long __pfn = (pfn);\
  struct mem_section *__sec = __pfn_to_section(__pfn);\
  __section_mem_map_addr(__sec) + __pfn;  \
  })

is really subtle, and if we have some case where we pass in an
out-of-range pfn, or some case where we get the section wrong (because
the pfn is between sections or whatever due to some subtle setup bug),
things will really go sideways.

The reason I was hoping you could do this for FLATMEM is that it's
much easier to verify the pfn range in that case.  The sparsemem cases
really makes it much nastier.

That said, all of that code is really old. Most of it goes back to
-05/06 or so. But since you seem to be able to reproduce at least back
to 4.8, I guess this bug does back years too.

But I'm adding Dave Hansen explicitly to the cc, in case he has any
ideas. Not because I blame him, but he's touched the sparsemem code
fairly recently, so maybe he'd have some idea on adding sanity
checking to the sparsemem version of pfn_to_page().

> I dont know why but selecting CONFIG_FLATMEM=y seems to avoid a different bug
> where bootup of qemu randomly fails at

Hmm. That looks very different indeed. But if CONFIG_SPARSEMEM
(presumably together with HIGHMEM) has some odd off-by-one corner case
or similar, who knows *what* issues it could trigger.

 Linus

Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-15 Thread Tetsuo Handa

Linus Torvalds wrote:
> On Sun, Jan 14, 2018 at 3:54 AM, Tetsuo Handa
>  wrote:
> > This memory corruption bug occurs even on CONFIG_SMP=n CONFIG_PREEMPT_NONE=y
> > kernel. This bug highly depends on timing and thus too difficult to bisect.
> > This bug seems to exist at least since Linux 4.8 (judging from the traces, 
> > though
> > the cause might be different). None of debugging configuration gives me a 
> > clue.
> > So far only CONFIG_HIGHMEM=y CONFIG_DEBUG_PAGEALLOC=y kernel (with RAM 
> > enough to
> > use HighMem: zone) seems to hit this bug, but it might be just by chance 
> > caused
> > by timings. Thus, there is no evidence that 64bit kernels are not affected 
> > by
> > this bug. But I can't narrow down any more. Thus, I call for developers who 
> > can
> > narrow down / identify where the memory corruption bug is.
> 
> Hmm.
> 
> I guess I'm still hung up on the "it does not look like a valid
> 'struct page *'" thing.
> 
> Can you reproduce this with CONFIG_FLATMEM=y instead of CONFIG_SPARSEMEM?
> 
> Because if you can, I think we can easily add a few more pfn and
> 'struct page' validation debug statements. With SPARSEMEM, it gets
> pretty complicated because the whole struct page setup is much more
> complex.

I can't reproduce this with CONFIG_FLATMEM=y . But I'm not sure whether
we are hitting a bug in CONFIG_SPARSEMEM=y code, for the bug is highly
timing dependent.

--
# diff .config.old .config
372a373
> CONFIG_X86_SUPPORTS_MEMORY_FAILURE=y
462d462
< CONFIG_NEED_NODE_MEMMAP_SIZE=y
468,471c468,471
< # CONFIG_FLATMEM_MANUAL is not set
< CONFIG_SPARSEMEM_MANUAL=y
< CONFIG_SPARSEMEM=y
< CONFIG_HAVE_MEMORY_PRESENT=y
---
> CONFIG_FLATMEM_MANUAL=y
> # CONFIG_SPARSEMEM_MANUAL is not set
> CONFIG_FLATMEM=y
> CONFIG_FLAT_NODE_MEM_MAP=y
478d477
< # CONFIG_MEMORY_HOTPLUG is not set
486a486,487
> CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
> # CONFIG_MEMORY_FAILURE is not set
--

--
[0.00] Linux version 4.15.0-rc8 (root@localhost.localdomain) (gcc 
version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC)) #381 Tue Jan 16 09:38:22 JST 
2018
[0.00] x86/fpu: x87 FPU will use FXSAVE
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009fbff] usable
[0.00] BIOS-e820: [mem 0x0009fc00-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000f-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0x7fffbfff] usable
[0.00] BIOS-e820: [mem 0x7fffc000-0x7fff] reserved
[0.00] BIOS-e820: [mem 0xfffc-0x] reserved
[0.00] Notice: NX (Execute Disable) protection missing in CPU!
[0.00] random: fast init done
[0.00] SMBIOS 2.4 present.
[0.00] DMI: Red Hat KVM, BIOS 0.5.1 01/01/2011
[0.00] e820: last_pfn = 0x7fffc max_arch_pfn = 0x10
[0.00] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WC  UC- UC
[0.00] found SMP MP-table at [mem 0x000f7300-0x000f730f] mapped at 
[(ptrval)]
[0.00] Scanning 1 areas for low memory corruption
[0.00] ACPI: Early table checksum verification disabled
[0.00] ACPI: RSDP 0x000F7160 14 (v00 BOCHS )
[0.00] ACPI: RSDT 0x7A9B 30 (v01 BOCHS  BXPCRSDT 
0001 BXPC 0001)
[0.00] ACPI: FACP 0x7177 74 (v01 BOCHS  BXPCFACP 
0001 BXPC 0001)
[0.00] ACPI: DSDT 0x7FFFE040 001137 (v01 BOCHS  BXPCDSDT 
0001 BXPC 0001)
[0.00] ACPI: FACS 0x7FFFE000 40
[0.00] ACPI: SSDT 0x71EB 000838 (v01 BOCHS  BXPCSSDT 
0001 BXPC 0001)
[0.00] ACPI: APIC 0x7A23 78 (v01 BOCHS  BXPCAPIC 
0001 BXPC 0001)
[0.00] 1159MB HIGHMEM available.
[0.00] 887MB LOWMEM available.
[0.00]   mapped low ram: 0 - 377fe000
[0.00]   low ram: 0 - 377fe000
[0.00] tsc: Fast TSC calibration using PIT
[0.00] Zone ranges:
[0.00]   DMA  [mem 0x1000-0x00ff]
[0.00]   Normal   [mem 0x0100-0x377fdfff]
[0.00]   HighMem  [mem 0x377fe000-0x7fffbfff]
[0.00] Movable zone start for each node
[0.00] Early memory node ranges
[0.00]   node   0: [mem 0x1000-0x0009efff]
[0.00]   node   0: [mem 0x0010-0x7fffbfff]
[0.00] Initmem setup node 0 [mem 0x1000-0x7fffbfff]
[0.00] Reserved but unavailable: 98 pages
[0.00] Using APIC driver default
[0.00] ACPI: PM-Timer IO Port: 0x608
[0.00] ACPI: LAPIC_NMI (acpi_id[0xff] dfl dfl lint[0x1])
[0.00] IOAPIC[0]: apic_id 0 already used, trying 1
[0.00] IOAPIC[0]: apic_id 1, version 17, address 0xfec0, GSI 0-23
[0.00] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[

45 matches

Mail list logo