Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes

2013-07-18 Thread Joonsoo Kim
On Tue, Jul 16, 2013 at 08:01:46PM +1000, David Gibson wrote:
> On Tue, Jul 16, 2013 at 02:34:24PM +0900, Joonsoo Kim wrote:
> > On Mon, Jul 15, 2013 at 09:51:21PM -0400, Rik van Riel wrote:
> > > On 07/15/2013 03:24 AM, David Gibson wrote:
> > > >On Sun, Jul 14, 2013 at 08:16:44PM -0700, Davidlohr Bueso wrote:
> > > 
> > > >>>Reading the existing comment, this change looks very suspicious to me.
> > > >>>A per-vma mutex is just not going to provide the necessary exclusion, is
> > > >>>it?  (But I recall next to nothing about these regions and
> > > >>>reservations.)
> > > >
> > > >A per-VMA lock is definitely wrong.  I think it handles one form of
> > > >the race, between threads sharing a VM on a MAP_PRIVATE mapping.
> > > >However another form of the race can and does occur between different
> > > >MAP_SHARED VMAs in the same or different processes.  I think there may
> > > >be edge cases involving mremap() and MAP_PRIVATE that will also be
> > > >missed by a per-VMA lock.
> > > >
> > > >Note that the libhugetlbfs testsuite contains tests for both PRIVATE
> > > >and SHARED variants of the race.
> > > 
> > > Can we get away with simply using a mutex in the file?
> > > Say vma->vm_file->mapping->i_mmap_mutex?
> > 
> > I totally agree with this approach :)
> > 
> > > 
> > > That might help with multiple processes initializing
> > > multiple shared memory segments at the same time, and
> > > should not hurt the case of a process mapping its own
> > > hugetlbfs area.
> > > 
> > > It might have the potential to hurt when getting private
> > > copies on a MAP_PRIVATE area, though.  I have no idea
> > > how common it is for multiple processes to MAP_PRIVATE
> > > the same hugetlbfs file, though...
> > 
> > Currently, getting private copies on a MAP_PRIVATE area is also
> > serialized by hugetlb_instantiation_mutex.
> > How do we get worse with your approach?
> > 
> > BTW, we have one race problem related to the hugetlb_instantiation_mutex.
> > It is not the right protection for region structure handling. We map an
> > area without holding the hugetlb_instantiation_mutex, so there is a
> > race condition between mapping a new area and faulting another area.
> > Am I missing something?
> 
> The hugetlb_instantiation_mutex has nothing to do with protecting
> region structures.  It exists only to address one very specific and
> frequently misunderstood race.

Yes, it was introduced for that purpose, but, currently, it is also
used for protecting the region structures. You can see the comment
below in mm/hugetlb.c:

 * The region data structures are protected by a combination of the mmap_sem
 * and the hugetlb_instantion_mutex.  To access or modify a region the caller
 * must either hold the mmap_sem for write, or the mmap_sem for read and
 * the hugetlb_instantiation mutex:
 *
 *  down_write(&mm->mmap_sem);
 * or
 *  down_read(&mm->mmap_sem);
 *  mutex_lock(&hugetlb_instantiation_mutex);
 */

Thanks.

> 
> -- 
> David Gibson  | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au| minimalist, thank you.  NOT _the_ _other_
>   | _way_ _around_!
> http://www.ozlabs.org/~dgibson


Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes

2013-07-16 Thread David Gibson
On Tue, Jul 16, 2013 at 02:34:24PM +0900, Joonsoo Kim wrote:
> On Mon, Jul 15, 2013 at 09:51:21PM -0400, Rik van Riel wrote:
> > On 07/15/2013 03:24 AM, David Gibson wrote:
> > >On Sun, Jul 14, 2013 at 08:16:44PM -0700, Davidlohr Bueso wrote:
> > 
> > >>>Reading the existing comment, this change looks very suspicious to me.
> > >>>A per-vma mutex is just not going to provide the necessary exclusion, is
> > >>>it?  (But I recall next to nothing about these regions and
> > >>>reservations.)
> > >
> > >A per-VMA lock is definitely wrong.  I think it handles one form of
> > >the race, between threads sharing a VM on a MAP_PRIVATE mapping.
> > >However another form of the race can and does occur between different
> > >MAP_SHARED VMAs in the same or different processes.  I think there may
> > >be edge cases involving mremap() and MAP_PRIVATE that will also be
> > >missed by a per-VMA lock.
> > >
> > >Note that the libhugetlbfs testsuite contains tests for both PRIVATE
> > >and SHARED variants of the race.
> > 
> > Can we get away with simply using a mutex in the file?
> > Say vma->vm_file->mapping->i_mmap_mutex?
> 
> I totally agree with this approach :)
> 
> > 
> > That might help with multiple processes initializing
> > multiple shared memory segments at the same time, and
> > should not hurt the case of a process mapping its own
> > hugetlbfs area.
> > 
> > It might have the potential to hurt when getting private
> > copies on a MAP_PRIVATE area, though.  I have no idea
> > how common it is for multiple processes to MAP_PRIVATE
> > the same hugetlbfs file, though...
> 
> Currently, getting private copies on a MAP_PRIVATE area is also
> serialized by hugetlb_instantiation_mutex.
> How do we get worse with your approach?
> 
> BTW, we have one race problem related to hugetlb_instantiation_mutex.
> It is not right protection for region structure handling. We map the
> area without holding a hugetlb_instantiation_mutex, so there is
> race condition between mapping a new area and faulting the other area.
> Am I missing?

The hugetlb_instantiation_mutex has nothing to do with protecting
region structures.  It exists only to address one very specific and
frequently misunderstood race.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes

2013-07-16 Thread David Gibson
On Mon, Jul 15, 2013 at 09:51:21PM -0400, Rik van Riel wrote:
> On 07/15/2013 03:24 AM, David Gibson wrote:
> >On Sun, Jul 14, 2013 at 08:16:44PM -0700, Davidlohr Bueso wrote:
> 
> >>>Reading the existing comment, this change looks very suspicious to me.
> >>>A per-vma mutex is just not going to provide the necessary exclusion, is
> >>>it?  (But I recall next to nothing about these regions and
> >>>reservations.)
> >
> >A per-VMA lock is definitely wrong.  I think it handles one form of
> >the race, between threads sharing a VM on a MAP_PRIVATE mapping.
> >However another form of the race can and does occur between different
> >MAP_SHARED VMAs in the same or different processes.  I think there may
> >be edge cases involving mremap() and MAP_PRIVATE that will also be
> >missed by a per-VMA lock.
> >
> >Note that the libhugetlbfs testsuite contains tests for both PRIVATE
> >and SHARED variants of the race.
> 
> Can we get away with simply using a mutex in the file?
> Say vma->vm_file->mapping->i_mmap_mutex?

So I don't know the VM well enough to know if this could conflict with
other usages of i_mmap_mutex.  But unfortunately, whether or not it's
otherwise correct, that approach won't address the scalability issue at
hand here.

In the case where the race matters, we're always dealing with the same
file.  Otherwise, we'd end up with a genuine, rather than spurious,
out-of-memory error, regardless of how the race turned out.

So in the case with the performance bottleneck we're considering, the
i_mmap_mutex approach degenerates to serialization on a single mutex,
just as before.

In order to improve scalability, we need to consider which page within
the file we're instantiating, which is what the hash function in my
patch does.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes

2013-07-16 Thread David Gibson
On Mon, Jul 15, 2013 at 05:12:31PM -0700, Davidlohr Bueso wrote:
> On Mon, 2013-07-15 at 16:08 -0700, Andrew Morton wrote:
> > On Mon, 15 Jul 2013 17:24:32 +1000 David Gibson
> > <da...@gibson.dropbear.id.au> wrote:
> > 
> > > I have previously proposed a correct method of improving scalability,
> > > although it doesn't eliminate the lock.  That's to use a set of hashed
> > > mutexes.
> > 
> > Yep - hashing the mutexes is an obvious and nicely localized way of
> > improving this.  It's a tweak, not a design change.
> > 
> > The changelog should describe the choice of the hash key with great
> > precision, please.  It's important and is the first thing which
> > reviewers and readers will zoom in on.

Yeah, that is important.

I no longer have much interest in the result of this patch, so I'll
leave it to others to do the forward port and cleanup.

But I will point out the gotcha here is that the hash key needs to be
based on (address_space & file offset) for MAP_SHARED, but (mm &
address) for MAP_PRIVATE.
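
As a rough sketch of that keying (the helper name, table size and hash
below are illustrative only, not from any posted patch):

#include <linux/hash.h>
#include <linux/mm.h>
#include <linux/mutex.h>

#define FAULT_MUTEX_BITS	8
/* entries are initialized with mutex_init() at boot (not shown) */
static struct mutex fault_mutex_table[1 << FAULT_MUTEX_BITS];

static struct mutex *fault_mutex(struct vm_area_struct *vma,
				 struct address_space *mapping,
				 pgoff_t idx, unsigned long address)
{
	unsigned long key;

	if (vma->vm_flags & VM_SHARED)
		/* shared: same file + offset must pick the same mutex,
		 * whichever process or VMA the fault comes through */
		key = (unsigned long)mapping ^ idx;
	else
		/* private: same mm + virtual address must pick the same mutex */
		key = (unsigned long)vma->vm_mm ^ (address >> HPAGE_SHIFT);

	return &fault_mutex_table[hash_long(key, FAULT_MUTEX_BITS)];
}

Any function of those inputs would do, as long as the shared-mapping key
never mixes in per-vma or per-mm state.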

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes

2013-07-15 Thread Joonsoo Kim
On Mon, Jul 15, 2013 at 09:51:21PM -0400, Rik van Riel wrote:
> On 07/15/2013 03:24 AM, David Gibson wrote:
> >On Sun, Jul 14, 2013 at 08:16:44PM -0700, Davidlohr Bueso wrote:
> 
> >>>Reading the existing comment, this change looks very suspicious to me.
> >>>A per-vma mutex is just not going to provide the necessary exclusion, is
> >>>it?  (But I recall next to nothing about these regions and
> >>>reservations.)
> >
> >A per-VMA lock is definitely wrong.  I think it handles one form of
> >the race, between threads sharing a VM on a MAP_PRIVATE mapping.
> >However another form of the race can and does occur between different
> >MAP_SHARED VMAs in the same or different processes.  I think there may
> >be edge cases involving mremap() and MAP_PRIVATE that will also be
> >missed by a per-VMA lock.
> >
> >Note that the libhugetlbfs testsuite contains tests for both PRIVATE
> >and SHARED variants of the race.
> 
> Can we get away with simply using a mutex in the file?
> Say vma->vm_file->mapping->i_mmap_mutex?

I totally agree with this approach :)

> 
> That might help with multiple processes initializing
> multiple shared memory segments at the same time, and
> should not hurt the case of a process mapping its own
> hugetlbfs area.
> 
> It might have the potential to hurt when getting private
> copies on a MAP_PRIVATE area, though.  I have no idea
> how common it is for multiple processes to MAP_PRIVATE
> the same hugetlbfs file, though...

Currently, getting private copies on a MAP_PRIVATE area is also
serialized by hugetlb_instantiation_mutex.
How do we get worse with your approach?

BTW, we have one race problem related to the hugetlb_instantiation_mutex.
It is not the right protection for region structure handling. We map an
area without holding the hugetlb_instantiation_mutex, so there is a
race condition between mapping a new area and faulting another area.
Am I missing something?

Thanks.

> 
> -- 
> All rights reversed
> 


Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes

2013-07-15 Thread Rik van Riel

On 07/15/2013 03:24 AM, David Gibson wrote:

>On Sun, Jul 14, 2013 at 08:16:44PM -0700, Davidlohr Bueso wrote:

>>>Reading the existing comment, this change looks very suspicious to me.
>>>A per-vma mutex is just not going to provide the necessary exclusion, is
>>>it?  (But I recall next to nothing about these regions and
>>>reservations.)

>A per-VMA lock is definitely wrong.  I think it handles one form of
>the race, between threads sharing a VM on a MAP_PRIVATE mapping.
>However another form of the race can and does occur between different
>MAP_SHARED VMAs in the same or different processes.  I think there may
>be edge cases involving mremap() and MAP_PRIVATE that will also be
>missed by a per-VMA lock.
>
>Note that the libhugetlbfs testsuite contains tests for both PRIVATE
>and SHARED variants of the race.


Can we get away with simply using a mutex in the file?
Say vma->vm_file->mapping->i_mmap_mutex?

That might help with multiple processes initializing
multiple shared memory segments at the same time, and
should not hurt the case of a process mapping its own
hugetlbfs area.

It might have the potential to hurt when getting private
copies on a MAP_PRIVATE area, though.  I have no idea
how common it is for multiple processes to MAP_PRIVATE
the same hugetlbfs file, though...
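
As a sketch of what that swap would look like inside hugetlb_fault()
(illustrative only, not a posted patch; f_mapping below is the file's
address_space that the vm_file->mapping shorthand above refers to):

-	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
 ...
-	mutex_lock(&hugetlb_instantiation_mutex);
+	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
 ...
-	mutex_unlock(&hugetlb_instantiation_mutex);
+	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);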

--
All rights reversed


Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes

2013-07-15 Thread Davidlohr Bueso
On Mon, 2013-07-15 at 16:08 -0700, Andrew Morton wrote:
> On Mon, 15 Jul 2013 17:24:32 +1000 David Gibson <da...@gibson.dropbear.id.au>
> wrote:
> 
> > I have previously proposed a correct method of improving scalability,
> > although it doesn't eliminate the lock.  That's to use a set of hashed
> > mutexes.
> 
> Yep - hashing the mutexes is an obvious and nicely localized way of
> improving this.  It's a tweak, not a design change.
> 
> The changelog should describe the choice of the hash key with great
> precision, please.  It's important and is the first thing which
> reviewers and readers will zoom in on.
> 
> Should the individual mutexes be cacheline aligned?  Depends on the
> acquisition frequency, I guess.  Please let's work through that.

In my test cases, involving different RDBMS, I'm getting around 114k
acquisitions.

> 
> Let's not damage uniprocessor kernels too much.  AFAICT the main offender
> here is fault_mutex_hash(), which is the world's most obfuscated "return
> 0;".

I guess we could add an ifndef CONFIG_SMP check to the function and
return 0 right away. That would eliminate any overhead in
fault_mutex_hash().
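
Something along these lines, as a sketch (fault_mutex_hash() below stands
in for the hashed-mutex helper being discussed; the key and table size are
simplified for illustration):

static u32 fault_mutex_hash(struct address_space *mapping, pgoff_t idx)
{
#ifdef CONFIG_SMP
	/* SMP: spread faults across the mutex table by (mapping, offset) */
	return hash_long((unsigned long)mapping ^ idx, 8);
#else
	/* UP: the table has a single mutex, so skip the hash entirely */
	return 0;
#endif
}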

> 
> >  It wasn't merged before, but I don't recall the reasons
> > why. 

So I've forward-ported the patch (will send it once everyone agrees that
the matter is settled), including the changes Anton Blanchard added
exactly two years ago:

https://lkml.org/lkml/2011/7/15/31

My tests show that the number of lock contentions drops from ~11k to
around 500. So this approach alleviates a lot of the bottleneck. I've
also run it against libhugetlbfs without any regressions.

Thanks,
Davidlohr



Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes

2013-07-15 Thread Andrew Morton
On Mon, 15 Jul 2013 17:24:32 +1000 David Gibson <da...@gibson.dropbear.id.au>
wrote:

> I have previously proposed a correct method of improving scalability,
> although it doesn't eliminate the lock.  That's to use a set of hashed
> mutexes.

Yep - hashing the mutexes is an obvious and nicely localized way of
improving this.  It's a tweak, not a design change.

The changelog should describe the choice of the hash key with great
precision, please.  It's important and is the first thing which
reviewers and readers will zoom in on.

Should the individual mutexes be cacheline aligned?  Depends on the
acquisition frequency, I guess.  Please let's work through that.
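
One way to do that, as a sketch (the struct and array names are invented
for illustration): pad each mutex out to its own cache line, at the cost
of some memory, so contention on one hash bucket does not bounce the line
holding its neighbours:

#include <linux/cache.h>
#include <linux/mutex.h>

struct fault_mutex {
	struct mutex lock;
} ____cacheline_aligned_in_smp;

static struct fault_mutex fault_mutex_table[256];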

Let's not damage uniprocessor kernels too much.  AFAICT the main offender
here is fault_mutex_hash(), which is the world's most obfuscated "return
0;".

>  It wasn't merged before, but I don't recall the reasons
> why. 

Me either.


Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes

2013-07-15 Thread David Gibson
On Sun, Jul 14, 2013 at 08:16:44PM -0700, Davidlohr Bueso wrote:
> On Fri, 2013-07-12 at 17:54 -0700, Hugh Dickins wrote:
> > Adding the essential David Gibson to the Cc list.
> > 
> > On Fri, 12 Jul 2013, Davidlohr Bueso wrote:
> > 
> > > The hugetlb_instantiation_mutex serializes hugepage allocation and 
> > > instantiation
> > > in the page directory entry. It was found that this mutex can become 
> > > quite contended
> > > during the early phases of large databases which make use of huge pages - 
> > > for instance
> > > startup and initial runs. One clear example is a 1.5Gb Oracle database, 
> > > where lockstat
> > > reports that this mutex can be one of the top 5 most contended locks in 
> > > the kernel during
> > > the first few minutes:
> > > 
> > > hugetlb_instantiation_mutex:  10678 10678
> > >  ---
> > >  hugetlb_instantiation_mutex10678  [<8115e14e>] hugetlb_fault+0x9e/0x340
> > >  ---
> > >  hugetlb_instantiation_mutex10678  [<8115e14e>] hugetlb_fault+0x9e/0x340
> > > 
> > > contentions:  10678
> > > acquisitions: 99476
> > > waittime-total: 76888911.01 us
> > > 
> > > Instead of serializing each hugetlb fault, we can deal with concurrent 
> > > faults for pages
> > > in different vmas. The per-vma mutex is initialized when creating a new 
> > > vma. So, back to
> > > the example above, we now get much less contention:
> > > 
> > >  &vma->hugetlb_instantiation_mutex:  1 1
> > >-
> > >&vma->hugetlb_instantiation_mutex   1   [<8115e216>] hugetlb_fault+0xa6/0x350
> > >-
> > >&vma->hugetlb_instantiation_mutex   1[<8115e216>] hugetlb_fault+0xa6/0x350
> > > 
> > > contentions:  1
> > > acquisitions:108092
> > > waittime-total:  621.24 us
> > > 
> > > Signed-off-by: Davidlohr Bueso 
> > 
> > I agree this is a problem worth solving,
> > but I doubt this patch is the right solution.

It's not.

> > > ---
> > >  include/linux/mm_types.h |  3 +++
> > >  mm/hugetlb.c | 12 +---
> > >  mm/mmap.c|  3 +++
> > >  3 files changed, 11 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index fb425aa..b45fd87 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -289,6 +289,9 @@ struct vm_area_struct {
> > >  #ifdef CONFIG_NUMA
> > >   struct mempolicy *vm_policy;/* NUMA policy for the VMA */
> > >  #endif
> > > +#ifdef CONFIG_HUGETLB_PAGE
> > > + struct mutex hugetlb_instantiation_mutex;
> > > +#endif
> > >  };
> > 
> > Bloating every vm_area_struct with a rarely useful mutex:
> > I'm sure you can construct cases where per-vma mutex would win over
> > per-mm mutex, but they will have to be very common to justify the bloat.
> > 
> 
> I cannot disagree here, this was my main concern about this patch, and,
> as you mentioned, if we can just get rid of the need for the lock, much
> better.

So, by all means try to get rid of the mutex, but doing so is
surprisingly difficult - I made several failed attempts a long time
ago, before giving up and creating the mutex in the first place.

Note that there is no analogous handling in the normal page case,
because for normal pages we always assume we have a few pages of
"slush" and can temporarily overallocate without problems.  Because
hugepages are a more precious resource, it's usual to want to allocate
and use every single one - spurious OOMs when racing to allocate the
very last hugepage are what the mutex protects against.
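
Spelled out as an interleaving (illustrative only), with one free hugepage
left and no serialization:

	CPU A (fault on page P)              CPU B (fault on same page P)
	-----------------------              ----------------------------
	huge_ptep_get() -> pte_none
	                                     huge_ptep_get() -> pte_none
	alloc_huge_page() -> last free page
	                                     alloc_huge_page() -> NULL
	set PTE for P, return                return VM_FAULT_OOM (spurious:
	                                     P was already being instantiated)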

> > >  struct core_thread {
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index 83aff0a..12e665b 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -137,12 +137,12 @@ static inline struct hugepage_subpool 
> > > *subpool_vma(struct vm_area_struct *vma)
> > >   * The region data structures are protected by a combination of the 
> > > mmap_sem
> > >   * and the hugetlb_instantion_mutex.  To access or modify a region the 
> > > caller
> > >   * must either hold the mmap_sem for write, or the mmap_sem for read and
> > > - * the hugetlb_instantiation mutex:
> > > + * the vma's hugetlb_instantiation mutex:
> > 
> > Reading the existing comment, this change looks very suspicious to me.
> > A per-vma mutex is just not going to provide the necessary exclusion, is
> > it?  (But I recall next to nothing about these regions and
> > reservations.)

A per-VMA lock is definitely wrong.  I think it handles one form of
the race, between threads sharing a VM on a MAP_PRIVATE mapping.
However another form of the race can and does occur between different
MAP_SHARED VMAs in the same or different processes.  I think there may
be edge cases involving mremap() and MAP_PRIVATE that will also be
missed by a per-VMA lock.

Note that the libhugetlbfs testsuite contains tests for both PRIVATE
and SHARED variants of the race.

Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes

2013-07-14 Thread Konstantin Khlebnikov

This seems incorrect. hugetlb_instantiation_mutex protects chains of struct 
file_region
in inode->i_mapping->private_list (VM_MAYSHARE) or vma_resv_map(vma)->regions 
(!VM_MAYSHARE)
These chains obviously can be shared between several vmas, so a per-vma lock
cannot protect them.
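
As a sketch of where those chains hang off (the helper below is invented
purely for illustration; vma_resv_map() is the existing static helper in
mm/hugetlb.c):

/* Illustrative only: the two places a VMA's file_region chain can live. */
static struct list_head *hugetlb_region_chain(struct vm_area_struct *vma)
{
	if (vma->vm_flags & VM_MAYSHARE)
		/* shared: one chain per inode, reached by every VMA mapping the file */
		return &vma->vm_file->f_mapping->private_list;

	/* private: the chain hangs off this mapping's reservation map */
	return &vma_resv_map(vma)->regions;
}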

Davidlohr Bueso wrote:

The hugetlb_instantiation_mutex serializes hugepage allocation and instantiation
in the page directory entry. It was found that this mutex can become quite 
contended
during the early phases of large databases which make use of huge pages - for 
instance
startup and initial runs. One clear example is a 1.5Gb Oracle database, where 
lockstat
reports that this mutex can be one of the top 5 most contended locks in the 
kernel during
the first few minutes:

hugetlb_instantiation_mutex:  10678 10678
  ---
  hugetlb_instantiation_mutex10678  [<8115e14e>] hugetlb_fault+0x9e/0x340
  ---
  hugetlb_instantiation_mutex10678  [<8115e14e>] hugetlb_fault+0x9e/0x340

contentions:  10678
acquisitions: 99476
waittime-total: 76888911.01 us

Instead of serializing each hugetlb fault, we can deal with concurrent faults 
for pages
in different vmas. The per-vma mutex is initialized when creating a new vma. 
So, back to
the example above, we now get much less contention:

  &vma->hugetlb_instantiation_mutex:  1 1
 -
 &vma->hugetlb_instantiation_mutex   1   [<8115e216>] hugetlb_fault+0xa6/0x350
 -
 &vma->hugetlb_instantiation_mutex   1[<8115e216>] hugetlb_fault+0xa6/0x350

contentions:  1
acquisitions:108092
waittime-total:  621.24 us

Signed-off-by: Davidlohr Bueso
---
  include/linux/mm_types.h |  3 +++
  mm/hugetlb.c | 12 +---
  mm/mmap.c|  3 +++
  3 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fb425aa..b45fd87 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -289,6 +289,9 @@ struct vm_area_struct {
  #ifdef CONFIG_NUMA
struct mempolicy *vm_policy;/* NUMA policy for the VMA */
  #endif
+#ifdef CONFIG_HUGETLB_PAGE
+   struct mutex hugetlb_instantiation_mutex;
+#endif
  };

  struct core_thread {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 83aff0a..12e665b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -137,12 +137,12 @@ static inline struct hugepage_subpool *subpool_vma(struct 
vm_area_struct *vma)
   * The region data structures are protected by a combination of the mmap_sem
   * and the hugetlb_instantion_mutex.  To access or modify a region the caller
   * must either hold the mmap_sem for write, or the mmap_sem for read and
- * the hugetlb_instantiation mutex:
+ * the vma's hugetlb_instantiation mutex:
   *
   * down_write(&mm->mmap_sem);
   * or
   * down_read(&mm->mmap_sem);
- * mutex_lock(&hugetlb_instantiation_mutex);
+ * mutex_lock(&vma->hugetlb_instantiation_mutex);
   */
  struct file_region {
struct list_head link;
@@ -2547,7 +2547,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct 
vm_area_struct *vma,

  /*
   * Hugetlb_cow() should be called with page lock of the original hugepage 
held.
- * Called with hugetlb_instantiation_mutex held and pte_page locked so we
+ * Called with the vma's hugetlb_instantiation_mutex held and pte_page locked 
so we
   * cannot race with other handlers or page migration.
   * Keep the pte_same checks anyway to make transition from the mutex easier.
   */
@@ -2847,7 +2847,6 @@ int hugetlb_fault(struct mm_struct *mm, struct 
vm_area_struct *vma,
int ret;
struct page *page = NULL;
struct page *pagecache_page = NULL;
-   static DEFINE_MUTEX(hugetlb_instantiation_mutex);
struct hstate *h = hstate_vma(vma);

address &= huge_page_mask(h);
@@ -2872,7 +2871,7 @@ int hugetlb_fault(struct mm_struct *mm, struct 
vm_area_struct *vma,
 * get spurious allocation failures if two CPUs race to instantiate
 * the same page in the page cache.
 */
-   mutex_lock(&hugetlb_instantiation_mutex);
+   mutex_lock(&vma->hugetlb_instantiation_mutex);
entry = huge_ptep_get(ptep);
if (huge_pte_none(entry)) {
ret = hugetlb_no_page(mm, vma, address, ptep, flags);
@@ -2943,8 +2942,7 @@ out_page_table_lock:
put_page(page);

  out_mutex:
-   mutex_unlock(&hugetlb_instantiation_mutex);
-
+   mutex_unlock(&vma->hugetlb_instantiation_mutex);
return ret;
  }

diff --git a/mm/mmap.c b/mm/mmap.c
index fbad7b0..8f0b034 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1543,6 +1543,9 @@ munmap_back:
vma->vm_page_prot = vm_get_page_prot(vm_flags);
vma->vm_pgoff = pgoff;
INIT_LIST_HEAD(&vma->anon_vma_chain);
+#ifdef CONFIG_HUGETLB_PAGE
+   mutex_init(&vma->hugetlb_instantiation_mutex);
+#endif

error = -EINVAL;/* when rejecting 

Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes

2013-07-14 Thread Davidlohr Bueso
On Fri, 2013-07-12 at 17:54 -0700, Hugh Dickins wrote:
> Adding the essential David Gibson to the Cc list.
> 
> On Fri, 12 Jul 2013, Davidlohr Bueso wrote:
> 
> > The hugetlb_instantiation_mutex serializes hugepage allocation and 
> > instantiation
> > in the page directory entry. It was found that this mutex can become quite 
> > contended
> > during the early phases of large databases which make use of huge pages - 
> > for instance
> > startup and initial runs. One clear example is a 1.5Gb Oracle database, 
> > where lockstat
> > reports that this mutex can be one of the top 5 most contended locks in the 
> > kernel during
> > the first few minutes:
> > 
> > hugetlb_instantiation_mutex:  10678 10678
> >  ---
> >  hugetlb_instantiation_mutex10678  [<8115e14e>] hugetlb_fault+0x9e/0x340
> >  ---
> >  hugetlb_instantiation_mutex10678  [<8115e14e>] hugetlb_fault+0x9e/0x340
> > 
> > contentions:  10678
> > acquisitions: 99476
> > waittime-total: 76888911.01 us
> > 
> > Instead of serializing each hugetlb fault, we can deal with concurrent 
> > faults for pages
> > in different vmas. The per-vma mutex is initialized when creating a new 
> > vma. So, back to
> > the example above, we now get much less contention:
> > 
> >  &vma->hugetlb_instantiation_mutex:  1 1
> >-
> >&vma->hugetlb_instantiation_mutex   1   [<8115e216>] hugetlb_fault+0xa6/0x350
> >-
> >&vma->hugetlb_instantiation_mutex   1[<8115e216>] hugetlb_fault+0xa6/0x350
> > 
> > contentions:  1
> > acquisitions:108092
> > waittime-total:  621.24 us
> > 
> > Signed-off-by: Davidlohr Bueso 
> 
> I agree this is a problem worth solving,
> but I doubt this patch is the right solution.
> 
> > ---
> >  include/linux/mm_types.h |  3 +++
> >  mm/hugetlb.c | 12 +---
> >  mm/mmap.c|  3 +++
> >  3 files changed, 11 insertions(+), 7 deletions(-)
> > 
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index fb425aa..b45fd87 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -289,6 +289,9 @@ struct vm_area_struct {
> >  #ifdef CONFIG_NUMA
> > struct mempolicy *vm_policy;/* NUMA policy for the VMA */
> >  #endif
> > +#ifdef CONFIG_HUGETLB_PAGE
> > +   struct mutex hugetlb_instantiation_mutex;
> > +#endif
> >  };
> 
> Bloating every vm_area_struct with a rarely useful mutex:
> I'm sure you can construct cases where per-vma mutex would win over
> per-mm mutex, but they will have to be very common to justify the bloat.
> 

I cannot disagree here, this was my main concern about this patch, and,
as you mentioned, if we can just get rid of the need for the lock, much
better.

> >  
> >  struct core_thread {
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 83aff0a..12e665b 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -137,12 +137,12 @@ static inline struct hugepage_subpool 
> > *subpool_vma(struct vm_area_struct *vma)
> >   * The region data structures are protected by a combination of the 
> > mmap_sem
> >   * and the hugetlb_instantion_mutex.  To access or modify a region the 
> > caller
> >   * must either hold the mmap_sem for write, or the mmap_sem for read and
> > - * the hugetlb_instantiation mutex:
> > + * the vma's hugetlb_instantiation mutex:
> 
> Reading the existing comment, this change looks very suspicious to me.
> A per-vma mutex is just not going to provide the necessary exclusion, is
> it?  (But I recall next to nothing about these regions and reservations.)
> 
> >   *
> >   * down_write(&mm->mmap_sem);
> >   * or
> >   * down_read(&mm->mmap_sem);
> > - * mutex_lock(&hugetlb_instantiation_mutex);
> > + * mutex_lock(&vma->hugetlb_instantiation_mutex);
> >   */
> >  struct file_region {
> > struct list_head link;
> > @@ -2547,7 +2547,7 @@ static int unmap_ref_private(struct mm_struct *mm, 
> > struct vm_area_struct *vma,
> >  
> >  /*
> >   * Hugetlb_cow() should be called with page lock of the original hugepage 
> > held.
> > - * Called with hugetlb_instantiation_mutex held and pte_page locked so we
> > + * Called with the vma's hugetlb_instantiation_mutex held and pte_page 
> > locked so we
> >   * cannot race with other handlers or page migration.
> >   * Keep the pte_same checks anyway to make transition from the mutex 
> > easier.
> >   */
> > @@ -2847,7 +2847,6 @@ int hugetlb_fault(struct mm_struct *mm, struct 
> > vm_area_struct *vma,
> > int ret;
> > struct page *page = NULL;
> > struct page *pagecache_page = NULL;
> > -   static DEFINE_MUTEX(hugetlb_instantiation_mutex);
> > struct hstate *h = hstate_vma(vma);
> >  
> > address &= huge_page_mask(h);
> > @@ -2872,7 +2871,7 @@ int hugetlb_fault(struct mm_struct *mm, struct 
> > vm_area_struct *vma,
> >  * get spurious allocation failures if two CPUs race to instantiate
> >  * the same page in the page cache.
> >  */

Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes

2013-07-14 Thread Davidlohr Bueso
On Fri, 2013-07-12 at 17:54 -0700, Hugh Dickins wrote:
 Adding the essential David Gibson to the Cc list.
 
 On Fri, 12 Jul 2013, Davidlohr Bueso wrote:
 
  The hugetlb_instantiation_mutex serializes hugepage allocation and 
  instantiation
  in the page directory entry. It was found that this mutex can become quite 
  contended
  during the early phases of large databases which make use of huge pages - 
  for instance
  startup and initial runs. One clear example is a 1.5Gb Oracle database, 
  where lockstat
  reports that this mutex can be one of the top 5 most contended locks in the 
  kernel during
  the first few minutes:
  
  hugetlb_instantiation_mutex:  10678 10678
   ---
   hugetlb_instantiation_mutex10678  [8115e14e] 
  hugetlb_fault+0x9e/0x340
   ---
   hugetlb_instantiation_mutex10678  [8115e14e] 
  hugetlb_fault+0x9e/0x340
  
  contentions:  10678
  acquisitions: 99476
  waittime-total: 76888911.01 us
  
  Instead of serializing each hugetlb fault, we can deal with concurrent 
  faults for pages
  in different vmas. The per-vma mutex is initialized when creating a new 
  vma. So, back to
  the example above, we now get much less contention:
  
   vma-hugetlb_instantiation_mutex:  1 1
 -
 vma-hugetlb_instantiation_mutex   1   [8115e216] 
  hugetlb_fault+0xa6/0x350
 -
 vma-hugetlb_instantiation_mutex   1[8115e216] 
  hugetlb_fault+0xa6/0x350
  
  contentions:  1
  acquisitions:108092
  waittime-total:  621.24 us
  
  Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com
 
 I agree this is a problem worth solving,
 but I doubt this patch is the right solution.
 
  ---
   include/linux/mm_types.h |  3 +++
   mm/hugetlb.c | 12 +---
   mm/mmap.c|  3 +++
   3 files changed, 11 insertions(+), 7 deletions(-)
  
  diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
  index fb425aa..b45fd87 100644
  --- a/include/linux/mm_types.h
  +++ b/include/linux/mm_types.h
  @@ -289,6 +289,9 @@ struct vm_area_struct {
   #ifdef CONFIG_NUMA
  struct mempolicy *vm_policy;/* NUMA policy for the VMA */
   #endif
  +#ifdef CONFIG_HUGETLB_PAGE
  +   struct mutex hugetlb_instantiation_mutex;
  +#endif
   };
 
 Bloating every vm_area_struct with a rarely useful mutex:
 I'm sure you can construct cases where per-vma mutex would win over
 per-mm mutex, but they will have to be very common to justify the bloat.
 

I cannot disagree here, this was my main concern about this patch, and,
as you mentioned, if we can just get rid of the need for the lock, much
better.

   
   struct core_thread {
  diff --git a/mm/hugetlb.c b/mm/hugetlb.c
  index 83aff0a..12e665b 100644
  --- a/mm/hugetlb.c
  +++ b/mm/hugetlb.c
  @@ -137,12 +137,12 @@ static inline struct hugepage_subpool 
  *subpool_vma(struct vm_area_struct *vma)
* The region data structures are protected by a combination of the 
  mmap_sem
* and the hugetlb_instantion_mutex.  To access or modify a region the 
  caller
* must either hold the mmap_sem for write, or the mmap_sem for read and
  - * the hugetlb_instantiation mutex:
  + * the vma's hugetlb_instantiation mutex:
 
 Reading the existing comment, this change looks very suspicious to me.
 A per-vma mutex is just not going to provide the necessary exclusion, is
 it?  (But I recall next to nothing about these regions and reservations.)
 
*
* down_write(mm-mmap_sem);
* or
* down_read(mm-mmap_sem);
  - * mutex_lock(hugetlb_instantiation_mutex);
  + * mutex_lock(vma-hugetlb_instantiation_mutex);
*/
   struct file_region {
  struct list_head link;
  @@ -2547,7 +2547,7 @@ static int unmap_ref_private(struct mm_struct *mm, 
  struct vm_area_struct *vma,
   
   /*
* Hugetlb_cow() should be called with page lock of the original hugepage 
  held.
  - * Called with hugetlb_instantiation_mutex held and pte_page locked so we
  + * Called with the vma's hugetlb_instantiation_mutex held and pte_page 
  locked so we
* cannot race with other handlers or page migration.
* Keep the pte_same checks anyway to make transition from the mutex 
  easier.
*/
  @@ -2847,7 +2847,6 @@ int hugetlb_fault(struct mm_struct *mm, struct 
  vm_area_struct *vma,
  int ret;
  struct page *page = NULL;
  struct page *pagecache_page = NULL;
  -   static DEFINE_MUTEX(hugetlb_instantiation_mutex);
  struct hstate *h = hstate_vma(vma);
   
  address = huge_page_mask(h);
  @@ -2872,7 +2871,7 @@ int hugetlb_fault(struct mm_struct *mm, struct 
  vm_area_struct *vma,
   * get spurious allocation failures if two CPUs race to instantiate
   * the same page in the page cache.
   */
  -   mutex_lock(hugetlb_instantiation_mutex);
  +   

Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes

2013-07-14 Thread Konstantin Khlebnikov

This seems incorrect. hugetlb_instantiation_mutex protects chains of struct 
file_region
in inode-i_mapping-private_list (VM_MAYSHARE) or vma_resv_map(vma)-regions 
(!VM_MAYSHARE)
These chains obviously can be shared between several vmas, so per-vma lock 
cannot protect them.

Davidlohr Bueso wrote:

The hugetlb_instantiation_mutex serializes hugepage allocation and instantiation
in the page directory entry. It was found that this mutex can become quite 
contended
during the early phases of large databases which make use of huge pages - for 
instance
startup and initial runs. One clear example is a 1.5Gb Oracle database, where 
lockstat
reports that this mutex can be one of the top 5 most contended locks in the 
kernel during
the first few minutes:

&hugetlb_instantiation_mutex:  10678  10678
  ---------------------------
  hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
  ---------------------------
  hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340

contentions:    10678
acquisitions:   99476
waittime-total: 76888911.01 us

Instead of serializing each hugetlb fault, we can deal with concurrent faults 
for pages
in different vmas. The per-vma mutex is initialized when creating a new vma. 
So, back to
the example above, we now get much less contention:

&vma->hugetlb_instantiation_mutex:  1  1
  ---------------------------------
  &vma->hugetlb_instantiation_mutex    1  [<ffffffff8115e216>] hugetlb_fault+0xa6/0x350
  ---------------------------------
  &vma->hugetlb_instantiation_mutex    1  [<ffffffff8115e216>] hugetlb_fault+0xa6/0x350

contentions:          1
acquisitions:    108092
waittime-total:  621.24 us
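
Averaged out, the lockstat figures quoted above work out roughly as follows
(plain arithmetic on the numbers as given, nothing re-measured):

  before:  76888911.01 us / 10678 contentions  ~ 7200 us mean wait per contention
           10678 contentions / 99476 acquisitions  ~ 10.7% of acquisitions contended
  after:   621.24 us / 1 contention  = 621 us
           1 contention / 108092 acquisitions  < 0.001% of acquisitions contended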

Signed-off-by: Davidlohr Bueso <davidlohr.bu...@hp.com>
---
  include/linux/mm_types.h |  3 +++
  mm/hugetlb.c | 12 +---
  mm/mmap.c|  3 +++
  3 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fb425aa..b45fd87 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -289,6 +289,9 @@ struct vm_area_struct {
  #ifdef CONFIG_NUMA
struct mempolicy *vm_policy;/* NUMA policy for the VMA */
  #endif
+#ifdef CONFIG_HUGETLB_PAGE
+   struct mutex hugetlb_instantiation_mutex;
+#endif
  };

  struct core_thread {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 83aff0a..12e665b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -137,12 +137,12 @@ static inline struct hugepage_subpool *subpool_vma(struct 
vm_area_struct *vma)
   * The region data structures are protected by a combination of the mmap_sem
   * and the hugetlb_instantion_mutex.  To access or modify a region the caller
   * must either hold the mmap_sem for write, or the mmap_sem for read and
- * the hugetlb_instantiation mutex:
+ * the vma's hugetlb_instantiation mutex:
   *
 *	down_write(&mm->mmap_sem);
 * or
 *	down_read(&mm->mmap_sem);
- *	mutex_lock(&hugetlb_instantiation_mutex);
+ *	mutex_lock(&vma->hugetlb_instantiation_mutex);
   */
  struct file_region {
struct list_head link;
@@ -2547,7 +2547,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct 
vm_area_struct *vma,

  /*
   * Hugetlb_cow() should be called with page lock of the original hugepage 
held.
- * Called with hugetlb_instantiation_mutex held and pte_page locked so we
+ * Called with the vma's hugetlb_instantiation_mutex held and pte_page locked 
so we
   * cannot race with other handlers or page migration.
   * Keep the pte_same checks anyway to make transition from the mutex easier.
   */
@@ -2847,7 +2847,6 @@ int hugetlb_fault(struct mm_struct *mm, struct 
vm_area_struct *vma,
int ret;
struct page *page = NULL;
struct page *pagecache_page = NULL;
-   static DEFINE_MUTEX(hugetlb_instantiation_mutex);
struct hstate *h = hstate_vma(vma);

	address &= huge_page_mask(h);
@@ -2872,7 +2871,7 @@ int hugetlb_fault(struct mm_struct *mm, struct 
vm_area_struct *vma,
 * get spurious allocation failures if two CPUs race to instantiate
 * the same page in the page cache.
 */
-	mutex_lock(&hugetlb_instantiation_mutex);
+	mutex_lock(&vma->hugetlb_instantiation_mutex);
entry = huge_ptep_get(ptep);
if (huge_pte_none(entry)) {
ret = hugetlb_no_page(mm, vma, address, ptep, flags);
@@ -2943,8 +2942,7 @@ out_page_table_lock:
put_page(page);

  out_mutex:
-	mutex_unlock(&hugetlb_instantiation_mutex);
-
+	mutex_unlock(&vma->hugetlb_instantiation_mutex);
return ret;
  }

diff --git a/mm/mmap.c b/mm/mmap.c
index fbad7b0..8f0b034 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1543,6 +1543,9 @@ munmap_back:
	vma->vm_page_prot = vm_get_page_prot(vm_flags);
	vma->vm_pgoff = pgoff;
	INIT_LIST_HEAD(&vma->anon_vma_chain);
+#ifdef 
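
(The mm/mmap.c hunk is cut off in the archive at this point. Going by the
changelog - "the per-vma mutex is initialized when creating a new vma" - and
by the CONFIG_HUGETLB_PAGE guard used in the mm_types.h hunk, it presumably
continues roughly as:

+#ifdef CONFIG_HUGETLB_PAGE
+	mutex_init(&vma->hugetlb_instantiation_mutex);
+#endif

but the exact lines are not preserved here.)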

Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes

2013-07-12 Thread Hugh Dickins
Adding the essential David Gibson to the Cc list.

On Fri, 12 Jul 2013, Davidlohr Bueso wrote:

> The hugetlb_instantiation_mutex serializes hugepage allocation and 
> instantiation
> in the page directory entry. It was found that this mutex can become quite 
> contended
> during the early phases of large databases which make use of huge pages - for 
> instance
> startup and initial runs. One clear example is a 1.5Gb Oracle database, where 
> lockstat
> reports that this mutex can be one of the top 5 most contended locks in the 
> kernel during
> the first few minutes:
> 
> &hugetlb_instantiation_mutex:  10678  10678
>   ---------------------------
>   hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
>   ---------------------------
>   hugetlb_instantiation_mutex    10678  [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
> 
> contentions:  10678
> acquisitions: 99476
> waittime-total: 76888911.01 us
> 
> Instead of serializing each hugetlb fault, we can deal with concurrent faults 
> for pages
> in different vmas. The per-vma mutex is initialized when creating a new vma. 
> So, back to
> the example above, we now get much less contention:
> 
>  &vma->hugetlb_instantiation_mutex:  1  1
>    ---------------------------------
>    &vma->hugetlb_instantiation_mutex    1  [<ffffffff8115e216>] hugetlb_fault+0xa6/0x350
>    ---------------------------------
>    &vma->hugetlb_instantiation_mutex    1  [<ffffffff8115e216>] hugetlb_fault+0xa6/0x350
> 
> contentions:  1
> acquisitions:108092
> waittime-total:  621.24 us
> 
> Signed-off-by: Davidlohr Bueso <davidlohr.bu...@hp.com>

I agree this is a problem worth solving,
but I doubt this patch is the right solution.

> ---
>  include/linux/mm_types.h |  3 +++
>  mm/hugetlb.c | 12 +---
>  mm/mmap.c|  3 +++
>  3 files changed, 11 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index fb425aa..b45fd87 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -289,6 +289,9 @@ struct vm_area_struct {
>  #ifdef CONFIG_NUMA
>   struct mempolicy *vm_policy;/* NUMA policy for the VMA */
>  #endif
> +#ifdef CONFIG_HUGETLB_PAGE
> + struct mutex hugetlb_instantiation_mutex;
> +#endif
>  };

Bloating every vm_area_struct with a rarely useful mutex:
I'm sure you can construct cases where per-vma mutex would win over
per-mm mutex, but they will have to be very common to justify the bloat.
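
As ballpark numbers only (both sizes depend heavily on the config, so treat
this as a rough estimate rather than a measurement):

  sizeof(struct mutex)           ~ 40 bytes on x86_64 without lock debugging
  sizeof(struct vm_area_struct)  ~ 200 bytes in a comparable config
  10,000 vmas x 40 bytes         ~ 400KB of extra memory for such a process,
                                   paid whether or not it ever maps hugetlb pages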

>  
>  struct core_thread {
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 83aff0a..12e665b 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -137,12 +137,12 @@ static inline struct hugepage_subpool 
> *subpool_vma(struct vm_area_struct *vma)
>   * The region data structures are protected by a combination of the mmap_sem
>   * and the hugetlb_instantion_mutex.  To access or modify a region the caller
>   * must either hold the mmap_sem for write, or the mmap_sem for read and
> - * the hugetlb_instantiation mutex:
> + * the vma's hugetlb_instantiation mutex:

Reading the existing comment, this change looks very suspicious to me.
A per-vma mutex is just not going to provide the necessary exclusion, is
it?  (But I recall next to nothing about these regions and reservations.)

>   *
>   *	down_write(&mm->mmap_sem);
>   * or
>   *	down_read(&mm->mmap_sem);
> - *	mutex_lock(&hugetlb_instantiation_mutex);
> + *	mutex_lock(&vma->hugetlb_instantiation_mutex);
>   */
>  struct file_region {
>   struct list_head link;
> @@ -2547,7 +2547,7 @@ static int unmap_ref_private(struct mm_struct *mm, 
> struct vm_area_struct *vma,
>  
>  /*
>   * Hugetlb_cow() should be called with page lock of the original hugepage 
> held.
> - * Called with hugetlb_instantiation_mutex held and pte_page locked so we
> + * Called with the vma's hugetlb_instantiation_mutex held and pte_page 
> locked so we
>   * cannot race with other handlers or page migration.
>   * Keep the pte_same checks anyway to make transition from the mutex easier.
>   */
> @@ -2847,7 +2847,6 @@ int hugetlb_fault(struct mm_struct *mm, struct 
> vm_area_struct *vma,
>   int ret;
>   struct page *page = NULL;
>   struct page *pagecache_page = NULL;
> - static DEFINE_MUTEX(hugetlb_instantiation_mutex);
>   struct hstate *h = hstate_vma(vma);
>  
>   address &= huge_page_mask(h);
> @@ -2872,7 +2871,7 @@ int hugetlb_fault(struct mm_struct *mm, struct 
> vm_area_struct *vma,
>* get spurious allocation failures if two CPUs race to instantiate
>* the same page in the page cache.
>*/
> -	mutex_lock(&hugetlb_instantiation_mutex);
> +	mutex_lock(&vma->hugetlb_instantiation_mutex);
>   entry = huge_ptep_get(ptep);
>   if (huge_pte_none(entry)) {
>   ret = hugetlb_no_page(mm, vma, address, ptep, flags);
> @@ -2943,8 +2942,7 @@ out_page_table_lock:
>   put_page(page);
>  
>  out_mutex:
> -	mutex_unlock(&hugetlb_instantiation_mutex);
> -
> +	mutex_unlock(&vma->hugetlb_instantiation_mutex);
