Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes
On Tue, Jul 16, 2013 at 08:01:46PM +1000, David Gibson wrote: > On Tue, Jul 16, 2013 at 02:34:24PM +0900, Joonsoo Kim wrote: > > On Mon, Jul 15, 2013 at 09:51:21PM -0400, Rik van Riel wrote: > > > On 07/15/2013 03:24 AM, David Gibson wrote: > > > >On Sun, Jul 14, 2013 at 08:16:44PM -0700, Davidlohr Bueso wrote: > > > > > > >>>Reading the existing comment, this change looks very suspicious to me. > > > >>>A per-vma mutex is just not going to provide the necessary exclusion, > > > >>>is > > > >>>it? (But I recall next to nothing about these regions and > > > >>>reservations.) > > > > > > > >A per-VMA lock is definitely wrong. I think it handles one form of > > > >the race, between threads sharing a VM on a MAP_PRIVATE mapping. > > > >However another form of the race can and does occur between different > > > >MAP_SHARED VMAs in the same or different processes. I think there may > > > >be edge cases involving mremap() and MAP_PRIVATE that will also be > > > >missed by a per-VMA lock. > > > > > > > >Note that the libhugetlbfs testsuite contains tests for both PRIVATE > > > >and SHARED variants of the race. > > > > > > Can we get away with simply using a mutex in the file? > > > Say vma->vm_file->mapping->i_mmap_mutex? > > > > I totally agree with this approach :) > > > > > > > > That might help with multiple processes initializing > > > multiple shared memory segments at the same time, and > > > should not hurt the case of a process mapping its own > > > hugetlbfs area. > > > > > > It might have the potential to hurt when getting private > > > copies on a MAP_PRIVATE area, though. I have no idea > > > how common it is for multiple processes to MAP_PRIVATE > > > the same hugetlbfs file, though... > > > > Currently, getting private copies on a MAP_PRIVATE area is also > > serialized by hugetlb_instantiation_mutex. > > How do we get worse with your approach? > > > > BTW, we have one race problem related to hugetlb_instantiation_mutex. 
> > It is not right protection for region structure handling. We map the
> > area without holding a hugetlb_instantiation_mutex, so there is
> > race condition between mapping a new area and faulting the other area.
> > Am I missing?
>
> The hugetlb_instantiation_mutex has nothing to do with protecting
> region structures. It exists only to address one very specific and
> frequently misunderstood race.

Yes, it was introduced for that purpose, but, currently, it is also used
for protecting the region structures. You can see the comment below in
mm/hugetlb.c:

 * The region data structures are protected by a combination of the mmap_sem
 * and the hugetlb_instantiation_mutex. To access or modify a region the caller
 * must either hold the mmap_sem for write, or the mmap_sem for read and
 * the hugetlb_instantiation mutex:
 *
 *	down_write(&mm->mmap_sem);
 * or
 *	down_read(&mm->mmap_sem);
 *	mutex_lock(&hugetlb_instantiation_mutex);
 */

Thanks.

> --
> David Gibson | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
> | _way_ _around_!
> http://www.ozlabs.org/~dgibson
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes
On Tue, Jul 16, 2013 at 02:34:24PM +0900, Joonsoo Kim wrote: > On Mon, Jul 15, 2013 at 09:51:21PM -0400, Rik van Riel wrote: > > On 07/15/2013 03:24 AM, David Gibson wrote: > > >On Sun, Jul 14, 2013 at 08:16:44PM -0700, Davidlohr Bueso wrote: > > > > >>>Reading the existing comment, this change looks very suspicious to me. > > >>>A per-vma mutex is just not going to provide the necessary exclusion, is > > >>>it? (But I recall next to nothing about these regions and > > >>>reservations.) > > > > > >A per-VMA lock is definitely wrong. I think it handles one form of > > >the race, between threads sharing a VM on a MAP_PRIVATE mapping. > > >However another form of the race can and does occur between different > > >MAP_SHARED VMAs in the same or different processes. I think there may > > >be edge cases involving mremap() and MAP_PRIVATE that will also be > > >missed by a per-VMA lock. > > > > > >Note that the libhugetlbfs testsuite contains tests for both PRIVATE > > >and SHARED variants of the race. > > > > Can we get away with simply using a mutex in the file? > > Say vma->vm_file->mapping->i_mmap_mutex? > > I totally agree with this approach :) > > > > > That might help with multiple processes initializing > > multiple shared memory segments at the same time, and > > should not hurt the case of a process mapping its own > > hugetlbfs area. > > > > It might have the potential to hurt when getting private > > copies on a MAP_PRIVATE area, though. I have no idea > > how common it is for multiple processes to MAP_PRIVATE > > the same hugetlbfs file, though... > > Currently, getting private copies on a MAP_PRIVATE area is also > serialized by hugetlb_instantiation_mutex. > How do we get worse with your approach? > > BTW, we have one race problem related to hugetlb_instantiation_mutex. > It is not right protection for region structure handling. 
We map the > area without holding a hugetlb_instantiation_mutex, so there is > race condition between mapping a new area and faulting the other area. > Am I missing? The hugetlb_instantiation_mutex has nothing to do with protecting region structures. It exists only to address one very specific and frequently misunderstood race. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes
On Mon, Jul 15, 2013 at 09:51:21PM -0400, Rik van Riel wrote: > On 07/15/2013 03:24 AM, David Gibson wrote: > >On Sun, Jul 14, 2013 at 08:16:44PM -0700, Davidlohr Bueso wrote: > > >>>Reading the existing comment, this change looks very suspicious to me. > >>>A per-vma mutex is just not going to provide the necessary exclusion, is > >>>it? (But I recall next to nothing about these regions and > >>>reservations.) > > > >A per-VMA lock is definitely wrong. I think it handles one form of > >the race, between threads sharing a VM on a MAP_PRIVATE mapping. > >However another form of the race can and does occur between different > >MAP_SHARED VMAs in the same or different processes. I think there may > >be edge cases involving mremap() and MAP_PRIVATE that will also be > >missed by a per-VMA lock. > > > >Note that the libhugetlbfs testsuite contains tests for both PRIVATE > >and SHARED variants of the race. > > Can we get away with simply using a mutex in the file? > Say vma->vm_file->mapping->i_mmap_mutex? So I don't know the VM well enough to know if this could conflict with other usages of i_mmap_mutex. But unfortunately, whether or not it's otherwise correct, that approach won't address the scalability issue at hand here. In the case where the race matters, we're always dealing with the same file. Otherwise, we'd end up with a genuine, rather than spurious, out-of-memory error, regardless of how the race turned out. So in the case with the performance bottleneck we're considering, the i_mmap_mutex approach degenerates to serialization on a single mutex, just as before. In order to improve scalability, we need to consider which page within the file we're instantiating, which is what the hash function in my patch does. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
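The hashed-mutex idea described above can be sketched in userspace as follows. Everything here is our own invention for illustration, not the posted patch: the table size, the hash (the 64-bit golden-ratio multiplier), and all function names are hypothetical. The property being demonstrated is the one argued in the message: two faults on the same page of the same file pick the same mutex and still serialize, while faults on different pages of that file usually land on different mutexes and can proceed in parallel.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define NUM_FAULT_MUTEXES 64 /* hypothetical table size */
static pthread_mutex_t fault_mutexes[NUM_FAULT_MUTEXES];

static void fault_mutexes_init(void)
{
    for (int i = 0; i < NUM_FAULT_MUTEXES; i++)
        pthread_mutex_init(&fault_mutexes[i], NULL);
}

/* Pick a table slot from (file, page offset), so contention scales
 * with pages rather than files. The multiplier is the 64-bit
 * golden-ratio constant; the shift folds high bits down before the
 * truncating modulo. */
static unsigned int fault_mutex_idx(uintptr_t mapping, unsigned long pgoff)
{
    uint64_t key = (uint64_t)mapping ^ ((uint64_t)pgoff * 0x9e3779b97f4a7c15ULL);
    key ^= key >> 33;
    return (unsigned int)(key % NUM_FAULT_MUTEXES);
}

/* Serialize a fault on one page of one file. */
static void fault_lock(uintptr_t mapping, unsigned long pgoff)
{
    pthread_mutex_lock(&fault_mutexes[fault_mutex_idx(mapping, pgoff)]);
}

static void fault_unlock(uintptr_t mapping, unsigned long pgoff)
{
    pthread_mutex_unlock(&fault_mutexes[fault_mutex_idx(mapping, pgoff)]);
}
```

With one mutex per file (the i_mmap_mutex approach), every fault on the hot file would funnel through the same lock; with the table, only same-page faults collide (plus the occasional hash collision, which costs a little concurrency but never correctness).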
Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes
On Mon, Jul 15, 2013 at 05:12:31PM -0700, Davidlohr Bueso wrote: > On Mon, 2013-07-15 at 16:08 -0700, Andrew Morton wrote: > > On Mon, 15 Jul 2013 17:24:32 +1000 David Gibson > > wrote: > > > > > I have previously proposed a correct method of improving scalability, > > > although it doesn't eliminate the lock. That's to use a set of hashed > > > mutexes. > > > > Yep - hashing the mutexes is an obvious and nicely localized way of > > improving this. It's a tweak, not a design change. > > > > The changelog should describe the choice of the hash key with great > > precision, please. It's important and is the first thing which > > reviewers and readers will zoom in on. Yeah, that is important. I no longer have much interest in the result of this patch, so I'll leave it to others to do the forward port and cleanup. But I will point out the gotcha here is that the hash key needs to be based on (address_space & file offset) for MAP_SHARED, but (mm & address) for MAP_PRIVATE. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
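That key-selection gotcha can be written down as a small sketch (struct and function names are ours, purely hypothetical). For MAP_SHARED, faults from *different* mms on the same file page race with each other, so the key must be the (address_space, file offset) pair and must not involve the mm at all. For MAP_PRIVATE, the race exists only between threads of one mm, so (mm, faulting address) is the natural key.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical representation of the hash key described above. */
struct fault_key {
    uint64_t container; /* address_space ptr (shared) or mm ptr (private) */
    uint64_t offset;    /* file pgoff (shared) or faulting address (private) */
};

static struct fault_key make_fault_key(int is_shared,
                                       uint64_t mapping, uint64_t pgoff,
                                       uint64_t mm, uint64_t address)
{
    struct fault_key k;
    if (is_shared) {
        /* MAP_SHARED: the same file page must map to the same mutex
         * no matter which process faults, so the mm is excluded. */
        k.container = mapping;
        k.offset = pgoff;
    } else {
        /* MAP_PRIVATE: only threads of one mm race, so key on the mm
         * and the faulting address. */
        k.container = mm;
        k.offset = address;
    }
    return k;
}
```

Keying MAP_SHARED on the mm would be a correctness bug, not just a performance one: two processes faulting the same shared page would hash to different mutexes and the original race would reappear.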
Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes
On Mon, Jul 15, 2013 at 09:51:21PM -0400, Rik van Riel wrote: > On 07/15/2013 03:24 AM, David Gibson wrote: > >On Sun, Jul 14, 2013 at 08:16:44PM -0700, Davidlohr Bueso wrote: > > >>>Reading the existing comment, this change looks very suspicious to me. > >>>A per-vma mutex is just not going to provide the necessary exclusion, is > >>>it? (But I recall next to nothing about these regions and > >>>reservations.) > > > >A per-VMA lock is definitely wrong. I think it handles one form of > >the race, between threads sharing a VM on a MAP_PRIVATE mapping. > >However another form of the race can and does occur between different > >MAP_SHARED VMAs in the same or different processes. I think there may > >be edge cases involving mremap() and MAP_PRIVATE that will also be > >missed by a per-VMA lock. > > > >Note that the libhugetlbfs testsuite contains tests for both PRIVATE > >and SHARED variants of the race. > > Can we get away with simply using a mutex in the file? > Say vma->vm_file->mapping->i_mmap_mutex? I totally agree with this approach :) > > That might help with multiple processes initializing > multiple shared memory segments at the same time, and > should not hurt the case of a process mapping its own > hugetlbfs area. > > It might have the potential to hurt when getting private > copies on a MAP_PRIVATE area, though. I have no idea > how common it is for multiple processes to MAP_PRIVATE > the same hugetlbfs file, though... Currently, getting private copies on a MAP_PRIVATE area is also serialized by hugetlb_instantiation_mutex. How do we get worse with your approach? BTW, we have one race problem related to hugetlb_instantiation_mutex. It is not right protection for region structure handling. We map the area without holding a hugetlb_instantiation_mutex, so there is race condition between mapping a new area and faulting the other area. Am I missing? Thanks. 
Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes
On 07/15/2013 03:24 AM, David Gibson wrote: On Sun, Jul 14, 2013 at 08:16:44PM -0700, Davidlohr Bueso wrote: Reading the existing comment, this change looks very suspicious to me. A per-vma mutex is just not going to provide the necessary exclusion, is it? (But I recall next to nothing about these regions and reservations.) A per-VMA lock is definitely wrong. I think it handles one form of the race, between threads sharing a VM on a MAP_PRIVATE mapping. However another form of the race can and does occur between different MAP_SHARED VMAs in the same or different processes. I think there may be edge cases involving mremap() and MAP_PRIVATE that will also be missed by a per-VMA lock. Note that the libhugetlbfs testsuite contains tests for both PRIVATE and SHARED variants of the race. Can we get away with simply using a mutex in the file? Say vma->vm_file->mapping->i_mmap_mutex? That might help with multiple processes initializing multiple shared memory segments at the same time, and should not hurt the case of a process mapping its own hugetlbfs area. It might have the potential to hurt when getting private copies on a MAP_PRIVATE area, though. I have no idea how common it is for multiple processes to MAP_PRIVATE the same hugetlbfs file, though... -- All rights reversed
Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes
On Mon, 2013-07-15 at 16:08 -0700, Andrew Morton wrote: > On Mon, 15 Jul 2013 17:24:32 +1000 David Gibson > wrote: > > > I have previously proposed a correct method of improving scalability, > > although it doesn't eliminate the lock. That's to use a set of hashed > > mutexes. > > Yep - hashing the mutexes is an obvious and nicely localized way of > improving this. It's a tweak, not a design change. > > The changelog should describe the choice of the hash key with great > precision, please. It's important and is the first thing which > reviewers and readers will zoom in on. > > Should the individual mutexes be cacheline aligned? Depends on the > acquisition frequency, I guess. Please let's work through that. In my test cases, involving different RDBMSes, I'm getting around 114k acquisitions. > > Let's not damage uniprocessor kernels too much. AFAICT the main offender > here is fault_mutex_hash(), which is the world's most obfuscated "return > 0;". I guess we could add an ifndef CONFIG_SMP check to the function and return 0 right away. That would eliminate any overhead in fault_mutex_hash(). > > > It wasn't merged before, but I don't recall the reasons > > why. So I've forward-ported the patch (will send once everyone agrees that the matter is settled), including the changes Anton Blanchard added exactly two years ago: https://lkml.org/lkml/2011/7/15/31 My tests show that the number of lock contentions drops from ~11k to around 500, so this approach alleviates a lot of the bottleneck. I've also run it against libhugetlbfs without any regressions. Thanks, Davidlohr
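The ifndef-CONFIG_SMP suggestion above might look like the following. This is illustrative only: here CONFIG_SMP is a plain preprocessor toggle rather than the real kernel config machinery, and the SMP-side hash is a placeholder, not the patch's actual function.

```c
#include <assert.h>

#ifdef CONFIG_SMP
/* SMP: fold the key into a table slot (placeholder mixing). */
static unsigned int fault_mutex_hash(unsigned long key, unsigned int table_size)
{
    key ^= key >> 16;
    return (unsigned int)(key % table_size);
}
#else
/* UP: faults cannot run concurrently anyway, so always use slot 0.
 * The whole hash computation compiles away. */
static unsigned int fault_mutex_hash(unsigned long key, unsigned int table_size)
{
    (void)key;
    (void)table_size;
    return 0;
}
#endif
```

On uniprocessor builds the compiler can then fold the table lookup down to the single mutex, which addresses the "world's most obfuscated return 0" complaint.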
Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes
On Mon, 15 Jul 2013 17:24:32 +1000 David Gibson wrote: > I have previously proposed a correct method of improving scalability, > although it doesn't eliminate the lock. That's to use a set of hashed > mutexes. Yep - hashing the mutexes is an obvious and nicely localized way of improving this. It's a tweak, not a design change. The changelog should describe the choice of the hash key with great precision, please. It's important and is the first thing which reviewers and readers will zoom in on. Should the individual mutexes be cacheline aligned? Depends on the acquisition frequency, I guess. Please let's work through that. Let's not damage uniprocessor kernels too much. AFAICT the main offender here is fault_mutex_hash(), which is the world's most obfuscated "return 0;". > It wasn't merged before, but I don't recall the reasons > why. Me either.
Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes
On Sun, Jul 14, 2013 at 08:16:44PM -0700, Davidlohr Bueso wrote: > On Fri, 2013-07-12 at 17:54 -0700, Hugh Dickins wrote: > > Adding the essential David Gibson to the Cc list. > > > > On Fri, 12 Jul 2013, Davidlohr Bueso wrote: > > > > > The hugetlb_instantiation_mutex serializes hugepage allocation and instantiation > > > in the page directory entry. It was found that this mutex can become quite contended > > > during the early phases of large databases which make use of huge pages - for instance > > > startup and initial runs. One clear example is a 1.5Gb Oracle database, where lockstat > > > reports that this mutex can be one of the top 5 most contended locks in the kernel during > > > the first few minutes: > > > > > > hugetlb_instantiation_mutex: 10678 10678 > > > --- > > > hugetlb_instantiation_mutex 10678 [] hugetlb_fault+0x9e/0x340 > > > --- > > > hugetlb_instantiation_mutex 10678 [] hugetlb_fault+0x9e/0x340 > > > > > > contentions: 10678 > > > acquisitions: 99476 > > > waittime-total: 76888911.01 us > > > > > > Instead of serializing each hugetlb fault, we can deal with concurrent faults for pages > > > in different vmas. The per-vma mutex is initialized when creating a new vma. So, back to > > > the example above, we now get much less contention: > > > > > > &vma->hugetlb_instantiation_mutex: 1 1 > > > --- > > > &vma->hugetlb_instantiation_mutex 1 [] hugetlb_fault+0xa6/0x350 > > > --- > > > &vma->hugetlb_instantiation_mutex 1 [] hugetlb_fault+0xa6/0x350 > > > > > > contentions: 1 > > > acquisitions: 108092 > > > waittime-total: 621.24 us > > > > > > Signed-off-by: Davidlohr Bueso > > > > I agree this is a problem worth solving, > > but I doubt this patch is the right solution. It's not.
> > > --- > > > include/linux/mm_types.h | 3 +++ > > > mm/hugetlb.c | 12 +--- > > > mm/mmap.c| 3 +++ > > > 3 files changed, 11 insertions(+), 7 deletions(-) > > > > > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > > > index fb425aa..b45fd87 100644 > > > --- a/include/linux/mm_types.h > > > +++ b/include/linux/mm_types.h > > > @@ -289,6 +289,9 @@ struct vm_area_struct { > > > #ifdef CONFIG_NUMA > > > struct mempolicy *vm_policy;/* NUMA policy for the VMA */ > > > #endif > > > +#ifdef CONFIG_HUGETLB_PAGE > > > + struct mutex hugetlb_instantiation_mutex; > > > +#endif > > > }; > > > > Bloating every vm_area_struct with a rarely useful mutex: > > I'm sure you can construct cases where per-vma mutex would win over > > per-mm mutex, but they will have to be very common to justify the bloat. > > > > I cannot disagree here, this was my main concern about this patch, and, > as you mentioned, if we can just get rid of the need for the lock, much > better. So, by all means try to get rid of the mutex, but doing so is surprisingly difficult - I made several failed attempts a long time ago, before giving up and creating the mutex in the first place. Note that the there is no analogous handling in the normal page case, because for normal pages we always assume we have a few pages of "slush" and can temporarily overallocate without problems. Because hugepages are a more precious resource, it's usual to want to allocate and use every single one - spurious OOMs when racing to allocate the very last hugepage are what the mutex protects against. > > > struct core_thread { > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > > index 83aff0a..12e665b 100644 > > > --- a/mm/hugetlb.c > > > +++ b/mm/hugetlb.c > > > @@ -137,12 +137,12 @@ static inline struct hugepage_subpool > > > *subpool_vma(struct vm_area_struct *vma) > > > * The region data structures are protected by a combination of the > > > mmap_sem > > > * and the hugetlb_instantion_mutex. 
To access or modify a region the > > > caller * must either hold the mmap_sem for write, or the mmap_sem for read and > > > - * the hugetlb_instantiation mutex: + * the vma's hugetlb_instantiation mutex: > > > > Reading the existing comment, this change looks very suspicious to me. > > A per-vma mutex is just not going to provide the necessary exclusion, is > > it? (But I recall next to nothing about these regions and > > reservations.) A per-VMA lock is definitely wrong. I think it handles one form of the race, between threads sharing a VM on a MAP_PRIVATE mapping. However another form of the race can and does occur between different MAP_SHARED VMAs in the same or different processes. I think there may be edge cases involving mremap() and MAP_PRIVATE that will also be missed by a per-VMA lock. Note that the libhugetlbfs testsuite contains tests for both PRIVATE and SHARED variants of the race.
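The "racing for the very last hugepage" scenario described in this thread can be made concrete with a toy userspace model. This is entirely our construction (none of these names or structures are kernel code): a pool holds one final page and two faults race for it. With the instantiation mutex held across the whole check-and-take, the losing fault observes that the page was already instantiated and reuses it; without the mutex, both faults could pass the check and the loser would return a spurious out-of-memory error.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t instantiation_mutex = PTHREAD_MUTEX_INITIALIZER;
static int free_hugepages = 1;   /* exactly one page left in the pool */
static int spurious_failures;    /* what an unserialized racer would count */

/* Serialized fault path: check-and-take is one atomic step under the
 * mutex, so the second fault sees the page already instantiated and
 * simply reuses it instead of reporting OOM. */
static void *fault_one_page(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&instantiation_mutex);
    if (free_hugepages > 0)
        free_hugepages--;        /* this fault instantiates the page */
    /* else: racing fault already instantiated it -- reuse, no OOM */
    pthread_mutex_unlock(&instantiation_mutex);
    return NULL;
}

static void run_race(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, fault_one_page, NULL);
    pthread_create(&b, NULL, fault_one_page, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
}
```

This also illustrates why there is no analogous lock for normal pages: with a few pages of slush, a transient double-allocation is harmless, whereas hugepages are precious enough that the last one is routinely in use.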
Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes
On Sun, Jul 14, 2013 at 08:16:44PM -0700, Davidlohr Bueso wrote:

On Fri, 2013-07-12 at 17:54 -0700, Hugh Dickins wrote:

Adding the essential David Gibson to the Cc list.

On Fri, 12 Jul 2013, Davidlohr Bueso wrote:

The hugetlb_instantiation_mutex serializes hugepage allocation and instantiation in the page directory entry. It was found that this mutex can become quite contended during the early phases of large databases which make use of huge pages - for instance startup and initial runs. One clear example is a 1.5Gb Oracle database, where lockstat reports that this mutex can be one of the top 5 most contended locks in the kernel during the first few minutes:

    hugetlb_instantiation_mutex:  10678  10678
    ---
      hugetlb_instantiation_mutex  10678  [8115e14e]  hugetlb_fault+0x9e/0x340
    ---
      hugetlb_instantiation_mutex  10678  [8115e14e]  hugetlb_fault+0x9e/0x340

    contentions:     10678
    acquisitions:    99476
    waittime-total:  76888911.01 us

Instead of serializing each hugetlb fault, we can deal with concurrent faults for pages in different vmas. The per-vma mutex is initialized when creating a new vma. So, back to the example above, we now get much less contention:

    vma->hugetlb_instantiation_mutex:  1  1
    -
      vma->hugetlb_instantiation_mutex  1  [8115e216]  hugetlb_fault+0xa6/0x350
    -
      vma->hugetlb_instantiation_mutex  1  [8115e216]  hugetlb_fault+0xa6/0x350

    contentions:     1
    acquisitions:    108092
    waittime-total:  621.24 us

Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com

I agree this is a problem worth solving, but I doubt this patch is the right solution.

It's not.

---
 include/linux/mm_types.h | 3 +++
 mm/hugetlb.c             | 12 +---
 mm/mmap.c                | 3 +++
 3 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fb425aa..b45fd87 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -289,6 +289,9 @@ struct vm_area_struct {
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
+#ifdef CONFIG_HUGETLB_PAGE
+	struct mutex hugetlb_instantiation_mutex;
+#endif
 };

Bloating every vm_area_struct with a rarely useful mutex: I'm sure you can construct cases where per-vma mutex would win over per-mm mutex, but they will have to be very common to justify the bloat.

I cannot disagree here, this was my main concern about this patch, and, as you mentioned, if we can just get rid of the need for the lock, much better.

So, by all means try to get rid of the mutex, but doing so is surprisingly difficult - I made several failed attempts a long time ago, before giving up and creating the mutex in the first place.

Note that there is no analogous handling in the normal page case, because for normal pages we always assume we have a few pages of slush and can temporarily overallocate without problems. Because hugepages are a more precious resource, it's usual to want to allocate and use every single one - spurious OOMs when racing to allocate the very last hugepage are what the mutex protects against.

 struct core_thread {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 83aff0a..12e665b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -137,12 +137,12 @@ static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
  * The region data structures are protected by a combination of the mmap_sem
  * and the hugetlb_instantion_mutex. To access or modify a region the caller
  * must either hold the mmap_sem for write, or the mmap_sem for read and
- * the hugetlb_instantiation mutex:
+ * the vma's hugetlb_instantiation mutex:

Reading the existing comment, this change looks very suspicious to me. A per-vma mutex is just not going to provide the necessary exclusion, is it? (But I recall next to nothing about these regions and reservations.)

A per-VMA lock is definitely wrong. I think it handles one form of the race, between threads sharing a VM on a MAP_PRIVATE mapping. However another form of the race can and does occur between different MAP_SHARED VMAs in the same or different processes. I think there may be edge cases involving mremap() and MAP_PRIVATE that will also be missed by a per-VMA lock.

Note that the libhugetlbfs testsuite contains tests for both PRIVATE and SHARED variants of the race.

  *
  *	down_write(&mm->mmap_sem);
  * or
  *	down_read(&mm->mmap_sem);
- *
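David's point about spurious OOMs racing for the last hugepage can be made concrete with a toy model (hypothetical userspace C, not kernel code — the names and counts are illustrative): with the fault path serialized, a second thread faulting the same page re-checks the PTE under the lock and succeeds without allocating, instead of failing.

```c
#include <stdbool.h>

/* Toy model of the race the instantiation mutex guards against:
 * one free hugepage, two threads faulting the same virtual page. */
static int  free_hugepages = 1;
static bool pte_present    = false;

/* Models one serialized pass through the fault path (the real code
 * holds hugetlb_instantiation_mutex around this re-check + allocate).
 * Returns true if the fault succeeds, false for an OOM. */
static bool fault_one_page(void)
{
    if (pte_present)
        return true;        /* raced: the winner already instantiated it */
    if (free_hugepages == 0)
        return false;       /* genuine OOM */
    free_hugepages--;       /* allocate the last hugepage */
    pte_present = true;
    return true;
}
```

Without serialization, both threads could pass the pte_present check before either allocates; the loser would then see free_hugepages == 0 and report OOM even though the page it needs now exists — the spurious failure described above.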
Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes
On Mon, 15 Jul 2013 17:24:32 +1000 David Gibson da...@gibson.dropbear.id.au wrote:

I have previously proposed a correct method of improving scalability, although it doesn't eliminate the lock. That's to use a set of hashed mutexes.

Yep - hashing the mutexes is an obvious and nicely localized way of improving this. It's a tweak, not a design change.

The changelog should describe the choice of the hash key with great precision, please. It's important and is the first thing which reviewers and readers will zoom in on.

Should the individual mutexes be cacheline aligned? Depends on the acquisition frequency, I guess. Please let's work through that.

Let's not damage uniprocessor kernels too much. AFAICT the main offender here is fault_mutex_hash(), which is the world's most obfuscated "return 0;".

It wasn't merged before, but I don't recall the reasons why.

Me either.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes
On Mon, 2013-07-15 at 16:08 -0700, Andrew Morton wrote:

On Mon, 15 Jul 2013 17:24:32 +1000 David Gibson da...@gibson.dropbear.id.au wrote:

I have previously proposed a correct method of improving scalability, although it doesn't eliminate the lock. That's to use a set of hashed mutexes.

Yep - hashing the mutexes is an obvious and nicely localized way of improving this. It's a tweak, not a design change. The changelog should describe the choice of the hash key with great precision, please. It's important and is the first thing which reviewers and readers will zoom in on. Should the individual mutexes be cacheline aligned? Depends on the acquisition frequency, I guess. Please let's work through that.

In my test cases, involving different RDBMS, I'm getting around 114k acquisitions.

Let's not damage uniprocessor kernels too much. AFAICT the main offender here is fault_mutex_hash(), which is the world's most obfuscated "return 0;".

I guess we could add an #ifndef CONFIG_SMP check to the function and return 0 right away. That would eliminate any overhead in fault_mutex_hash().

It wasn't merged before, but I don't recall the reasons why.

So I've forward ported the patch (will send once everyone agrees that the matter is settled), including the changes Anton Blanchard added exactly two years ago: https://lkml.org/lkml/2011/7/15/31

My tests show that the number of lock contentions drops from ~11k to around 500. So this approach alleviates a lot of the bottleneck. I've also run it against libhugetlbfs without any regressions.

Thanks,
Davidlohr
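The hashed-mutex scheme under discussion can be sketched in userspace C (the names and the hash function here are illustrative, not Anton's or David's actual patch): faults on the same (mapping, index) pair always take the same mutex, while unrelated faults spread across a small table. On a uniprocessor build the kernel version would simply return 0 — one mutex, no hashing — which is the "return 0" Andrew refers to.

```c
#include <pthread.h>
#include <stdint.h>

/* Power-of-two table size; the real patch would size this from the
 * number of possible CPUs rather than a fixed constant. */
#define NUM_FAULT_MUTEXES 64U

static pthread_mutex_t fault_mutex_table[NUM_FAULT_MUTEXES];

static void fault_mutex_table_init(void)
{
    for (unsigned int i = 0; i < NUM_FAULT_MUTEXES; i++)
        pthread_mutex_init(&fault_mutex_table[i], NULL);
}

/* Stand-in for fault_mutex_hash(): mix the mapping pointer and the
 * page index so faults on the same page always pick the same mutex.
 * On !CONFIG_SMP the kernel version degenerates to "return 0;". */
static unsigned int fault_mutex_hash(const void *mapping, unsigned long idx)
{
    uintptr_t key = (uintptr_t)mapping ^ (idx * 0x9e3779b9UL);
    return (unsigned int)(key % NUM_FAULT_MUTEXES);
}

static pthread_mutex_t *fault_mutex(const void *mapping, unsigned long idx)
{
    return &fault_mutex_table[fault_mutex_hash(mapping, idx)];
}
```

The key property is the one the changelog needs to spell out: the hash is keyed on the shared object and page index, not on the vma, so two processes faulting the same page of the same file serialize on the same mutex.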
Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes
On 07/15/2013 03:24 AM, David Gibson wrote:

On Sun, Jul 14, 2013 at 08:16:44PM -0700, Davidlohr Bueso wrote:

Reading the existing comment, this change looks very suspicious to me. A per-vma mutex is just not going to provide the necessary exclusion, is it? (But I recall next to nothing about these regions and reservations.)

A per-VMA lock is definitely wrong. I think it handles one form of the race, between threads sharing a VM on a MAP_PRIVATE mapping. However another form of the race can and does occur between different MAP_SHARED VMAs in the same or different processes. I think there may be edge cases involving mremap() and MAP_PRIVATE that will also be missed by a per-VMA lock.

Note that the libhugetlbfs testsuite contains tests for both PRIVATE and SHARED variants of the race.

Can we get away with simply using a mutex in the file? Say vma->vm_file->mapping->i_mmap_mutex?

That might help with multiple processes initializing multiple shared memory segments at the same time, and should not hurt the case of a process mapping its own hugetlbfs area.

It might have the potential to hurt when getting private copies on a MAP_PRIVATE area, though. I have no idea how common it is for multiple processes to MAP_PRIVATE the same hugetlbfs file, though...

--
All rights reversed
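Rik's suggestion keys the lock on the shared object rather than on each process's mapping. A toy model (hypothetical struct names, userspace C — not the real kernel types) makes the difference concrete: two MAP_SHARED vmas of the same file resolve to the same per-file mutex, whereas per-vma mutexes are distinct and so cannot serialize the MAP_SHARED race David describes.

```c
#include <pthread.h>

/* Simplified stand-ins for the kernel objects involved. */
struct address_space_model { pthread_mutex_t i_mmap_mutex; };
struct file_model          { struct address_space_model *mapping; };
struct vma_model {
    struct file_model *vm_file;
    pthread_mutex_t    per_vma_mutex;   /* the rejected per-vma approach */
};

/* Per-vma locking: two vmas of the same file get different mutexes,
 * so concurrent faults on the same hugepage are NOT serialized. */
static pthread_mutex_t *lock_per_vma(struct vma_model *vma)
{
    return &vma->per_vma_mutex;
}

/* Per-file locking (Rik's vma->vm_file->mapping->i_mmap_mutex):
 * every vma of the file resolves to the same mutex. */
static pthread_mutex_t *lock_per_file(struct vma_model *vma)
{
    return &vma->vm_file->mapping->i_mmap_mutex;
}
```

The trade-off Rik raises falls out of the same picture: per-file locking also forces unrelated MAP_PRIVATE copies of one file onto a single mutex, which is where it could hurt.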
Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes
On Mon, Jul 15, 2013 at 09:51:21PM -0400, Rik van Riel wrote:

On 07/15/2013 03:24 AM, David Gibson wrote:

On Sun, Jul 14, 2013 at 08:16:44PM -0700, Davidlohr Bueso wrote:

Reading the existing comment, this change looks very suspicious to me. A per-vma mutex is just not going to provide the necessary exclusion, is it? (But I recall next to nothing about these regions and reservations.)

A per-VMA lock is definitely wrong. I think it handles one form of the race, between threads sharing a VM on a MAP_PRIVATE mapping. However another form of the race can and does occur between different MAP_SHARED VMAs in the same or different processes. I think there may be edge cases involving mremap() and MAP_PRIVATE that will also be missed by a per-VMA lock.

Note that the libhugetlbfs testsuite contains tests for both PRIVATE and SHARED variants of the race.

Can we get away with simply using a mutex in the file? Say vma->vm_file->mapping->i_mmap_mutex?

I totally agree with this approach :)

That might help with multiple processes initializing multiple shared memory segments at the same time, and should not hurt the case of a process mapping its own hugetlbfs area.

It might have the potential to hurt when getting private copies on a MAP_PRIVATE area, though. I have no idea how common it is for multiple processes to MAP_PRIVATE the same hugetlbfs file, though...

Currently, getting private copies on a MAP_PRIVATE area is also serialized by hugetlb_instantiation_mutex. How do we get worse with your approach?

BTW, we have one race problem related to hugetlb_instantiation_mutex. It is not the right protection for region structure handling. We map the area without holding a hugetlb_instantiation_mutex, so there is a race condition between mapping a new area and faulting the other area. Am I missing something?

Thanks.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes
This seems incorrect. hugetlb_instantiation_mutex protects chains of struct file_region in inode->i_mapping->private_list (VM_MAYSHARE) or vma_resv_map(vma)->regions (!VM_MAYSHARE). These chains obviously can be shared between several vmas, so a per-vma lock cannot protect them.

Davidlohr Bueso wrote:

The hugetlb_instantiation_mutex serializes hugepage allocation and instantiation in the page directory entry. It was found that this mutex can become quite contended during the early phases of large databases which make use of huge pages - for instance startup and initial runs. One clear example is a 1.5Gb Oracle database, where lockstat reports that this mutex can be one of the top 5 most contended locks in the kernel during the first few minutes:

    hugetlb_instantiation_mutex:  10678  10678
    ---
      hugetlb_instantiation_mutex  10678  []  hugetlb_fault+0x9e/0x340
    ---
      hugetlb_instantiation_mutex  10678  []  hugetlb_fault+0x9e/0x340

    contentions:     10678
    acquisitions:    99476
    waittime-total:  76888911.01 us

Instead of serializing each hugetlb fault, we can deal with concurrent faults for pages in different vmas. The per-vma mutex is initialized when creating a new vma. So, back to the example above, we now get much less contention:

    vma->hugetlb_instantiation_mutex:  1  1
    -
      vma->hugetlb_instantiation_mutex  1  []  hugetlb_fault+0xa6/0x350
    -
      vma->hugetlb_instantiation_mutex  1  []  hugetlb_fault+0xa6/0x350

    contentions:     1
    acquisitions:    108092
    waittime-total:  621.24 us

Signed-off-by: Davidlohr Bueso

---
 include/linux/mm_types.h | 3 +++
 mm/hugetlb.c             | 12 +---
 mm/mmap.c                | 3 +++
 3 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fb425aa..b45fd87 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -289,6 +289,9 @@ struct vm_area_struct {
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
+#ifdef CONFIG_HUGETLB_PAGE
+	struct mutex hugetlb_instantiation_mutex;
+#endif
 };

 struct core_thread {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 83aff0a..12e665b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -137,12 +137,12 @@ static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
  * The region data structures are protected by a combination of the mmap_sem
  * and the hugetlb_instantion_mutex. To access or modify a region the caller
  * must either hold the mmap_sem for write, or the mmap_sem for read and
- * the hugetlb_instantiation mutex:
+ * the vma's hugetlb_instantiation mutex:
  *
  *	down_write(&mm->mmap_sem);
  * or
  *	down_read(&mm->mmap_sem);
- *	mutex_lock(&hugetlb_instantiation_mutex);
+ *	mutex_lock(&vma->hugetlb_instantiation_mutex);
  */
 struct file_region {
 	struct list_head link;
@@ -2547,7 +2547,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,

 /*
  * Hugetlb_cow() should be called with page lock of the original hugepage held.
- * Called with hugetlb_instantiation_mutex held and pte_page locked so we
+ * Called with the vma's hugetlb_instantiation_mutex held and pte_page locked so we
  * cannot race with other handlers or page migration.
  * Keep the pte_same checks anyway to make transition from the mutex easier.
  */
@@ -2847,7 +2847,6 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	int ret;
 	struct page *page = NULL;
 	struct page *pagecache_page = NULL;
-	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
 	struct hstate *h = hstate_vma(vma);

 	address &= huge_page_mask(h);
@@ -2872,7 +2871,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * get spurious allocation failures if two CPUs race to instantiate
 	 * the same page in the page cache.
 	 */
-	mutex_lock(&hugetlb_instantiation_mutex);
+	mutex_lock(&vma->hugetlb_instantiation_mutex);
 	entry = huge_ptep_get(ptep);
 	if (huge_pte_none(entry)) {
 		ret = hugetlb_no_page(mm, vma, address, ptep, flags);
@@ -2943,8 +2942,7 @@ out_page_table_lock:
 	put_page(page);

 out_mutex:
-	mutex_unlock(&hugetlb_instantiation_mutex);
-
+	mutex_unlock(&vma->hugetlb_instantiation_mutex);
 	return ret;
 }

diff --git a/mm/mmap.c b/mm/mmap.c
index fbad7b0..8f0b034 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1543,6 +1543,9 @@ munmap_back:
 	vma->vm_page_prot = vm_get_page_prot(vm_flags);
 	vma->vm_pgoff = pgoff;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
+#ifdef CONFIG_HUGETLB_PAGE
+	mutex_init(&vma->hugetlb_instantiation_mutex);
+#endif

 	error = -EINVAL;	/* when rejecting
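Kirill's objection is that the file_region chains hang off the shared object, not off any one vma. A simplified model (hypothetical userspace C, ignoring the kernel's overlap merging) shows why: every vma of a VM_MAYSHARE file mutates the same list, so only a lock keyed on that shared list can protect the walk-and-modify.

```c
#include <stdlib.h>

/* Simplified model of the reservation regions: file_region nodes live
 * on a list owned by the shared object (the inode's mapping for
 * VM_MAYSHARE), which several vmas reach and modify concurrently. */
struct file_region_model {
    long from, to;                      /* hugepage-index range */
    struct file_region_model *next;
};

/* Add [from, to) to the shared list. In the kernel, this list
 * manipulation is what the instantiation mutex must protect -
 * a per-vma lock would let two vmas race on the same list. */
static void region_add(struct file_region_model **head, long from, long to)
{
    struct file_region_model *rg = malloc(sizeof(*rg));
    rg->from = from;
    rg->to   = to;
    rg->next = *head;
    *head    = rg;
}

/* Count pages covered, assuming non-overlapping ranges for brevity. */
static long region_count(const struct file_region_model *head)
{
    long n = 0;
    for (; head; head = head->next)
        n += head->to - head->from;
    return n;
}
```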
Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes
On Fri, 2013-07-12 at 17:54 -0700, Hugh Dickins wrote:
> Adding the essential David Gibson to the Cc list.
>
> On Fri, 12 Jul 2013, Davidlohr Bueso wrote:
>
> > The hugetlb_instantiation_mutex serializes hugepage allocation and instantiation in the page directory entry. It was found that this mutex can become quite contended during the early phases of large databases which make use of huge pages - for instance startup and initial runs. One clear example is a 1.5Gb Oracle database, where lockstat reports that this mutex can be one of the top 5 most contended locks in the kernel during the first few minutes:
> >
> > hugetlb_instantiation_mutex:  10678  10678
> > ---
> >   hugetlb_instantiation_mutex  10678  []  hugetlb_fault+0x9e/0x340
> > ---
> >   hugetlb_instantiation_mutex  10678  []  hugetlb_fault+0x9e/0x340
> >
> > contentions:     10678
> > acquisitions:    99476
> > waittime-total:  76888911.01 us
> >
> > Instead of serializing each hugetlb fault, we can deal with concurrent faults for pages in different vmas. The per-vma mutex is initialized when creating a new vma. So, back to the example above, we now get much less contention:
> >
> > vma->hugetlb_instantiation_mutex:  1  1
> > -
> >   vma->hugetlb_instantiation_mutex  1  []  hugetlb_fault+0xa6/0x350
> > -
> >   vma->hugetlb_instantiation_mutex  1  []  hugetlb_fault+0xa6/0x350
> >
> > contentions:     1
> > acquisitions:    108092
> > waittime-total:  621.24 us
> >
> > Signed-off-by: Davidlohr Bueso
>
> I agree this is a problem worth solving,
> but I doubt this patch is the right solution.

> > ---
> >  include/linux/mm_types.h | 3 +++
> >  mm/hugetlb.c             | 12 +---
> >  mm/mmap.c                | 3 +++
> >  3 files changed, 11 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index fb425aa..b45fd87 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -289,6 +289,9 @@ struct vm_area_struct {
> >  #ifdef CONFIG_NUMA
> >  	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
> >  #endif
> > +#ifdef CONFIG_HUGETLB_PAGE
> > +	struct mutex hugetlb_instantiation_mutex;
> > +#endif
> >  };
>
> Bloating every vm_area_struct with a rarely useful mutex:
> I'm sure you can construct cases where per-vma mutex would win over
> per-mm mutex, but they will have to be very common to justify the bloat.

I cannot disagree here, this was my main concern about this patch, and, as you mentioned, if we can just get rid of the need for the lock, much better.

> >  struct core_thread {
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 83aff0a..12e665b 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -137,12 +137,12 @@ static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
> >  * The region data structures are protected by a combination of the mmap_sem
> >  * and the hugetlb_instantion_mutex. To access or modify a region the caller
> >  * must either hold the mmap_sem for write, or the mmap_sem for read and
> > - * the hugetlb_instantiation mutex:
> > + * the vma's hugetlb_instantiation mutex:
>
> Reading the existing comment, this change looks very suspicious to me.
> A per-vma mutex is just not going to provide the necessary exclusion, is
> it? (But I recall next to nothing about these regions and reservations.)

> >  *
> >  *	down_write(&mm->mmap_sem);
> >  * or
> >  *	down_read(&mm->mmap_sem);
> > - *	mutex_lock(&hugetlb_instantiation_mutex);
> > + *	mutex_lock(&vma->hugetlb_instantiation_mutex);
> >  */
> >  struct file_region {
> >  	struct list_head link;
> > @@ -2547,7 +2547,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
> >
> >  /*
> >  * Hugetlb_cow() should be called with page lock of the original hugepage held.
> > - * Called with hugetlb_instantiation_mutex held and pte_page locked so we
> > + * Called with the vma's hugetlb_instantiation_mutex held and pte_page locked so we
> >  * cannot race with other handlers or page migration.
> >  * Keep the pte_same checks anyway to make transition from the mutex easier.
> >  */
> > @@ -2847,7 +2847,6 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  	int ret;
> >  	struct page *page = NULL;
> >  	struct page *pagecache_page = NULL;
> > -	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
> >  	struct hstate *h = hstate_vma(vma);
> >
> >  	address &= huge_page_mask(h);
> > @@ -2872,7 +2871,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  	 * get spurious allocation failures if two CPUs race to
Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes
Adding the essential David Gibson to the Cc list.

On Fri, 12 Jul 2013, Davidlohr Bueso wrote:

> The hugetlb_instantiation_mutex serializes hugepage allocation and instantiation in the page directory entry. It was found that this mutex can become quite contended during the early phases of large databases which make use of huge pages - for instance startup and initial runs. One clear example is a 1.5Gb Oracle database, where lockstat reports that this mutex can be one of the top 5 most contended locks in the kernel during the first few minutes:
>
> hugetlb_instantiation_mutex:  10678  10678
> ---
>   hugetlb_instantiation_mutex  10678  []  hugetlb_fault+0x9e/0x340
> ---
>   hugetlb_instantiation_mutex  10678  []  hugetlb_fault+0x9e/0x340
>
> contentions:     10678
> acquisitions:    99476
> waittime-total:  76888911.01 us
>
> Instead of serializing each hugetlb fault, we can deal with concurrent faults for pages in different vmas. The per-vma mutex is initialized when creating a new vma. So, back to the example above, we now get much less contention:
>
> vma->hugetlb_instantiation_mutex:  1  1
> -
>   vma->hugetlb_instantiation_mutex  1  []  hugetlb_fault+0xa6/0x350
> -
>   vma->hugetlb_instantiation_mutex  1  []  hugetlb_fault+0xa6/0x350
>
> contentions:     1
> acquisitions:    108092
> waittime-total:  621.24 us
>
> Signed-off-by: Davidlohr Bueso

I agree this is a problem worth solving, but I doubt this patch is the right solution.

> ---
>  include/linux/mm_types.h | 3 +++
>  mm/hugetlb.c             | 12 +---
>  mm/mmap.c                | 3 +++
>  3 files changed, 11 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index fb425aa..b45fd87 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -289,6 +289,9 @@ struct vm_area_struct {
>  #ifdef CONFIG_NUMA
>  	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
>  #endif
> +#ifdef CONFIG_HUGETLB_PAGE
> +	struct mutex hugetlb_instantiation_mutex;
> +#endif
>  };

Bloating every vm_area_struct with a rarely useful mutex: I'm sure you can construct cases where per-vma mutex would win over per-mm mutex, but they will have to be very common to justify the bloat.

>  struct core_thread {
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 83aff0a..12e665b 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -137,12 +137,12 @@ static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
>  * The region data structures are protected by a combination of the mmap_sem
>  * and the hugetlb_instantion_mutex. To access or modify a region the caller
>  * must either hold the mmap_sem for write, or the mmap_sem for read and
> - * the hugetlb_instantiation mutex:
> + * the vma's hugetlb_instantiation mutex:

Reading the existing comment, this change looks very suspicious to me. A per-vma mutex is just not going to provide the necessary exclusion, is it? (But I recall next to nothing about these regions and reservations.)

>  *
>  *	down_write(&mm->mmap_sem);
>  * or
>  *	down_read(&mm->mmap_sem);
> - *	mutex_lock(&hugetlb_instantiation_mutex);
> + *	mutex_lock(&vma->hugetlb_instantiation_mutex);
>  */
>  struct file_region {
>  	struct list_head link;
> @@ -2547,7 +2547,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
>
>  /*
>  * Hugetlb_cow() should be called with page lock of the original hugepage held.
> - * Called with hugetlb_instantiation_mutex held and pte_page locked so we
> + * Called with the vma's hugetlb_instantiation_mutex held and pte_page locked so we
>  * cannot race with other handlers or page migration.
>  * Keep the pte_same checks anyway to make transition from the mutex easier.
>  */
> @@ -2847,7 +2847,6 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	int ret;
>  	struct page *page = NULL;
>  	struct page *pagecache_page = NULL;
> -	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
>  	struct hstate *h = hstate_vma(vma);
>
>  	address &= huge_page_mask(h);
> @@ -2872,7 +2871,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	 * get spurious allocation failures if two CPUs race to instantiate
>  	 * the same page in the page cache.
>  	 */
> -	mutex_lock(&hugetlb_instantiation_mutex);
> +	mutex_lock(&vma->hugetlb_instantiation_mutex);
>  	entry = huge_ptep_get(ptep);
>  	if (huge_pte_none(entry)) {
>  		ret = hugetlb_no_page(mm, vma, address, ptep, flags);
> @@ -2943,8 +2942,7 @@ out_page_table_lock:
>  	put_page(page);
>
> out_mutex:
> -	mutex_unlock(&hugetlb_instantiation_mutex);
> -
> +