On 06/23/2014 05:52 PM, Vlastimil Babka wrote: > On 06/23/2014 07:39 AM, Zhang Yanfei wrote: >> Hello >> >> On 06/21/2014 01:45 AM, Kirill A. Shutemov wrote: >>> On Fri, Jun 20, 2014 at 05:49:31PM +0200, Vlastimil Babka wrote: >>>> When allocating huge page for collapsing, khugepaged currently holds >>>> mmap_sem >>>> for reading on the mm where collapsing occurs. Afterwards the read lock is >>>> dropped before write lock is taken on the same mmap_sem. >>>> >>>> Holding mmap_sem during whole huge page allocation is therefore useless, >>>> the >>>> vma needs to be rechecked after taking the write lock anyway. Furthemore, >>>> huge >>>> page allocation might involve a rather long sync compaction, and thus block >>>> any mmap_sem writers and i.e. affect workloads that perform frequent >>>> m(un)map >>>> or mprotect oterations. >>>> >>>> This patch simply releases the read lock before allocating a huge page. It >>>> also deletes an outdated comment that assumed vma must be stable, as it was >>>> using alloc_hugepage_vma(). This is no longer true since commit 9f1b868a13 >>>> ("mm: thp: khugepaged: add policy for finding target node"). >>> >>> There is no point in touching ->mmap_sem in khugepaged_alloc_page() at >>> all. Please, move up_read() outside khugepaged_alloc_page(). >>> > > Well there's also currently no point in passing several parameters to > khugepaged_alloc_page(). So I could clean it up as well, but I imagine later > we would perhaps reintroduce them back, as I don't think the current > situation is ideal for at least two reasons. > > 1. If you read commit 9f1b868a13 ("mm: thp: khugepaged: add policy for > finding target node"), it's based on a report where somebody found that > mempolicy is not observed properly when collapsing THP's. But the 'policy' > introduced by the commit isn't based on real mempolicy, it might just under > certain conditions results in an interleave, which happens to be what the > reporter was trying. > > So ideally, it should be making node allocation decisions based on where the > original 4KB pages are located. For example, allocate a THP only if all the > 4KB pages are on the same node. That would also automatically obey any policy > that has lead to the allocation of those 4KB pages. > > And for this, it will need again the parameters and mmap_sem in read mode. It > would be however still a good idea to drop mmap_sem before the allocation > itself, since compaction/reclaim might take some time... > > 2. (less related) I'd expect khugepaged to first allocate a hugepage and then > scan for collapsing. Yes there's khugepaged_prealloc_page, but that only does > something on !NUMA systems and these are not the future. > Although I don't have the data, I expect allocating a hugepage is a bigger > issue than finding something that could be collapsed. So why scan for > collapsing if in the end I cannot allocate a hugepage? And if I really cannot > find something to collapse, would e.g. caching a single hugepage per node be > a big hit? Also, if there's really nothing to collapse, then it means > khugepaged won't compact. And since khugepaged is becoming the only source of > sync compaction that doesn't give up easily and tries to e.g. migrate movable > pages out of unmovable pageblocks, this might have bad effects on > fragmentation. > I believe this could be done smarter. > >> I might be wrong. If we up_read in khugepaged_scan_pmd(), then if we round >> again >> do the for loop to get the next vma and handle it. Does we do this without >> holding >> the mmap_sem in any mode? >> >> And if the loop end, we have another up_read in breakouterloop. What if we >> have >> released the mmap_sem in collapse_huge_page()? > > collapse_huge_page() is only called from khugepaged_scan_pmd() in the if > (ret) condition. And khugepaged_scan_mm_slot() has similar if (ret) for the > return value of khugepaged_scan_pmd() to break out of the loop (and not doing > up_read() again). So I think this is correct and moving up_read from > khugepaged_alloc_page() to collapse_huge_page() wouldn't > change this?
Ah, right. > > > . > -- Thanks. Zhang Yanfei -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/