On 06/23/2014 05:52 PM, Vlastimil Babka wrote:
> On 06/23/2014 07:39 AM, Zhang Yanfei wrote:
>> Hello
>>
>> On 06/21/2014 01:45 AM, Kirill A. Shutemov wrote:
>>> On Fri, Jun 20, 2014 at 05:49:31PM +0200, Vlastimil Babka wrote:
>>>> When allocating huge page for collapsing, khugepaged currently holds 
>>>> mmap_sem
>>>> for reading on the mm where collapsing occurs. Afterwards the read lock is
>>>> dropped before write lock is taken on the same mmap_sem.
>>>>
>>>> Holding mmap_sem during whole huge page allocation is therefore useless, 
>>>> the
>>>> vma needs to be rechecked after taking the write lock anyway. Furthemore, 
>>>> huge
>>>> page allocation might involve a rather long sync compaction, and thus block
>>>> any mmap_sem writers and i.e. affect workloads that perform frequent 
>>>> m(un)map
>>>> or mprotect oterations.
>>>>
>>>> This patch simply releases the read lock before allocating a huge page. It
>>>> also deletes an outdated comment that assumed vma must be stable, as it was
>>>> using alloc_hugepage_vma(). This is no longer true since commit 9f1b868a13
>>>> ("mm: thp: khugepaged: add policy for finding target node").
>>>
>>> There is no point in touching ->mmap_sem in khugepaged_alloc_page() at
>>> all. Please, move up_read() outside khugepaged_alloc_page().
>>>
> 
> Well there's also currently no point in passing several parameters to 
> khugepaged_alloc_page(). So I could clean it up as well, but I imagine later 
> we would perhaps reintroduce them back, as I don't think the current 
> situation is ideal for at least two reasons.
> 
> 1. If you read commit 9f1b868a13 ("mm: thp: khugepaged: add policy for 
> finding target node"), it's based on a report where somebody found that 
> mempolicy is not observed properly when collapsing THP's. But the 'policy' 
> introduced by the commit isn't based on real mempolicy, it might just under 
> certain conditions results in an interleave, which happens to be what the 
> reporter was trying.
> 
> So ideally, it should be making node allocation decisions based on where the 
> original 4KB pages are located. For example, allocate a THP only if all the 
> 4KB pages are on the same node. That would also automatically obey any policy 
> that has lead to the allocation of those 4KB pages.
> 
> And for this, it will need again the parameters and mmap_sem in read mode. It 
> would be however still a good idea to drop mmap_sem before the allocation 
> itself, since compaction/reclaim might take some time...
> 
> 2. (less related) I'd expect khugepaged to first allocate a hugepage and then 
> scan for collapsing. Yes there's khugepaged_prealloc_page, but that only does 
> something on !NUMA systems and these are not the future.
> Although I don't have the data, I expect allocating a hugepage is a bigger 
> issue than finding something that could be collapsed. So why scan for 
> collapsing if in the end I cannot allocate a hugepage? And if I really cannot 
> find something to collapse, would e.g. caching a single hugepage per node be 
> a big hit? Also, if there's really nothing to collapse, then it means 
> khugepaged won't compact. And since khugepaged is becoming the only source of 
> sync compaction that doesn't give up easily and tries to e.g. migrate movable 
> pages out of unmovable pageblocks, this might have bad effects on 
> fragmentation.
> I believe this could be done smarter.
> 
>> I might be wrong. If we up_read in khugepaged_scan_pmd(), then if we round 
>> again
>> do the for loop to get the next vma and handle it. Does we do this without 
>> holding
>> the mmap_sem in any mode?
>>
>> And if the loop end, we have another up_read in breakouterloop. What if we 
>> have
>> released the mmap_sem in collapse_huge_page()?
> 
> collapse_huge_page() is only called from khugepaged_scan_pmd() in the if 
> (ret) condition. And khugepaged_scan_mm_slot() has similar if (ret) for the 
> return value of khugepaged_scan_pmd() to break out of the loop (and not doing 
> up_read() again). So I think this is correct and moving up_read from 
> khugepaged_alloc_page() to collapse_huge_page() wouldn't
> change this?

Ah, right.

> 
> 
> .
> 


-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to