On 4 Oct 2018, at 16:17, David Rientjes wrote:

> On Wed, 26 Sep 2018, Kirill A. Shutemov wrote:
>
>> On Tue, Sep 25, 2018 at 02:03:26PM +0200, Michal Hocko wrote:
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index c3bc7e9c9a2a..c0bcede31930 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -629,21 +629,40 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>>>   *     available
>>>   * never: never stall for any thp allocation
>>>   */
>>> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
>>> +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
>>>  {
>>>     const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
>>> +   gfp_t this_node = 0;
>>> +
>>> +#ifdef CONFIG_NUMA
>>> +   struct mempolicy *pol;
>>> +   /*
>>> +    * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
>>> +    * specified, to express a general desire to stay on the current
>>> +    * node for optimistic allocation attempts. If the defrag mode
>>> +    * and/or madvise hint requires the direct reclaim then we prefer
>>> +    * to fallback to other node rather than node reclaim because that
>>> +    * can lead to excessive reclaim even though there is free memory
>>> +    * on other nodes. We expect that NUMA preferences are specified
>>> +    * by memory policies.
>>> +    */
>>> +   pol = get_vma_policy(vma, addr);
>>> +   if (pol->mode != MPOL_BIND)
>>> +           this_node = __GFP_THISNODE;
>>> +   mpol_cond_put(pol);
>>> +#endif
>>
>> I'm not very good with NUMA policies. Could you explain in more detail how
>> the code above is equivalent to the code below?
>>
>
> It breaks mbind() because new_page() is now using numa_node_id() to
> allocate migration targets instead of using the mempolicy.  I'm not
> sure that this patch was tested for mbind().

I do not see that mbind() is broken. With both patches applied, I ran
"numactl -N 0 memhog -r1 4096m membind 1" and saw that all pages were
allocated on Node 1, not on Node 0, the node returned by numa_node_id().

From the source code, in alloc_pages_vma(), the nodemask is generated
from the memory policy (i.e. mbind in the case above), so it contains
only the nodes specified by mbind(). Then, __alloc_pages_nodemask() only
uses the zones from that nodemask. The numa_node_id() return value is
therefore ignored in the actual page allocation if an mbind policy is
applied.

Let me know if I missed anything.


--
Best Regards
Yan Zi