Re: [PATCH 1/4] mm/hugetlb: Enable PUD level huge page migration
On 10/10/2018 03:09 PM, Michal Hocko wrote: > On Wed 10-10-18 08:39:22, Anshuman Khandual wrote: > [...] >> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h >> index 9df1d59..4bcbf1e 100644 >> --- a/include/linux/hugetlb.h >> +++ b/include/linux/hugetlb.h >> @@ -504,6 +504,16 @@ static inline bool hugepage_migration_supported(struct >> hstate *h) >> return arch_hugetlb_migration_supported(h); >> } >> >> +static inline bool hugepage_movable_supported(struct hstate *h) >> +{ >> + if (!hugepage_migration_supported(h)) --> calls arch override >> restricting the set >> + return false; >> + >> + if (hstate_is_gigantic(h)> restricts the set further >> + return false; >> + return true; >> +} >> + >> static inline spinlock_t *huge_pte_lockptr(struct hstate *h, >>struct mm_struct *mm, pte_t *pte) >> { >> @@ -600,6 +610,11 @@ static inline bool hugepage_migration_supported(struct >> hstate *h) >> return false; >> } >> >> +static inline bool hugepage_movable_supported(struct hstate *h) >> +{ >> + return false; >> +} >> + >> static inline spinlock_t *huge_pte_lockptr(struct hstate *h, >>struct mm_struct *mm, pte_t *pte) >> { >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c >> index 3c21775..a5a111d 100644 >> --- a/mm/hugetlb.c >> +++ b/mm/hugetlb.c >> @@ -919,7 +919,7 @@ static struct page *dequeue_huge_page_nodemask(struct >> hstate *h, gfp_t gfp_mask, >> /* Movability of hugepages depends on migration support. */ >> static inline gfp_t htlb_alloc_mask(struct hstate *h) >> { >> - if (hugepage_migration_supported(h)) >> + if (hugepage_movable_supported(h)) >> return GFP_HIGHUSER_MOVABLE; >> else >> return GFP_HIGHUSER; > > Exactly what I've had in mind. It would be great to have a comment in > hugepage_movable_supported to explain why we are not supporting giga > pages even though they are migrateable and why we need that distinction. sure, will do. > >> The above patch is in addition to the following later patch in the series. > [...] >> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h >> index 9c1b77f..9df1d59 100644 >> --- a/include/linux/hugetlb.h >> +++ b/include/linux/hugetlb.h >> @@ -479,18 +479,29 @@ static inline pgoff_t basepage_index(struct page *page) >> extern int dissolve_free_huge_page(struct page *page); >> extern int dissolve_free_huge_pages(unsigned long start_pfn, >> unsigned long end_pfn); >> -static inline bool hugepage_migration_supported(struct hstate *h) >> -{ >> + >> #ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION >> +#ifndef arch_hugetlb_migration_supported >> +static inline bool arch_hugetlb_migration_supported(struct hstate *h) >> +{ >> if ((huge_page_shift(h) == PMD_SHIFT) || >> (huge_page_shift(h) == PUD_SHIFT) || >> (huge_page_shift(h) == PGDIR_SHIFT)) >> return true; >> else >> return false; >> +} >> +#endif >> #else >> +static inline bool arch_hugetlb_migration_supported(struct hstate *h) >> +{ >> return false; >> +} >> #endif >> + >> +static inline bool hugepage_migration_supported(struct hstate *h) >> +{ >> + return arch_hugetlb_migration_supported(h); >> } > > Yes making hugepage_migration_supported to have an arch override is > definitely the right thing to do. Whether the above approach rather than > a weak symbol is better is a matter of taste and I do not feel strongly > about that. Okay then, will carry this forward and re-spin the patch series. Thank you for your detailed review till now.
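For reference, the kind of comment Michal is asking for could read roughly like the sketch below (wording here is illustrative only; the commented helper itself lands in patch 2/5 of the V2 series):

/*
 * Movability (placement in ZONE_MOVABLE) is a stricter property than
 * migratability. A huge page size must be migratable to begin with,
 * and on top of that it must be practical to migrate it out of a
 * movable zone. Gigantic pages are migratable in principle but are
 * kept out of ZONE_MOVABLE because migrating them is rarely feasible.
 */
static inline bool hugepage_movable_supported(struct hstate *h)
{
	if (!hugepage_migration_supported(h))
		return false;

	if (hstate_is_gigantic(h))
		return false;

	return true;
}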
Re: [PATCH] arm64/numa: Add more vetting in numa_set_distance()
On 10/29/2018 06:02 PM, John Garry wrote: > On 29/10/2018 12:16, Will Deacon wrote: >> On Mon, Oct 29, 2018 at 12:14:09PM +, John Garry wrote: >>> On 29/10/2018 11:25, Will Deacon wrote: On Fri, Oct 26, 2018 at 09:57:47PM +0800, John Garry wrote: > Currently it is acceptable to set the distance between 2 separate nodes to > LOCAL_DISTANCE. > > Reject this as it is invalid. > > This change avoids a crash reported in [1]. > > [1] https://www.spinics.net/lists/arm-kernel/msg683304.html > > Signed-off-by: John Garry > > diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c > index 146c04c..6092e3d 100644 > --- a/arch/arm64/mm/numa.c > +++ b/arch/arm64/mm/numa.c > @@ -335,7 +335,8 @@ void __init numa_set_distance(int from, int to, int > distance) > } > > if ((u8)distance != distance || > - (from == to && distance != LOCAL_DISTANCE)) { > + (from == to && distance != LOCAL_DISTANCE) || > + (from != to && distance == LOCAL_DISTANCE)) { The current code here is more-or-less lifted from the x86 implementation of numa_set_distance(). >>> >>> Right, I did notice this. I didn't think that x86 folks would be so >>> concerned since they generally only use ACPI, and the ACPI code already >>> validates these distances in drivers/acpi/numa.c: slit_valid() [unlike OF >>> code]. >>> >>> I think we should either factor out the sanity check into a core helper or make the core code robust to these funny configurations. >>> >>> OK, so to me it would make sense to factor out a sanity check into a core >>> helper. >> >> That, or have the OF code perform the same validation that slit_valid() is >> doing for ACPI. I'm just trying to avoid other architectures running into >> this problem down the line. >> > > Right, OF code should do this validation job if ACPI is doing it (especially > since the DT bindings actually specify the distance rules), and not rely on > the arch NUMA code to accept/reject numa_set_distance() combinations. I would say this particular condition checking still falls under arch NUMA init code sanity check like other basic tests what numa_set_distance() currently does already but it should not be a necessity for the OF driver to check these. It can choose to check but arch NUMA should check basic things like two different NUMA nodes should not have LOCAL_DISTANCE as distance like in this case. (from == to && distance != LOCAL_DISTANCE) || (from != to && distance == LOCAL_DISTANCE)) > > And, in addition to this, I'd say OF should disable NUMA if given an invalid > table (like ACPI does). Taking a decision to disable NUMA should be with kernel (arch NUMA) once kernel starts booting. Platform should have sent right values, OF driver trying to adjust stuff what platform has sent with FDT once the kernel starts booting is not right. For example "Kernel NUMA wont like the distance factors lets clean then up before passing on to MM". Disabling NUMA is one such major decision which should be with arch NUMA code not with OF driver.
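To make the arch-side checks concrete, the per-entry sanity rules being argued about amount to something like the helper below (untested sketch, helper name is made up; it simply folds John's new condition into the tests numa_set_distance() already performs, node range checks omitted):

static bool __init numa_distance_ok(int from, int to, int distance)
{
	/* the distance table stores u8 values */
	if ((u8)distance != distance)
		return false;

	/* a node is always at LOCAL_DISTANCE from itself */
	if (from == to && distance != LOCAL_DISTANCE)
		return false;

	/* two different nodes must never be at LOCAL_DISTANCE */
	if (from != to && distance == LOCAL_DISTANCE)
		return false;

	return true;
}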
Re: [PATCH] arm64/numa: Add more vetting in numa_set_distance()
On 10/29/2018 08:14 PM, John Garry wrote: > > I think we should either factor out the sanity check >> into a core helper or make the core code robust to these funny >> configurations. > > OK, so to me it would make sense to factor out a sanity check into a core > helper. That, or have the OF code perform the same validation that slit_valid() is doing for ACPI. I'm just trying to avoid other architectures running into this problem down the line. >>> >>> Right, OF code should do this validation job if ACPI is doing it >>> (especially since the DT bindings actually specify the distance rules), and >>> not rely on the arch NUMA code to accept/reject numa_set_distance() >>> combinations. >> >> I would say this particular condition checking still falls under arch NUMA >> init >> code sanity check like other basic tests what numa_set_distance() currently >> does >> already but it should not be a necessity for the OF driver to check these. > > The checks in the arch NUMA code mean that invalid inter-node distance > combinations are ignored. Right and should not this new test (from != to && distance == LOCAL_DISTANCE) be one of them as well ? numa_set_distance() updates the table or just throws some warnings while skipping entries it deems invalid. It would be okay to have this new check there in addition to others like this patch suggests. > > However, if any entries in the table are invalid, then the whole table can be > discarded as none of it can be believed, i.e. it's better to validate the > table. > Agreed. slit_valid() on the ACPI parsing is currently enforcing that before acpi_numa_slit_init() which would call into numa_set_distance(). Hence arch NUMA code numa_set_distance() never had the opportunity to do the sanity checks as ACPI slit_valid() has completely invalidated the table. Unlike ACPI path, of_numa_parse_distance_map_v1() does not do any sanity checks on the distance values parse from the "distance-matrix" property and all the checks directly falls on numa_set_distance(). This needs to be fixed in line with ACPI * If (to == from) ---> distance = LOCAL_DISTANCE * If (to != from) ---> distance > LOCAL_DISTANCE At the same time its okay to just enhance numa_set_distance() test coverage to include this new test. If we would have trusted firmware parsing all the way, existing basic checks about node range, distance stuff should not have been there in numa_set_distance(). Hence IMHO even if we fix the OF driver part, we should include this new check there as well. > It can >> choose to check but arch NUMA should check basic things like two different >> NUMA >> nodes should not have LOCAL_DISTANCE as distance like in this case. >> >> (from == to && distance != LOCAL_DISTANCE) || >> (from != to && distance == LOCAL_DISTANCE)) >> >> >>> >>> And, in addition to this, I'd say OF should disable NUMA if given an >>> invalid table (like ACPI does). >> >> Taking a decision to disable NUMA should be with kernel (arch NUMA) once >> kernel >> starts booting. Platform should have sent right values, OF driver trying to >> adjust stuff what platform has sent with FDT once the kernel starts booting >> is >> not right. For example "Kernel NUMA wont like the distance factors lets clean >> then up before passing on to MM". > > Sorry, but I don't know who was advocating this. I was just giving an example. Invalidating NUMA distance table during firmware table (ACPI or FDT) parsing forces arm64_numa_init() to fall back on dummy NUMA node which is like disabling NUMA. 
But that is the current semantics with ACPI parsing which I overlooked. Fixing OF driver to do the same wont extend this any further, hence my previous concern does not stand valid. > > Disabling NUMA is one such major decision which >> should be with arch NUMA code not with OF driver. > > I meant parsing the table would fail, so arch NUMA would fall back on dummy > NUMA. Right and ACPI parsing does that and can force a fallback on a dummy NUMA node.
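In code, the OF-side validation being asked for is small; inside of_numa_parse_distance_map_v1() it would look roughly like this (sketch only, mirroring the two rules above the way slit_valid() enforces them for ACPI):

	if ((nodea == nodeb && distance != LOCAL_DISTANCE) ||
	    (nodea != nodeb && distance <= LOCAL_DISTANCE)) {
		pr_err("Invalid distance[node%d -> node%d] = %d\n",
		       nodea, nodeb, distance);
		return -EINVAL;
	}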
Re: [PATCH] arm64/numa: Add more vetting in numa_set_distance()
On 10/29/2018 08:18 PM, Will Deacon wrote: > On Mon, Oct 29, 2018 at 06:15:42PM +0530, Anshuman Khandual wrote: >> On 10/29/2018 06:02 PM, John Garry wrote: >>> On 29/10/2018 12:16, Will Deacon wrote: >>>> On Mon, Oct 29, 2018 at 12:14:09PM +, John Garry wrote: >>>>> On 29/10/2018 11:25, Will Deacon wrote: >>>>>> On Fri, Oct 26, 2018 at 09:57:47PM +0800, John Garry wrote: >>>>>>> Currently it is acceptable to set the distance between 2 separate nodes >>>>>>> to >>>>>>> LOCAL_DISTANCE. >>>>>>> >>>>>>> Reject this as it is invalid. >>>>>>> >>>>>>> This change avoids a crash reported in [1]. >>>>>>> >>>>>>> [1] https://www.spinics.net/lists/arm-kernel/msg683304.html >>>>>>> >>>>>>> Signed-off-by: John Garry >>>>>>> >>>>>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c >>>>>>> index 146c04c..6092e3d 100644 >>>>>>> --- a/arch/arm64/mm/numa.c >>>>>>> +++ b/arch/arm64/mm/numa.c >>>>>>> @@ -335,7 +335,8 @@ void __init numa_set_distance(int from, int to, int >>>>>>> distance) >>>>>>> } >>>>>>> >>>>>>> if ((u8)distance != distance || >>>>>>> - (from == to && distance != LOCAL_DISTANCE)) { >>>>>>> + (from == to && distance != LOCAL_DISTANCE) || >>>>>>> + (from != to && distance == LOCAL_DISTANCE)) { >>>>>> >>>>>> The current code here is more-or-less lifted from the x86 implementation >>>>>> of numa_set_distance(). >>>>> >>>>> Right, I did notice this. I didn't think that x86 folks would be so >>>>> concerned since they generally only use ACPI, and the ACPI code already >>>>> validates these distances in drivers/acpi/numa.c: slit_valid() [unlike OF >>>>> code]. >>>>> >>>>> I think we should either factor out the sanity check >>>>>> into a core helper or make the core code robust to these funny >>>>>> configurations. >>>>> >>>>> OK, so to me it would make sense to factor out a sanity check into a core >>>>> helper. >>>> >>>> That, or have the OF code perform the same validation that slit_valid() is >>>> doing for ACPI. I'm just trying to avoid other architectures running into >>>> this problem down the line. >>>> >>> >>> Right, OF code should do this validation job if ACPI is doing it >>> (especially since the DT bindings actually specify the distance rules), >>> and not rely on the arch NUMA code to accept/reject numa_set_distance() >>> combinations. >> >> I would say this particular condition checking still falls under arch NUMA >> init >> code sanity check like other basic tests what numa_set_distance() currently >> does >> already but it should not be a necessity for the OF driver to check these. >> It can >> choose to check but arch NUMA should check basic things like two different >> NUMA >> nodes should not have LOCAL_DISTANCE as distance like in this case. >> >> (from == to && distance != LOCAL_DISTANCE) || >> (from != to && distance == LOCAL_DISTANCE)) >> >> >>> >>> And, in addition to this, I'd say OF should disable NUMA if given an >>> invalid table (like ACPI does). >> >> Taking a decision to disable NUMA should be with kernel (arch NUMA) once >> kernel >> starts booting. Platform should have sent right values, OF driver trying to >> adjust stuff what platform has sent with FDT once the kernel starts booting >> is >> not right. For example "Kernel NUMA wont like the distance factors lets clean >> then up before passing on to MM". Disabling NUMA is one such major decision >> which >> should be with arch NUMA code not with OF driver. > > I don't fully understand what you're getting at here, but why would the > check posted by John be arch-specific? 
It's already done in the core code > for ACPI, so there's a discrepancy between ACPI and FDT that should be > resolved. I'd also argue that the subtleties of this check are actually > based on what the core code is willing to accept in terms of the NUMA > description, so it's also the best place to enforce it. Agreed. I had overlooked the existing semantics with respect to ACPI parsing. Yes, there is a discrepancy with respect to FDT which should be fixed. But IMHO its also worth to enhance numa_set_distance() checks with this proposed new check as well.
Re: [PATCH] mm/thp: Correctly differentiate between mapped THP and PMD migration entry
On 10/15/2018 06:23 AM, Zi Yan wrote: > On 12 Oct 2018, at 4:00, Anshuman Khandual wrote: > >> On 10/10/2018 06:13 PM, Zi Yan wrote: >>> On 10 Oct 2018, at 0:05, Anshuman Khandual wrote: >>> >>>> On 10/09/2018 07:28 PM, Zi Yan wrote: >>>>> cc: Naoya Horiguchi (who proposed to use !_PAGE_PRESENT && !_PAGE_PSE for >>>>> x86 >>>>> PMD migration entry check) >>>>> >>>>> On 8 Oct 2018, at 23:58, Anshuman Khandual wrote: >>>>> >>>>>> A normal mapped THP page at PMD level should be correctly differentiated >>>>>> from a PMD migration entry while walking the page table. A mapped THP >>>>>> would >>>>>> additionally check positive for pmd_present() along with pmd_trans_huge() >>>>>> as compared to a PMD migration entry. This just adds a new conditional >>>>>> test >>>>>> differentiating the two while walking the page table. >>>>>> >>>>>> Fixes: 616b8371539a6 ("mm: thp: enable thp migration in generic path") >>>>>> Signed-off-by: Anshuman Khandual >>>>>> --- >>>>>> On X86, pmd_trans_huge() and is_pmd_migration_entry() are always mutually >>>>>> exclusive which makes the current conditional block work for both mapped >>>>>> and migration entries. This is not same with arm64 where pmd_trans_huge() >>>>> >>>>> !pmd_present() && pmd_trans_huge() is used to represent THPs under >>>>> splitting, >>>> >>>> Not really if we just look at code in the conditional blocks. >>> >>> Yeah, I explained it wrong above. Sorry about that. >>> >>> In x86, pmd_present() checks (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE), >>> thus, it returns true even if the present bit is cleared but PSE bit is set. >> >> Okay. >> >>> This is done so, because THPs under splitting are regarded as present in >>> the kernel >>> but not present when a hardware page table walker checks it. >> >> Okay. >> >>> >>> For PMD migration entry, which should be regarded as not present, if PSE bit >>> is set, which makes pmd_trans_huge() returns true, like ARM64 does, all >>> PMD migration entries will be regarded as present >> >> Okay to make pmd_present() return false pmd_trans_huge() has to return false >> as well. Is there anything which can be done to get around this problem on >> X86 ? pmd_trans_huge() returning true for a migration entry sounds logical. >> Otherwise we would revert the condition block order to accommodate both the >> implementation for pmd_trans_huge() as suggested by Kirill before or just >> consider this patch forward. >> >> Because I am not really sure yet about the idea of getting pmd_present() >> check into pmd_trans_huge() on arm64 just to make it fit into this semantics >> as suggested by Will. If a PMD is trans huge page or not should not depend on >> whether it is present or not. > > In terms of THPs, we have three cases: a present THP, a THP under splitting, > and a THP under migration. pmd_present() and pmd_trans_huge() both return true > for a present THP and a THP under splitting, because they discover _PAGE_PSE > bit Then how do we differentiate between a mapped THP and a splitting THP. > is set for both cases, whereas they both return false for a THP under > migration. > You want to change them to make pmd_trans_huge() returns true for a THP under > migration > instead of false to help ARM64’s support for THP migration. I am just trying to understand the rationale behind this semantics and see where it should be fixed. I think the fundamental problem here is that THP under split has been difficult to be re-presented through the available helper functions and in turn PTE bits. 
The following checks 1) pmd_present() 2) pmd_trans_huge() Represent three THP states 1) Mapped THP (pmd_present && pmd_trans_huge) 2) Splitting THP(pmd_present && pmd_trans_huge) 3) Migrating THP(!pmd_present && !pmd_trans_huge) The problem is if we make pmd_trans_huge() return true for all the three states which sounds logical because they are all still trans huge PMD, then pmd_present() can only represent two states not three as required. > > For x86, this change requires: > 1. changing the condition in pmd_trans_huge(), so that it returns true for > PMD migration entries; > 2. changing the code, which calls pmd_trans_huge(), to match the new logic. Ca
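To spell out the x86 semantics this discussion is based on, a page table walker ends up discriminating the cases roughly like this (illustrative sketch only, assuming pmd is a pmd_t pointer into the page table):

	pmd_t pmdval = READ_ONCE(*pmd);

	if (pmd_present(pmdval) && pmd_trans_huge(pmdval)) {
		/*
		 * Mapped THP, or a THP under splitting: the hardware
		 * present bit may be clear but _PAGE_PSE keeps both
		 * pmd_present() and pmd_trans_huge() returning true.
		 */
	} else if (!pmd_present(pmdval) && is_pmd_migration_entry(pmdval)) {
		/*
		 * PMD migration entry: a swap style entry, so both
		 * pmd_present() and pmd_trans_huge() are false on x86.
		 */
	} else {
		/* regular page table pointer or empty entry */
	}

The point of contention is that on arm64 the second case currently lands in the first branch, because pmd_trans_huge() does not consult the valid bit.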
Re: [PATCH] mm/thp: Correctly differentiate between mapped THP and PMD migration entry
On 10/15/2018 02:02 PM, Aneesh Kumar K.V wrote: > On 10/12/18 1:32 PM, Anshuman Khandual wrote: >> >> >> On 10/09/2018 06:48 PM, Will Deacon wrote: >>> On Tue, Oct 09, 2018 at 04:04:21PM +0300, Kirill A. Shutemov wrote: >>>> On Tue, Oct 09, 2018 at 09:28:58AM +0530, Anshuman Khandual wrote: >>>>> A normal mapped THP page at PMD level should be correctly differentiated >>>>> from a PMD migration entry while walking the page table. A mapped THP >>>>> would >>>>> additionally check positive for pmd_present() along with pmd_trans_huge() >>>>> as compared to a PMD migration entry. This just adds a new conditional >>>>> test >>>>> differentiating the two while walking the page table. >>>>> >>>>> Fixes: 616b8371539a6 ("mm: thp: enable thp migration in generic path") >>>>> Signed-off-by: Anshuman Khandual >>>>> --- >>>>> On X86, pmd_trans_huge() and is_pmd_migration_entry() are always mutually >>>>> exclusive which makes the current conditional block work for both mapped >>>>> and migration entries. This is not same with arm64 where pmd_trans_huge() >>>>> returns positive for both mapped and migration entries. Could some one >>>>> please explain why pmd_trans_huge() has to return false for migration >>>>> entries which just install swap bits and its still a PMD ? >>>> >>>> I guess it's just a design choice. Any reason why arm64 cannot do the >>>> same? >>> >>> Anshuman, would it work to: >>> >>> #define pmd_trans_huge(pmd) (pmd_present(pmd) && !(pmd_val(pmd) & >>> PMD_TABLE_BIT)) >> yeah this works but some how does not seem like the right thing to do >> but can be the very last option. >> > > > There can be other code paths that makes that assumption. I ended up doing > the below for pmd_trans_huge on ppc64. > Yeah, did see that in one of the previous proposals. https://patchwork.kernel.org/patch/10544291/ But the existing semantics does not look right and makes vague assumptions. Zi Yan has already asked Andrea for his input in this regard on the next thread. So I guess while being here, its a good idea to revisit existing semantics and it's assumptions before fixing it in arch specific helpers. - Anshuman > /* > * Only returns true for a THP. False for pmd migration entry. > * We also need to return true when we come across a pte that > * in between a thp split. While splitting THP, we mark the pmd > * invalid (pmdp_invalidate()) before we set it with pte page > * address. A pmd_trans_huge() check against a pmd entry during that time > * should return true. > * We should not call this on a hugetlb entry. We should check for HugeTLB > * entry using vma->vm_flags > * The page table walk rule is explained in Documentation/vm/transhuge.rst > */ > static inline int pmd_trans_huge(pmd_t pmd) > { > if (!pmd_present(pmd)) > return false; > > if (radix_enabled()) > return radix__pmd_trans_huge(pmd); > return hash__pmd_trans_huge(pmd); > } > > -aneesh >
Re: [PATCH V2 0/5] arm64/mm: Enable HugeTLB migration
On 10/12/2018 09:29 AM, Anshuman Khandual wrote: > This patch series enables HugeTLB migration support for all supported > huge page sizes at all levels including contiguous bit implementation. > Following HugeTLB migration support matrix has been enabled with this > patch series. All permutations have been tested except for the 16GB. > > CONT PTEPMDCONT PMDPUD > ------ > 4K: 64K 2M 32M 1G > 16K: 2M32M 1G > 64K: 2M 512M 16G > > First the series adds migration support for PUD based huge pages. It > then adds a platform specific hook to query an architecture if a > given huge page size is supported for migration while also providing > a default fallback option preserving the existing semantics which just > checks for (PMD|PUD|PGDIR)_SHIFT macros. The last two patches enables > HugeTLB migration on arm64 and subscribe to this new platform specific > hook by defining an override. > > The second patch differentiates between movability and migratability > aspects of huge pages and implements hugepage_movable_supported() which > can then be used during allocation to decide whether to place the huge > page in movable zone or not. > > Changes in V2: > > - Added a new patch which differentiates migratability and movability > of huge pages and implements hugepage_movable_supported() function > as suggested by Michal Hocko. Hello Andrew/Michal/Mike/Naoya/Catalin, Just checking for an update. Does this series looks okay ? - Anshuman
Re: [PATCH] mm/thp: Correctly differentiate between mapped THP and PMD migration entry
On 11/06/2018 06:05 AM, Will Deacon wrote: > On Fri, Nov 02, 2018 at 11:45:00AM +0530, Anshuman Khandual wrote: >> On 10/17/2018 07:39 AM, Andrea Arcangeli wrote: >>> What we need to do during split is an invalidate of the huge TLB. >>> There's no pmd_trans_splitting anymore, so we only clear the present >>> bit in the PTE despite pmd_present still returns true (just like >>> PROT_NONE, nothing new in this respect). pmd_present never meant the >> >> On arm64, the problem is that pmd_present() is tied with pte_present() which >> checks for PTE_VALID (also PTE_PROT_NONE) but which gets cleared during PTE >> invalidation. pmd_present() returns false just after the first step of PMD >> splitting. So pmd_present() needs to be decoupled from PTE_VALID which is >> same as PMD_SECT_VALID and instead should depend upon a pte bit which sticks >> around like PAGE_PSE as in case of x86. I am working towards a solution. > > Could we not just go via a PROT_NONE mapping during the split, instead of > having to allocate a new software bit to treat these invalid ptes as > present? The problem might occur during page fault (i.e __handle_mm_fault). As discussed previously on this thread any potential PTE sticky bit would be used for both pmd_trans_huge() and pmd_present() wrappers to maintain existing semantics. At present, PMD state analysis during page fault has conditional block like this. if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) { if (pmd_protnone(orig_pmd) && vma_is_accessible(vma)) return do_huge_pmd_numa_page(&vmf, orig_pmd); Using PROT_NONE for pmd_trans_huge() might force PMD page fault to go through NUMA fault handling all the time as both pmd_trans_huge() and pmd_protnone() will return true in that situation.
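One possible direction, sketched here purely for illustration (the bit name and position are invented), would be a dedicated software PMD bit that survives pmdp_invalidate(), similar in spirit to _PAGE_PSE on x86, instead of overloading PROT_NONE:

#define PMD_PRESENT_INVALID	(_AT(pmdval_t, 1) << 59)	/* sw bit, only set when !PMD_SECT_VALID */

static inline int pmd_present(pmd_t pmd)
{
	return pte_present(pmd_pte(pmd)) ||
	       (pmd_val(pmd) & PMD_PRESENT_INVALID);
}

static inline pmd_t pmd_mknotpresent(pmd_t pmd)
{
	return __pmd((pmd_val(pmd) | PMD_PRESENT_INVALID) & ~PMD_SECT_VALID);
}

With something along these lines, pmd_present() keeps returning true across the first step of the PMD split, pmd_trans_huge() can stay keyed off the table bit, and pmd_protnone() is left untouched so the do_huge_pmd_numa_page() path quoted above is not confused.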
Re: [RFC PATCH 4/5] mm, memory_hotplug: print reason for the offlining failure
On 11/07/2018 03:48 PM, Michal Hocko wrote: > From: Michal Hocko > > The memory offlining failure reporting is inconsistent and insufficient. > Some error paths simply do not report the failure to the log at all. > When we do report there are no details about the reason of the failure > and there are several of them which makes memory offlining failures > hard to debug. > > Make sure that the > memory offlining [mem %#010llx-%#010llx] failed > message is printed for all failures and also provide a short textual > reason for the failure e.g. > > [ 1984.506184] rac1 kernel: memory offlining [mem > 0x826-0x8267fff] failed due to signal backoff > > this tells us that the offlining has failed because of a signal pending > aka user intervention. > > Signed-off-by: Michal Hocko It might help to enumerate these failure reason strings and use macros.
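As an example of what that could look like (sketch only; apart from "signal backoff", which is taken from the changelog above, the reason strings and variable names here are invented):

enum offline_failure_reason {
	OFFLINE_FAILURE_ISOLATION,
	OFFLINE_FAILURE_NOTIFIER,
	OFFLINE_FAILURE_SIGNAL,
	OFFLINE_FAILURE_UNMOVABLE,
};

static const char * const offline_failure_names[] = {
	[OFFLINE_FAILURE_ISOLATION]	= "failure to isolate range",
	[OFFLINE_FAILURE_NOTIFIER]	= "notifier failure",
	[OFFLINE_FAILURE_SIGNAL]	= "signal backoff",
	[OFFLINE_FAILURE_UNMOVABLE]	= "unmovable page",
};

	pr_info("memory offlining [mem %#010llx-%#010llx] failed due to %s\n",
		(unsigned long long)start, (unsigned long long)end,
		offline_failure_names[reason]);

That keeps the strings in one place and lets the individual failure paths pass an enum value instead of an ad-hoc string.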
Re: [RFC PATCH 5/5] mm, memory_hotplug: be more verbose for memory offline failures
On 11/07/2018 03:48 PM, Michal Hocko wrote: > From: Michal Hocko > > There is only very limited information printed when the memory offlining > fails: > [ 1984.506184] rac1 kernel: memory offlining [mem > 0x826-0x8267fff] failed due to signal backoff > > This tells us that the failure is triggered by the userspace > intervention but it doesn't tell us much more about the underlying > reason. It might be that the page migration failes repeatedly and the > userspace timeout expires and send a signal or it might be some of the > earlier steps (isolation, memory notifier) takes too long. > > If the migration failes then it would be really helpful to see which > page that and its state. The same applies to the isolation phase. If we > fail to isolate a page from the allocator then knowing the state of the > page would be helpful as well. > > Dump the page state that fails to get isolated or migrated. This will > tell us more about the failure and what to focus on during debugging. > > Signed-off-by: Michal Hocko > --- > mm/memory_hotplug.c | 12 > mm/page_alloc.c | 1 + > 2 files changed, 9 insertions(+), 4 deletions(-) > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > index 1badac89c58e..bf214beccda3 100644 > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -1388,10 +1388,8 @@ do_migrate_range(unsigned long start_pfn, unsigned > long end_pfn) > page_is_file_cache(page)); > > } else { > -#ifdef CONFIG_DEBUG_VM > - pr_alert("failed to isolate pfn %lx\n", pfn); > + pr_warn("failed to isolate pfn %lx\n", pfn)> > dump_page(page, "isolation failed"); > -#endif > put_page(page); > /* Because we don't have big zone->lock. we should > check this again here. */ > @@ -1411,8 +1409,14 @@ do_migrate_range(unsigned long start_pfn, unsigned > long end_pfn) > /* Allocate a new page from the nearest neighbor node */ > ret = migrate_pages(&source, new_node_page, NULL, 0, > MIGRATE_SYNC, MR_MEMORY_HOTPLUG); > - if (ret) > + if (ret) { > + list_for_each_entry(page, &source, lru) { > + pr_warn("migrating pfn %lx failed ", > +page_to_pfn(page), ret); Seems like pr_warn() needs to have %d in here to print 'ret'. Though dumping return code from migrate_pages() makes sense, wondering if it is required for each and every page which failed to migrate here or just one instance is enough. > + dump_page(page, NULL); > + } s/NULL/failed to migrate/ for dump_page(). > putback_movable_pages(&source); > + } > } > out: > return ret; > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index a919ba5cb3c8..23267767bf98 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -7845,6 +7845,7 @@ bool has_unmovable_pages(struct zone *zone, struct page > *page, int count, > return false; > unmovable: > WARN_ON_ONCE(zone_idx(zone) == ZONE_MOVABLE); > + dump_page(pfn_to_page(pfn+iter), "has_unmovable_pages"); s/has_unmovable_pages/is unmovable/ If we eally care about the function name, then dump_page() should be followed by dump_stack() like the case in some other instances. > return true; This will be dumped from HugeTLB and CMA allocation paths as well through alloc_contig_range(). But it should be okay as those occurrences should be rare and dumping page state then will also help.
Re: [RFC PATCH 5/5] mm, memory_hotplug: be more verbose for memory offline failures
On 11/08/2018 01:42 PM, Michal Hocko wrote: > On Thu 08-11-18 12:46:47, Anshuman Khandual wrote: >> >> >> On 11/07/2018 03:48 PM, Michal Hocko wrote: > [...] >>> @@ -1411,8 +1409,14 @@ do_migrate_range(unsigned long start_pfn, unsigned >>> long end_pfn) >>> /* Allocate a new page from the nearest neighbor node */ >>> ret = migrate_pages(&source, new_node_page, NULL, 0, >>> MIGRATE_SYNC, MR_MEMORY_HOTPLUG); >>> - if (ret) >>> + if (ret) { >>> + list_for_each_entry(page, &source, lru) { >>> + pr_warn("migrating pfn %lx failed ", >>> + page_to_pfn(page), ret); >> >> Seems like pr_warn() needs to have %d in here to print 'ret'. > > Dohh. Rebase hickup. You are right ret:%d got lost on the way. > >> Though >> dumping return code from migrate_pages() makes sense, wondering if >> it is required for each and every page which failed to migrate here >> or just one instance is enough. > > Does it matter enough to special case one printk? I just imagined how a bunch of prints will look like when multiple pages failed to migrate probably for the same reason. But I guess its okay. > >>> + dump_page(page, NULL); >>> + } >> >> s/NULL/failed to migrate/ for dump_page(). > > Yes, makes sense. > >> >>> putback_movable_pages(&source); >>> + } >>> } >>> out: >>> return ret; >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >>> index a919ba5cb3c8..23267767bf98 100644 >>> --- a/mm/page_alloc.c >>> +++ b/mm/page_alloc.c >>> @@ -7845,6 +7845,7 @@ bool has_unmovable_pages(struct zone *zone, struct >>> page *page, int count, >>> return false; >>> unmovable: >>> WARN_ON_ONCE(zone_idx(zone) == ZONE_MOVABLE); >>> + dump_page(pfn_to_page(pfn+iter), "has_unmovable_pages"); >> >> s/has_unmovable_pages/is unmovable/ > > OK > >> If we eally care about the function name, then dump_page() should be >> followed by dump_stack() like the case in some other instances. >> >>> return true; >> >> This will be dumped from HugeTLB and CMA allocation paths as well through >> alloc_contig_range(). But it should be okay as those occurrences should be >> rare and dumping page state then will also help. > > yes > > Thanks and here is the incremental fix: > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > index bf214beccda3..820397e18e59 100644 > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -1411,9 +1411,9 @@ do_migrate_range(unsigned long start_pfn, unsigned long > end_pfn) > MIGRATE_SYNC, MR_MEMORY_HOTPLUG); > if (ret) { > list_for_each_entry(page, &source, lru) { > - pr_warn("migrating pfn %lx failed ", > + pr_warn("migrating pfn %lx failed ret:%d ", > page_to_pfn(page), ret); > - dump_page(page, NULL); > + dump_page(page, "migration failure"); > } > putback_movable_pages(&source); > } > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 23267767bf98..ec2c7916dc2d 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -7845,7 +7845,7 @@ bool has_unmovable_pages(struct zone *zone, struct page > *page, int count, > return false; > unmovable: > WARN_ON_ONCE(zone_idx(zone) == ZONE_MOVABLE); > - dump_page(pfn_to_page(pfn+iter), "has_unmovable_pages"); > + dump_page(pfn_to_page(pfn+iter), "unmovable page"); > return true; > } It looks good.
Re: [PATCH v2] of, numa: Validate some distance map rules
On 11/08/2018 03:47 PM, John Garry wrote: > Currently the NUMA distance map parsing does not validate the distance > table for the distance-matrix rules 1-2 in [1]. Right. > > However the arch NUMA code may enforce some of these rules, but not all. Either it does enforce all these rules again or it does not. What is the purpose it serves when enforcing parts of it ! Rather it should focus on looking for boundary conditions (e.g maximum number of nodes supported on a platform against whats present on the table) which are important from kernel point of view. > Such is the case for the arm64 port, which does not enforce the rule that > the distance between separates nodes cannot equal LOCAL_DISTANCE. > > The patch adds the following rules validation: > - distance of node to self equals LOCAL_DISTANCE > - distance of separate nodes > LOCAL_DISTANCE Which exactly corresponds to first two rules as mentioned in the DT doc (Documents/devicetree/bindings/numa.txt) 1. Each entry represents distance from first node to second node. The distances are equal in either direction. 2. The distance from a node to self (local distance) is represented with value 10 and all internode distance should be represented with a value greater than 10. > > This change avoids a yet-unresolved crash reported in [2]. > > A note on dealing with symmetrical distances between nodes: > > Validating symmetrical distances between nodes is difficult. If it were > mandated in the bindings that every distance must be recorded in the > table, then it would be easy. However, it isn't. But if [a, b] and [b, a] both are present, they must be equal. > > In addition to this, it is also possible to record [b, a] distance only > (and not [a, b]). So, when processing the table for [b, a], we cannot > assert that current distance of [a, b] != [b, a] as invalid, as [a, b] > distance may not be present in the table and current distance would be > default at REMOTE_DISTANCE. Falling back to REMOTE_DISTANCE is right in case one of the symmetrical distances is not present. But it must abort if they are present and not identical. That rule should be enforced here while processing DT. > > As such, we maintain the policy that we overwrite distance [a, b] = [b, a] > for b > a. This policy is different to kernel ACPI SLIT validation, which > allows non-symmetrical distances (ACPI spec SLIT rules allow it). However, > the distance debug message is dropped as it may be misleading (for a distance > which is later overwritten). We could override but if the DT passed table is inconsistent then it must abort as non-symmetrical distance is not supported. Hence overriding the distance should happen only in case the inverse set is absent (probably outside the first loop). > > Some final notes on semantics: > > - It is implied that it is the responsibility of the arch NUMA code to > reset the NUMA distance map for an error in distance map parsing. Right. > > - It is the responsibility of the FW NUMA topology parsing (whether OF or > ACPI) to enforce NUMA distance rules, and not arch NUMA code. This is where we still have some inconsistencies. Arch NUMA code on arm64 should either enforce rule 1 and 2 as stated above or should not at all. As discussed previously numa_set_distance() enforces first part of rule 2 (self distance should be 10) but not the second part of rule 2 (internode distance > 10). 
So either the following condition should be dropped from (from == to && distance != LOCAL_DISTANCE) or the following condition should be added to (from != to && distance <= LOCAL_DISTANCE) numa_set_distance() in order to conform to the semantics described above. But thats for a separate discussion how much and what should be enforced in any given DT compatible arch NUMA node. May not be part of this patch. > > [1] Documents/devicetree/bindings/numa.txt > [2] https://www.spinics.net/lists/arm-kernel/msg683304.html > > Cc: sta...@vger.kernel.org # 4.7 > Signed-off-by: John Garry > Acked-by: Will Deacon > --- > Changes from v1: > - Add note on masking crash reported > - Add Will's tag > - Add cc stable > > diff --git a/drivers/of/of_numa.c b/drivers/of/of_numa.c > index 35c64a4295e0..fe6b13608e51 100644 > --- a/drivers/of/of_numa.c > +++ b/drivers/of/of_numa.c > @@ -104,9 +104,14 @@ static int __init of_numa_parse_distance_map_v1(struct > device_node *map) > distance = of_read_number(matrix, 1); > matrix++; > > + if ((nodea == nodeb && distance != LOCAL_DISTANCE) || > + (nodea != nodeb && distance <= LOCAL_DISTANCE)) { > + pr_err("Invalid distance[node%d -> node%d] = %d\n", > +nodea, nodeb, distance); > + return -EINVAL; > + } > + > numa_set_distance(nodea, nodeb, distance); > - pr_debug("distance[node%d -> node%d] = %d\n", > -
Re: [PATCH] arm64/numa: Add more vetting in numa_set_distance()
On 11/01/2018 04:57 PM, Will Deacon wrote: > [ Nit: Please wrap your lines when replying -- I've done it for you here ] > > On Tue, Oct 30, 2018 at 08:16:21AM +0530, Anshuman Khandual wrote: >> On 10/29/2018 08:14 PM, John Garry wrote: >>>>>>> I think we should either factor out the sanity check >>>>>>>> into a core helper or make the core code robust to these funny >>>>>>>> configurations. >>>>>>> >>>>>>> OK, so to me it would make sense to factor out a sanity check into a >>>>>>> core >>>>>>> helper. >>>>>> >>>>>> That, or have the OF code perform the same validation that slit_valid() >>>>>> is >>>>>> doing for ACPI. I'm just trying to avoid other architectures running into >>>>>> this problem down the line. >>>>>> >>>>> >>>>> Right, OF code should do this validation job if ACPI is doing it >>>>> (especially since the DT bindings actually specify the distance >>>>> rules), and not rely on the arch NUMA code to accept/reject >>>>> numa_set_distance() combinations. >>>> >>>> I would say this particular condition checking still falls under arch NUMA >>>> init >>>> code sanity check like other basic tests what numa_set_distance() >>>> currently does >>>> already but it should not be a necessity for the OF driver to check these. >>> >>> The checks in the arch NUMA code mean that invalid inter-node distance >>> combinations are ignored. >> >> Right and should not this new test (from != to && distance == >> LOCAL_DISTANCE) be >> one of them as well ? numa_set_distance() updates the table or just throws >> some >> warnings while skipping entries it deems invalid. It would be okay to have >> this >> new check there in addition to others like this patch suggests. I missed this response due to some problem in my mail client and re-iterated some of the same points again on the V2 (https://lkml.org/lkml/2018/11/8/734). My apologies for the same. > > If we're changing numa_set_distance(), I'd actually be inclined to do the > opposite of what you're suggesting and move as much of this boilerplate > checking into the core code. Perhaps we could add something like > check_fw_numa_topology() and have both ACPI and OF call that so that the > arch doesn't need to worry about malformed tables at all. Right although I doubt that we could ever have a common check_fw_numa_topology() check as the semantics for the table and individual entries there in for DT and ACPI might be different. But I agree what you are saying. > >>> However, if any entries in the table are invalid, then the whole table >>> can be discarded as none of it can be believed, i.e. it's better to >>> validate the table. >>> >> >> Agreed. slit_valid() on the ACPI parsing is currently enforcing that before >> acpi_numa_slit_init() which would call into numa_set_distance(). Hence arch >> NUMA code numa_set_distance() never had the opportunity to do the sanity >> checks as ACPI slit_valid() has completely invalidated the table. >> >> Unlike ACPI path, of_numa_parse_distance_map_v1() does not do any sanity >> checks on the distance values parse from the "distance-matrix" property >> and all the checks directly falls on numa_set_distance(). This needs to >> be fixed in line with ACPI >> >> * If (to == from) ---> distance = LOCAL_DISTANCE >> * If (to != from) ---> distance > LOCAL_DISTANCE >> >> At the same time its okay to just enhance numa_set_distance() test coverage >> to include this new test. 
If we would have trusted firmware parsing all the >> way, existing basic checks about node range, distance stuff should not have >> been there in numa_set_distance(). Hence IMHO even if we fix the OF driver >> part, we should include this new check there as well. > > I don't see what we gain by duplicating the check. In fact, it has a few > downsides: > > (1) It confuses the semantics of the API, because it is no longer clear > who "owns" the check > > (2) It duplicates code in each architecture > > (3) Some clever-cloggs will remove at least some of the duplication in > future Agreed, it has down sides. > > I'm not willing to accept the check in the arm64 code if we update the > OF code. Okay, we should remove them instead. > > I think the way forward here is for John to fix the crash he reported by > adding the check to the OF code. If somebody wants to follow up with > subsequent patches to move more of the checking out of the arch code, then > we can review that as a separate series. Okay. Anyways I had some other comments related to semantics checking at the OF driver level but I can probably send them out as a separate patch later. Once again it was my bad to miss this response in the first place.
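Just to capture what a shared helper could look like, here is a rough shape for it (the name is taken from Will's suggestion, everything else is an invented sketch; it assumes both parsers can hand over the parsed distances as a flat nr_nodes x nr_nodes table):

/*
 * Reject a firmware provided distance table that breaks the rules
 * common to both ACPI and DT:
 *  - a node is at LOCAL_DISTANCE from itself
 *  - distinct nodes are farther apart than LOCAL_DISTANCE
 * Symmetry is deliberately not checked here: the DT binding requires
 * it while the ACPI SLIT rules do not, so that part would stay with
 * the firmware specific parser.
 */
bool __init check_fw_numa_topology(const u8 *table, int nr_nodes)
{
	int from, to;

	for (from = 0; from < nr_nodes; from++) {
		for (to = 0; to < nr_nodes; to++) {
			u8 distance = table[from * nr_nodes + to];

			if (from == to && distance != LOCAL_DISTANCE)
				return false;
			if (from != to && distance <= LOCAL_DISTANCE)
				return false;
		}
	}
	return true;
}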
[PATCH V2 1/5] mm/hugetlb: Enable PUD level huge page migration
Architectures like arm64 have PUD level HugeTLB pages for certain configs
(1GB huge page is PUD based on ARM64_4K_PAGES base page size) that can be
enabled for migration. It can be achieved through checking for PUD_SHIFT
order based HugeTLB pages during migration.

Signed-off-by: Anshuman Khandual
---
 include/linux/hugetlb.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 6b68e34..9c1b77f 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -483,7 +483,8 @@ static inline bool hugepage_migration_supported(struct hstate *h)
 {
 #ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
 	if ((huge_page_shift(h) == PMD_SHIFT) ||
-	    (huge_page_shift(h) == PGDIR_SHIFT))
+	    (huge_page_shift(h) == PUD_SHIFT) ||
+	    (huge_page_shift(h) == PGDIR_SHIFT))
 		return true;
 	else
 		return false;
--
2.7.4
[PATCH V2 0/5] arm64/mm: Enable HugeTLB migration
This patch series enables HugeTLB migration support for all supported
huge page sizes at all levels including contiguous bit implementation.
The following HugeTLB migration support matrix has been enabled with
this patch series. All permutations have been tested except for the 16GB.

         CONT PTE    PMD    CONT PMD    PUD
         --------    ---    --------    ---
  4K:       64K       2M       32M       1G
  16K:       2M      32M        1G
  64K:       2M     512M       16G

First the series adds migration support for PUD based huge pages. It
then adds a platform specific hook to query an architecture if a
given huge page size is supported for migration while also providing
a default fallback option preserving the existing semantics which just
checks for (PMD|PUD|PGDIR)_SHIFT macros. The last two patches enable
HugeTLB migration on arm64 and subscribe to this new platform specific
hook by defining an override.

The second patch differentiates between movability and migratability
aspects of huge pages and implements hugepage_movable_supported() which
can then be used during allocation to decide whether to place the huge
page in movable zone or not.

Changes in V2:

- Added a new patch which differentiates migratability and movability
  of huge pages and implements hugepage_movable_supported() function
  as suggested by Michal Hocko.

Anshuman Khandual (5):
  mm/hugetlb: Enable PUD level huge page migration
  mm/hugetlb: Distinguish between migratability and movability
  mm/hugetlb: Enable arch specific huge page size support for migration
  arm64/mm: Enable HugeTLB migration
  arm64/mm: Enable HugeTLB migration for contiguous bit HugeTLB pages

 arch/arm64/Kconfig               |  4
 arch/arm64/include/asm/hugetlb.h |  5 +
 arch/arm64/mm/hugetlbpage.c      | 20 +
 include/linux/hugetlb.h          | 48 +---
 mm/hugetlb.c                     |  2 +-
 5 files changed, 75 insertions(+), 4 deletions(-)

--
2.7.4
[PATCH V2 4/5] arm64/mm: Enable HugeTLB migration
Let arm64 subscribe to generic HugeTLB page migration framework. Right
now this only works on the following PMD and PUD level HugeTLB page
sizes with various kernel base page size combinations.

         CONT PTE    PMD    CONT PMD    PUD
         --------    ---    --------    ---
  4K:       NA        2M       NA        1G
  16K:      NA       32M       NA
  64K:      NA      512M       NA

Signed-off-by: Anshuman Khandual
---
 arch/arm64/Kconfig | 4
 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1b1a0e9..e54350f 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1305,6 +1305,10 @@ config SYSVIPC_COMPAT
 	def_bool y
 	depends on COMPAT && SYSVIPC
 
+config ARCH_ENABLE_HUGEPAGE_MIGRATION
+	def_bool y
+	depends on HUGETLB_PAGE && MIGRATION
+
 menu "Power management options"
 
 source "kernel/power/Kconfig"
--
2.7.4
[PATCH V2 2/5] mm/hugetlb: Distinguish between migratability and movability
During huge page allocation its migratability is checked to determine
if it should be placed under movable zones with GFP_HIGHUSER_MOVABLE.
But the movability aspect of the huge page could depend on other
factors than just migratability. Movability in itself is a distinct
property which should not be tied with migratability alone.

This differentiates these two and implements an enhanced movability
check which also considers huge page size to determine if it is
feasible to be placed under a movable zone. At present it just checks
for gigantic pages but going forward it can incorporate other enhanced
checks.

Suggested-by: Michal Hocko
Signed-off-by: Anshuman Khandual
---
 include/linux/hugetlb.h | 30 ++
 mm/hugetlb.c            |  2 +-
 2 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 9c1b77f..456cb60 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -493,6 +493,31 @@ static inline bool hugepage_migration_supported(struct hstate *h)
 #endif
 }
 
+/*
+ * Movability check is different as compared to migration check.
+ * It determines whether or not a huge page should be placed on
+ * movable zone or not. Movability of any huge page should be
+ * required only if huge page size is supported for migration.
+ * There wont be any reason for the huge page to be movable if
+ * it is not migratable to start with. Also the size of the huge
+ * page should be large enough to be placed under a movable zone
+ * and still feasible enough to be migratable. Just the presence
+ * in movable zone does not make the migration feasible.
+ *
+ * So even though large huge page sizes like the gigantic ones
+ * are migratable they should not be movable because its not
+ * feasible to migrate them from movable zone.
+ */
+static inline bool hugepage_movable_supported(struct hstate *h)
+{
+	if (!hugepage_migration_supported(h))
+		return false;
+
+	if (hstate_is_gigantic(h))
+		return false;
+	return true;
+}
+
 static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
 					struct mm_struct *mm, pte_t *pte)
 {
@@ -589,6 +614,11 @@ static inline bool hugepage_migration_supported(struct hstate *h)
 	return false;
 }
 
+static inline bool hugepage_movable_supported(struct hstate *h)
+{
+	return false;
+}
+
 static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
 					struct mm_struct *mm, pte_t *pte)
 {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3c21775..a5a111d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -919,7 +919,7 @@ static struct page *dequeue_huge_page_nodemask(struct hstate *h, gfp_t gfp_mask,
 /* Movability of hugepages depends on migration support. */
 static inline gfp_t htlb_alloc_mask(struct hstate *h)
 {
-	if (hugepage_migration_supported(h))
+	if (hugepage_movable_supported(h))
 		return GFP_HIGHUSER_MOVABLE;
 	else
 		return GFP_HIGHUSER;
--
2.7.4
[PATCH V2 3/5] mm/hugetlb: Enable arch specific huge page size support for migration
Architectures like arm64 have HugeTLB page sizes which are different
than generic sizes at PMD, PUD, PGD level and implemented via
contiguous bits. At present these special size HugeTLB pages cannot be
identified through macros like (PMD|PUD|PGDIR)_SHIFT and hence are
chosen not to be migrated.

Enabling migration support for these special HugeTLB page sizes along
with the generic ones (PMD|PUD|PGD) would require identifying all of
them on a given platform. A platform specific hook can precisely
enumerate all huge page sizes supported for migration. Instead of
comparing against standard huge page orders let
hugepage_migration_supported() call a platform hook
arch_hugetlb_migration_supported(). The default definition for the
platform hook maintains existing semantics which checks standard huge
page order. But an architecture can choose to override the default and
provide support for a comprehensive set of huge page sizes.

Signed-off-by: Anshuman Khandual
---
 include/linux/hugetlb.h | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 456cb60..97a2fdb 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -479,18 +479,29 @@ static inline pgoff_t basepage_index(struct page *page)
 extern int dissolve_free_huge_page(struct page *page);
 extern int dissolve_free_huge_pages(unsigned long start_pfn,
 				    unsigned long end_pfn);
-static inline bool hugepage_migration_supported(struct hstate *h)
-{
+
 #ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
+#ifndef arch_hugetlb_migration_supported
+static inline bool arch_hugetlb_migration_supported(struct hstate *h)
+{
 	if ((huge_page_shift(h) == PMD_SHIFT) ||
 	    (huge_page_shift(h) == PUD_SHIFT) ||
 	    (huge_page_shift(h) == PGDIR_SHIFT))
 		return true;
 	else
 		return false;
+}
+#endif
 #else
+static inline bool arch_hugetlb_migration_supported(struct hstate *h)
+{
 	return false;
+}
 #endif
+
+static inline bool hugepage_migration_supported(struct hstate *h)
+{
+	return arch_hugetlb_migration_supported(h);
 }
 
 /*
--
2.7.4
[PATCH V2 5/5] arm64/mm: Enable HugeTLB migration for contiguous bit HugeTLB pages
Let arm64 subscribe to the previously added framework in which
architecture can inform whether a given huge page size is supported
for migration. This just overrides the default function
arch_hugetlb_migration_supported() and enables migration for all
possible HugeTLB page sizes on arm64. With this, HugeTLB migration
support on arm64 now covers all possible HugeTLB options.

         CONT PTE    PMD    CONT PMD    PUD
         --------    ---    --------    ---
  4K:       64K       2M       32M       1G
  16K:       2M      32M        1G
  64K:       2M     512M       16G

Signed-off-by: Anshuman Khandual
---
 arch/arm64/include/asm/hugetlb.h |  5 +
 arch/arm64/mm/hugetlbpage.c      | 20
 2 files changed, 25 insertions(+)

diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h
index e73f685..656f70e 100644
--- a/arch/arm64/include/asm/hugetlb.h
+++ b/arch/arm64/include/asm/hugetlb.h
@@ -20,6 +20,11 @@
 #include
 
+#ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
+#define arch_hugetlb_migration_supported arch_hugetlb_migration_supported
+extern bool arch_hugetlb_migration_supported(struct hstate *h);
+#endif
+
 static inline pte_t huge_ptep_get(pte_t *ptep)
 {
 	return READ_ONCE(*ptep);
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 4eafd9f..28f4795 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -27,6 +27,26 @@
 #include
 #include
 
+#ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
+bool arch_hugetlb_migration_supported(struct hstate *h)
+{
+	size_t pagesize = huge_page_size(h);
+
+	switch (pagesize) {
+#ifdef CONFIG_ARM64_4K_PAGES
+	case PUD_SIZE:
+#endif
+	case PMD_SIZE:
+	case CONT_PMD_SIZE:
+	case CONT_PTE_SIZE:
+		return true;
+	}
+	pr_warn("%s: unrecognized huge page size 0x%lx\n",
+		__func__, pagesize);
+	return false;
+}
+#endif
+
 int pmd_huge(pmd_t pmd)
 {
 	return pmd_val(pmd) && !(pmd_val(pmd) & PMD_TABLE_BIT);
--
2.7.4
Re: [PATCH] mm/thp: Correctly differentiate between mapped THP and PMD migration entry
On 10/10/2018 06:13 PM, Zi Yan wrote: > On 10 Oct 2018, at 0:05, Anshuman Khandual wrote: > >> On 10/09/2018 07:28 PM, Zi Yan wrote: >>> cc: Naoya Horiguchi (who proposed to use !_PAGE_PRESENT && !_PAGE_PSE for >>> x86 >>> PMD migration entry check) >>> >>> On 8 Oct 2018, at 23:58, Anshuman Khandual wrote: >>> >>>> A normal mapped THP page at PMD level should be correctly differentiated >>>> from a PMD migration entry while walking the page table. A mapped THP would >>>> additionally check positive for pmd_present() along with pmd_trans_huge() >>>> as compared to a PMD migration entry. This just adds a new conditional test >>>> differentiating the two while walking the page table. >>>> >>>> Fixes: 616b8371539a6 ("mm: thp: enable thp migration in generic path") >>>> Signed-off-by: Anshuman Khandual >>>> --- >>>> On X86, pmd_trans_huge() and is_pmd_migration_entry() are always mutually >>>> exclusive which makes the current conditional block work for both mapped >>>> and migration entries. This is not same with arm64 where pmd_trans_huge() >>> >>> !pmd_present() && pmd_trans_huge() is used to represent THPs under >>> splitting, >> >> Not really if we just look at code in the conditional blocks. > > Yeah, I explained it wrong above. Sorry about that. > > In x86, pmd_present() checks (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE), > thus, it returns true even if the present bit is cleared but PSE bit is set. Okay. > This is done so, because THPs under splitting are regarded as present in the > kernel > but not present when a hardware page table walker checks it. Okay. > > For PMD migration entry, which should be regarded as not present, if PSE bit > is set, which makes pmd_trans_huge() returns true, like ARM64 does, all > PMD migration entries will be regarded as present Okay to make pmd_present() return false pmd_trans_huge() has to return false as well. Is there anything which can be done to get around this problem on X86 ? pmd_trans_huge() returning true for a migration entry sounds logical. Otherwise we would revert the condition block order to accommodate both the implementation for pmd_trans_huge() as suggested by Kirill before or just consider this patch forward. Because I am not really sure yet about the idea of getting pmd_present() check into pmd_trans_huge() on arm64 just to make it fit into this semantics as suggested by Will. If a PMD is trans huge page or not should not depend on whether it is present or not. > > My concern is that if ARM64’s pmd_trans_huge() returns true for migration > entries, unlike x86, there might be bugs triggered in the kernel when > THP migration is enabled in ARM64. Right and that is exactly what we are trying to fix with this patch. > > Let me know if I explain this clear to you. > >> >>> since _PAGE_PRESENT is cleared during THP splitting but _PAGE_PSE is not. 
>>> See the comment in pmd_present() for x86, in arch/x86/include/asm/pgtable.h >> >> >> if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) { >> pvmw->ptl = pmd_lock(mm, pvmw->pmd); >> if (likely(pmd_trans_huge(*pvmw->pmd))) { >> if (pvmw->flags & PVMW_MIGRATION) >> return not_found(pvmw); >> if (pmd_page(*pvmw->pmd) != page) >> return not_found(pvmw); >> return true; >> } else if (!pmd_present(*pvmw->pmd)) { >> if (thp_migration_supported()) { >> if (!(pvmw->flags & PVMW_MIGRATION)) >> return not_found(pvmw); >> if >> (is_migration_entry(pmd_to_swp_entry(*pvmw->pmd))) { >> swp_entry_t entry = >> pmd_to_swp_entry(*pvmw->pmd); >> >> if (migration_entry_to_page(entry) >> != page) >> return not_found(pvmw); >> return true; >> } >> } >> return not_found(pvmw); >> } else { >> /* THP pmd was split under us: handle on pte level */ >> spin_unlock(pvmw->ptl); >> pvmw->ptl =
Re: [PATCH] mm/thp: Correctly differentiate between mapped THP and PMD migration entry
On 10/09/2018 06:48 PM, Will Deacon wrote: > On Tue, Oct 09, 2018 at 04:04:21PM +0300, Kirill A. Shutemov wrote: >> On Tue, Oct 09, 2018 at 09:28:58AM +0530, Anshuman Khandual wrote: >>> A normal mapped THP page at PMD level should be correctly differentiated >>> from a PMD migration entry while walking the page table. A mapped THP would >>> additionally check positive for pmd_present() along with pmd_trans_huge() >>> as compared to a PMD migration entry. This just adds a new conditional test >>> differentiating the two while walking the page table. >>> >>> Fixes: 616b8371539a6 ("mm: thp: enable thp migration in generic path") >>> Signed-off-by: Anshuman Khandual >>> --- >>> On X86, pmd_trans_huge() and is_pmd_migration_entry() are always mutually >>> exclusive which makes the current conditional block work for both mapped >>> and migration entries. This is not same with arm64 where pmd_trans_huge() >>> returns positive for both mapped and migration entries. Could some one >>> please explain why pmd_trans_huge() has to return false for migration >>> entries which just install swap bits and its still a PMD ? >> >> I guess it's just a design choice. Any reason why arm64 cannot do the >> same? > > Anshuman, would it work to: > > #define pmd_trans_huge(pmd) (pmd_present(pmd) && !(pmd_val(pmd) & > PMD_TABLE_BIT)) yeah this works but some how does not seem like the right thing to do but can be the very last option.
Re: [PATCH 1/4] mm/hugetlb: Enable PUD level huge page migration
On 10/03/2018 07:06 PM, Michal Hocko wrote: > On Wed 03-10-18 18:36:39, Anshuman Khandual wrote: > [...] >> So we have two checks here >> >> 1) platform specific arch_hugetlb_migration -> In principle go ahead >> >> 2) huge_movable() during allocation >> >> - If huge page does not have to be placed on movable zone >> >> - Allocate any where successfully and done ! >> >> - If huge page *should* be placed on a movable zone >> >> - Try allocating on movable zone >> >> - Successfull and done ! >> >> - If the new page could not be allocated on movable zone >> >> - Abort the migration completely >> >> OR >> >> - Warn and fall back to non-movable > > I guess you are still making it more complicated than necessary. The > later is really only about __GFP_MOVABLE at this stage. I would just > make it simple for now. We do not have to implement any dynamic > heuristic right now. All that I am asking for is to split the migrate > possible part from movable part. > > I should have been more clear about that I guess from my very first > reply. I do like how you moved the current coarse grained > hugepage_migration_supported to be more arch specific but I merely > wanted to point out that we need to do some other changes before we can > go that route and that thing is to distinguish movable from migration > supported. > > See my point? Does the following sound close enough to what you are looking for ? diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 9df1d59..070c419 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -504,6 +504,13 @@ static inline bool hugepage_migration_supported(struct hstate *h) return arch_hugetlb_migration_supported(h); } +static inline bool hugepage_movable_required(struct hstate *h) +{ + if (hstate_is_gigantic(h)) + return true; + return false; +} + static inline spinlock_t *huge_pte_lockptr(struct hstate *h, struct mm_struct *mm, pte_t *pte) { @@ -600,6 +607,11 @@ static inline bool hugepage_migration_supported(struct hstate *h) return false; } +static inline bool hugepage_movable_required(struct hstate *h) +{ + return false; +} + static inline spinlock_t *huge_pte_lockptr(struct hstate *h, struct mm_struct *mm, pte_t *pte) { diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 3c21775..8b0afdc 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1635,6 +1635,9 @@ struct page *alloc_huge_page_node(struct hstate *h, int nid) if (nid != NUMA_NO_NODE) gfp_mask |= __GFP_THISNODE; + if (hugepage_movable_required(h)) + gfp_mask |= __GFP_MOVABLE; + spin_lock(&hugetlb_lock); if (h->free_huge_pages - h->resv_huge_pages > 0) page = dequeue_huge_page_nodemask(h, gfp_mask, nid, NULL); @@ -1652,6 +1655,9 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid, { gfp_t gfp_mask = htlb_alloc_mask(h); + if (hugepage_movable_required(h)) + gfp_mask |= __GFP_MOVABLE; + spin_lock(&hugetlb_lock); if (h->free_huge_pages - h->resv_huge_pages > 0) { struct page *page;
Re: [PATCH] mm/thp: Correctly differentiate between mapped THP and PMD migration entry
On 10/18/2018 07:47 AM, Naoya Horiguchi wrote: > On Tue, Oct 16, 2018 at 10:31:50AM -0400, Zi Yan wrote: >> On 15 Oct 2018, at 0:06, Anshuman Khandual wrote: >> >>> On 10/15/2018 06:23 AM, Zi Yan wrote: >>>> On 12 Oct 2018, at 4:00, Anshuman Khandual wrote: >>>> >>>>> On 10/10/2018 06:13 PM, Zi Yan wrote: >>>>>> On 10 Oct 2018, at 0:05, Anshuman Khandual wrote: >>>>>> >>>>>>> On 10/09/2018 07:28 PM, Zi Yan wrote: >>>>>>>> cc: Naoya Horiguchi (who proposed to use !_PAGE_PRESENT && !_PAGE_PSE >>>>>>>> for x86 >>>>>>>> PMD migration entry check) >>>>>>>> >>>>>>>> On 8 Oct 2018, at 23:58, Anshuman Khandual wrote: >>>>>>>> >>>>>>>>> A normal mapped THP page at PMD level should be correctly >>>>>>>>> differentiated >>>>>>>>> from a PMD migration entry while walking the page table. A mapped THP >>>>>>>>> would >>>>>>>>> additionally check positive for pmd_present() along with >>>>>>>>> pmd_trans_huge() >>>>>>>>> as compared to a PMD migration entry. This just adds a new >>>>>>>>> conditional test >>>>>>>>> differentiating the two while walking the page table. >>>>>>>>> >>>>>>>>> Fixes: 616b8371539a6 ("mm: thp: enable thp migration in generic path") >>>>>>>>> Signed-off-by: Anshuman Khandual >>>>>>>>> --- >>>>>>>>> On X86, pmd_trans_huge() and is_pmd_migration_entry() are always >>>>>>>>> mutually >>>>>>>>> exclusive which makes the current conditional block work for both >>>>>>>>> mapped >>>>>>>>> and migration entries. This is not same with arm64 where >>>>>>>>> pmd_trans_huge() >>>>>>>> >>>>>>>> !pmd_present() && pmd_trans_huge() is used to represent THPs under >>>>>>>> splitting, >>>>>>> >>>>>>> Not really if we just look at code in the conditional blocks. >>>>>> >>>>>> Yeah, I explained it wrong above. Sorry about that. >>>>>> >>>>>> In x86, pmd_present() checks (_PAGE_PRESENT | _PAGE_PROTNONE | >>>>>> _PAGE_PSE), >>>>>> thus, it returns true even if the present bit is cleared but PSE bit is >>>>>> set. >>>>> >>>>> Okay. >>>>> >>>>>> This is done so, because THPs under splitting are regarded as present in >>>>>> the kernel >>>>>> but not present when a hardware page table walker checks it. >>>>> >>>>> Okay. >>>>> >>>>>> >>>>>> For PMD migration entry, which should be regarded as not present, if PSE >>>>>> bit >>>>>> is set, which makes pmd_trans_huge() returns true, like ARM64 does, all >>>>>> PMD migration entries will be regarded as present >>>>> >>>>> Okay to make pmd_present() return false pmd_trans_huge() has to return >>>>> false >>>>> as well. Is there anything which can be done to get around this problem on >>>>> X86 ? pmd_trans_huge() returning true for a migration entry sounds >>>>> logical. >>>>> Otherwise we would revert the condition block order to accommodate both >>>>> the >>>>> implementation for pmd_trans_huge() as suggested by Kirill before or just >>>>> consider this patch forward. >>>>> >>>>> Because I am not really sure yet about the idea of getting pmd_present() >>>>> check into pmd_trans_huge() on arm64 just to make it fit into this >>>>> semantics >>>>> as suggested by Will. If a PMD is trans huge page or not should not >>>>> depend on >>>>> whether it is present or not. >>>> >>>> In terms of THPs, we have three cases: a present THP, a THP under >>>> splitting, >>>> and a THP under migration. pmd_present() and pmd_trans_huge() both return >>>> true >>>> for a present THP and a THP u
Re: [PATCH] mm/thp: Correctly differentiate between mapped THP and PMD migration entry
On 10/17/2018 07:39 AM, Andrea Arcangeli wrote: > Hello Zi, > > On Sun, Oct 14, 2018 at 08:53:55PM -0400, Zi Yan wrote: >> Hi Andrea, what is the purpose/benefit of making x86’s pmd_present() returns >> true >> for a THP under splitting? Does it cause problems when ARM64’s pmd_present() >> returns false in the same situation? Thank you Andrea for a such a detailed explanation. It really helped us in understanding certain subtle details about pmd_present() & pmd_trans_huge(). > > !pmd_present means it's a migration entry or swap entry and doesn't > point to RAM. It means if you do pmd_to_page(*pmd) it will return you > an undefined result. Sure but this needs to be made clear some where. Not sure whether its better just by adding some in-code documentation or enforcing it in generic paths. > > During splitting the physical page is still very well pointed by the > pmd as long as pmd_trans_huge returns true and you hold the > pmd_lock. Agreed, it still does point to a huge page in RAM. So pmd_present() should just return true in such cases as you have explained above. > > pmd_trans_huge must be true at all times for a transhuge pmd that > points to a hugepage, or all VM fast paths won't serialize with the But as Naoya mentioned we should not check for pmd_trans_huge() on swap or migration entries. If this makes sense, I will be happy to look into this further and remove/replace pmd_trans_huge() check from affected code paths. > pmd_lock, that is the only reason why, and it's a very good reason > because it avoids to take the pmd_lock when walking over non transhuge > pmds (i.e. when there are no THP allocated). > > Now if we've to keep _PAGE_PSE set and return true in pmd_trans_huge > at all times, why would you want to make pmd_present return false? How > could it help if pmd_trans_huge returns true, but pmd_present returns > false despite pmd_to_page works fine and the pmd is really still > pointing to the page? Then what is the difference between pmd_trans_huge() and pmd_present() if both should return true if the PMD points to a huge page in RAM and pmd_page() also returns a valid huge page in RAM. > > When userland faults on such pmd !pmd_present it will make the page > fault take a swap or migration path, but that's the wrong path if the > pmd points to RAM. This is a real concern. __handle_mm_fault() does check for a swap entry (which can only be a migration entry at the moment) and then wait on till the migration is completed. if (unlikely(is_swap_pmd(orig_pmd))) { VM_BUG_ON(thp_migration_supported() && !is_pmd_migration_entry(orig_pmd)); if (is_pmd_migration_entry(orig_pmd)) pmd_migration_entry_wait(mm, vmf.pmd); return 0; } > > What we need to do during split is an invalidate of the huge TL> There's no > pmd_trans_splitting anymore, so we only clear the present > bit in the PTE despite pmd_present still returns true (just like > PROT_NONE, nothing new in this respect). pmd_present never meant the On arm64, the problem is that pmd_present() is tied with pte_present() which checks for PTE_VALID (also PTE_PROT_NONE) but which gets cleared during PTE invalidation. pmd_present() returns false just after the first step of PMD splitting. So pmd_present() needs to be decoupled from PTE_VALID which is same as PMD_SECT_VALID and instead should depend upon a pte bit which sticks around like PAGE_PSE as in case of x86. I am working towards a solution. > real present bit in the pte was set, it just means the pmd points to > RAM. 
It means it doesn't point to swap or migration entry and you can > do pmd_to_page and it works fine > We need to invalidate the TLB by clearing the present bit and by > flushing the TLB before overwriting the transhuge pmd with the regular > pte (i.e. to make it non huge). That is actually required by an errata > (l1 cache aliasing of the same mapping through two different TLB of > two different sizes broke some old CPU and triggered machine checks). > It's not something fundamentally necessary from a common code point of TLB entries mapping same VA -> PA space with different pages sizes might not co-exist with each other which requires TLB invalidation. PMD split phase initiating a TLB invalidation is not like getting around a CPU HW problem but its just that SW should not assume behavior on behalf of the architecture regarding which TLB entries can co-exist at any point. > view. It's more risky from an hardware (not software) standpoint and > before you can get rid of the pmd you need to do a TLB flush anyway to > be sure CPUs stops using it, so better clear the present bit before > doing the real costly thing (the tlb flush with IPIs). Clearing the > present bit during the TLB flush is a cost that gets lost in the noise. Doing TLB invalidation is not tied to whether present bit is m
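For illustration, one possible shape of the pmd_present() decoupling mentioned above is sketched here; the software bit name, its position and the helper names are placeholders invented for the sketch, not the actual arm64 change being worked on:

/*
 * Sketch only: keep a software bit that survives PMD invalidation so that
 * pmd_present() keeps returning true for a mapped-but-invalidated THP,
 * while a real migration/swap entry (which never carries the bit) stays
 * !pmd_present(). The bit number is arbitrary for this sketch.
 */
#define PMD_SW_PRESENT_INVALID  (_AT(pteval_t, 1) << 59)        /* placeholder */

static inline int pmd_present_sketch(pmd_t pmd)
{
        return pte_present(pmd_pte(pmd)) ||
               (pmd_val(pmd) & PMD_SW_PRESENT_INVALID);
}

static inline pmd_t pmd_mkinvalid_sketch(pmd_t pmd)
{
        /* Clear the hardware valid bit, but remember the entry is mapped. */
        return __pmd((pmd_val(pmd) | PMD_SW_PRESENT_INVALID) & ~PMD_SECT_VALID);
}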
[PATCH V3 0/5] arm64/mm: Enable HugeTLB migration
This patch series enables HugeTLB migration support for all supported huge page sizes at all levels including contiguous bit implementation. Following HugeTLB migration support matrix has been enabled with this patch series. All permutations have been tested except for the 16GB. CONT PTEPMDCONT PMDPUD ------ 4K: 64K 2M 32M 1G 16K: 2M32M 1G 64K: 2M 512M 16G First the series adds migration support for PUD based huge pages. It then adds a platform specific hook to query an architecture if a given huge page size is supported for migration while also providing a default fallback option preserving the existing semantics which just checks for (PMD|PUD|PGDIR)_SHIFT macros. The last two patches enables HugeTLB migration on arm64 and subscribe to this new platform specific hook by defining an override. The second patch differentiates between movability and migratability aspects of huge pages and implements hugepage_movable_supported() which can then be used during allocation to decide whether to place the huge page in movable zone or not. Changes in V3: - Re-ordered patches 1 and 2 per Michal - s/Movability/Migratability/ in unmap_and_move_huge_page() per Naoya Changes in V2: (https://lkml.org/lkml/2018/10/12/190) - Added a new patch which differentiates migratability and movability of huge pages and implements hugepage_movable_supported() function as suggested by Michal Hocko. Anshuman Khandual (5): mm/hugetlb: Distinguish between migratability and movability mm/hugetlb: Enable PUD level huge page migration mm/hugetlb: Enable arch specific huge page size support for migration arm64/mm: Enable HugeTLB migration arm64/mm: Enable HugeTLB migration for contiguous bit HugeTLB pages arch/arm64/Kconfig | 4 arch/arm64/include/asm/hugetlb.h | 5 + arch/arm64/mm/hugetlbpage.c | 20 + include/linux/hugetlb.h | 48 +--- mm/hugetlb.c | 2 +- mm/migrate.c | 2 +- 6 files changed, 76 insertions(+), 5 deletions(-) -- 2.7.4
[PATCH V3 3/5] mm/hugetlb: Enable arch specific huge page size support for migration
Architectures like arm64 have HugeTLB page sizes which are different than generic sizes at PMD, PUD, PGD level and implemented via contiguous bits. At present these special size HugeTLB pages cannot be identified through macros like (PMD|PUD|PGDIR)_SHIFT and hence chosen not be migrated. Enabling migration support for these special HugeTLB page sizes along with the generic ones (PMD|PUD|PGD) would require identifying all of them on a given platform. A platform specific hook can precisely enumerate all huge page sizes supported for migration. Instead of comparing against standard huge page orders let hugetlb_migration_support() function call a platform hook arch_hugetlb_migration_support(). Default definition for the platform hook maintains existing semantics which checks standard huge page order. But an architecture can choose to override the default and provide support for a comprehensive set of huge page sizes. Reviewed-by: Naoya Horiguchi Signed-off-by: Anshuman Khandual --- include/linux/hugetlb.h | 15 +-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 70bcd89..4cc3871 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -493,18 +493,29 @@ static inline pgoff_t basepage_index(struct page *page) extern int dissolve_free_huge_page(struct page *page); extern int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn); -static inline bool hugepage_migration_supported(struct hstate *h) -{ + #ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION +#ifndef arch_hugetlb_migration_supported +static inline bool arch_hugetlb_migration_supported(struct hstate *h) +{ if ((huge_page_shift(h) == PMD_SHIFT) || (huge_page_shift(h) == PUD_SHIFT) || (huge_page_shift(h) == PGDIR_SHIFT)) return true; else return false; +} +#endif #else +static inline bool arch_hugetlb_migration_supported(struct hstate *h) +{ return false; +} #endif + +static inline bool hugepage_migration_supported(struct hstate *h) +{ + return arch_hugetlb_migration_supported(h); } /* -- 2.7.4
[PATCH V3 2/5] mm/hugetlb: Enable PUD level huge page migration
Architectures like arm64 have PUD level HugeTLB pages for certain configs (1GB huge page is PUD based on ARM64_4K_PAGES base page size) that can be enabled for migration. It can be achieved through checking for PUD_SHIFT order based HugeTLB pages during migration. Reviewed-by: Naoya Horiguchi Signed-off-by: Anshuman Khandual --- include/linux/hugetlb.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 1b858d7..70bcd89 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -497,7 +497,8 @@ static inline bool hugepage_migration_supported(struct hstate *h) { #ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION if ((huge_page_shift(h) == PMD_SHIFT) || - (huge_page_shift(h) == PGDIR_SHIFT)) + (huge_page_shift(h) == PUD_SHIFT) || + (huge_page_shift(h) == PGDIR_SHIFT)) return true; else return false; -- 2.7.4
[PATCH V3 1/5] mm/hugetlb: Distinguish between migratability and movability
During huge page allocation it's migratability is checked to determine if it should be placed under movable zones with GFP_HIGHUSER_MOVABLE. But the movability aspect of the huge page could depend on other factors than just migratability. Movability in itself is a distinct property which should not be tied with migratability alone. This differentiates these two and implements an enhanced movability check which also considers huge page size to determine if it is feasible to be placed under a movable zone. At present it just checks for gigantic pages but going forward it can incorporate other enhanced checks. Reviewed-by: Naoya Horiguchi Suggested-by: Michal Hocko Signed-off-by: Anshuman Khandual --- include/linux/hugetlb.h | 30 ++ mm/hugetlb.c| 2 +- mm/migrate.c| 2 +- 3 files changed, 32 insertions(+), 2 deletions(-) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 087fd5f4..1b858d7 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -506,6 +506,31 @@ static inline bool hugepage_migration_supported(struct hstate *h) #endif } +/* + * Movability check is different as compared to migration check. + * It determines whether or not a huge page should be placed on + * movable zone or not. Movability of any huge page should be + * required only if huge page size is supported for migration. + * There wont be any reason for the huge page to be movable if + * it is not migratable to start with. Also the size of the huge + * page should be large enough to be placed under a movable zone + * and still feasible enough to be migratable. Just the presence + * in movable zone does not make the migration feasible. + * + * So even though large huge page sizes like the gigantic ones + * are migratable they should not be movable because its not + * feasible to migrate them from movable zone. + */ +static inline bool hugepage_movable_supported(struct hstate *h) +{ + if (!hugepage_migration_supported(h)) + return false; + + if (hstate_is_gigantic(h)) + return false; + return true; +} + static inline spinlock_t *huge_pte_lockptr(struct hstate *h, struct mm_struct *mm, pte_t *pte) { @@ -602,6 +627,11 @@ static inline bool hugepage_migration_supported(struct hstate *h) return false; } +static inline bool hugepage_movable_supported(struct hstate *h) +{ + return false; +} + static inline spinlock_t *huge_pte_lockptr(struct hstate *h, struct mm_struct *mm, pte_t *pte) { diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 5c390f5..f810cf0 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -919,7 +919,7 @@ static struct page *dequeue_huge_page_nodemask(struct hstate *h, gfp_t gfp_mask, /* Movability of hugepages depends on migration support. */ static inline gfp_t htlb_alloc_mask(struct hstate *h) { - if (hugepage_migration_supported(h)) + if (hugepage_movable_supported(h)) return GFP_HIGHUSER_MOVABLE; else return GFP_HIGHUSER; diff --git a/mm/migrate.c b/mm/migrate.c index 84381b5..bfda9e4 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1272,7 +1272,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page, struct anon_vma *anon_vma = NULL; /* -* Movability of hugepages depends on architectures and hugepage size. +* Migratability of hugepages depends on architectures and their size. * This check is necessary because some callers of hugepage migration * like soft offline and memory hotremove don't walk through page * tables or check whether the hugepage is pmd-based or not before -- 2.7.4
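As a concrete reading of the new distinction (illustration only; assumes the rest of this series is applied so that PUD level migration is enabled):

/*
 * Expected outcome of the two helpers, e.g. with 4K base pages:
 *
 *   hstate  migration_supported()  movable_supported()  htlb_alloc_mask()
 *   2M      true                   true                 GFP_HIGHUSER_MOVABLE
 *   1G      true                   false (gigantic)     GFP_HIGHUSER
 *
 * So a gigantic page remains migratable, but it is no longer placed in
 * ZONE_MOVABLE at allocation time.
 */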
[PATCH V3 4/5] arm64/mm: Enable HugeTLB migration
Let arm64 subscribe to generic HugeTLB page migration framework. Right now this only works on the following PMD and PUD level HugeTLB page sizes with various kernel base page size combinations. CONT PTEPMDCONT PMDPUD ------ 4K: NA 2M NA 1G 16K:NA32M NA 64K:NA 512M NA Reviewed-by: Naoya Horiguchi Signed-off-by: Anshuman Khandual --- arch/arm64/Kconfig | 4 1 file changed, 4 insertions(+) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index a8ae30f..4b3e269 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -1331,6 +1331,10 @@ config SYSVIPC_COMPAT def_bool y depends on COMPAT && SYSVIPC +config ARCH_ENABLE_HUGEPAGE_MIGRATION + def_bool y + depends on HUGETLB_PAGE && MIGRATION + menu "Power management options" source "kernel/power/Kconfig" -- 2.7.4
[PATCH V3 5/5] arm64/mm: Enable HugeTLB migration for contiguous bit HugeTLB pages
Let arm64 subscribe to the previously added framework in which architecture can inform whether a given huge page size is supported for migration. This just overrides the default function arch_hugetlb_migration_supported() and enables migration for all possible HugeTLB page sizes on arm64. With this, HugeTLB migration support on arm64 now covers all possible HugeTLB options. CONT PTEPMDCONT PMDPUD ------ 4K:64K 2M32M 1G 16K:2M 32M 1G 64K:2M512M16G Reviewed-by: Naoya Horiguchi Signed-off-by: Anshuman Khandual --- arch/arm64/include/asm/hugetlb.h | 5 + arch/arm64/mm/hugetlbpage.c | 20 2 files changed, 25 insertions(+) diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h index e73f685..656f70e 100644 --- a/arch/arm64/include/asm/hugetlb.h +++ b/arch/arm64/include/asm/hugetlb.h @@ -20,6 +20,11 @@ #include +#ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION +#define arch_hugetlb_migration_supported arch_hugetlb_migration_supported +extern bool arch_hugetlb_migration_supported(struct hstate *h); +#endif + static inline pte_t huge_ptep_get(pte_t *ptep) { return READ_ONCE(*ptep); diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c index 21512ca..f3afdcf 100644 --- a/arch/arm64/mm/hugetlbpage.c +++ b/arch/arm64/mm/hugetlbpage.c @@ -27,6 +27,26 @@ #include #include +#ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION +bool arch_hugetlb_migration_supported(struct hstate *h) +{ + size_t pagesize = huge_page_size(h); + + switch (pagesize) { +#ifdef CONFIG_ARM64_4K_PAGES + case PUD_SIZE: +#endif + case PMD_SIZE: + case CONT_PMD_SIZE: + case CONT_PTE_SIZE: + return true; + } + pr_warn("%s: unrecognized huge page size 0x%lx\n", + __func__, pagesize); + return false; +} +#endif + int pmd_huge(pmd_t pmd) { return pmd_val(pmd) && !(pmd_val(pmd) & PMD_TABLE_BIT); -- 2.7.4
Re: [PATCH V3 3/5] mm/hugetlb: Enable arch specific huge page size support for migration
On 10/24/2018 07:28 PM, Michal Hocko wrote: > On Wed 24-10-18 15:56:39, Michal Hocko wrote: >> On Tue 23-10-18 18:31:59, Anshuman Khandual wrote: >>> Architectures like arm64 have HugeTLB page sizes which are different than >>> generic sizes at PMD, PUD, PGD level and implemented via contiguous bits. >>> At present these special size HugeTLB pages cannot be identified through >>> macros like (PMD|PUD|PGDIR)_SHIFT and hence chosen not be migrated. >>> >>> Enabling migration support for these special HugeTLB page sizes along with >>> the generic ones (PMD|PUD|PGD) would require identifying all of them on a >>> given platform. A platform specific hook can precisely enumerate all huge >>> page sizes supported for migration. Instead of comparing against standard >>> huge page orders let hugetlb_migration_support() function call a platform >>> hook arch_hugetlb_migration_support(). Default definition for the platform >>> hook maintains existing semantics which checks standard huge page order. >>> But an architecture can choose to override the default and provide support >>> for a comprehensive set of huge page sizes. >>> >>> Reviewed-by: Naoya Horiguchi >>> Signed-off-by: Anshuman Khandual >> >> Acked-by: Michal Hocko > > fat fingers here, should be mho...@suse.com of course. Sure, no problem. As we had discussed earlier and agreed to keep the previous patch "mm/hugetlb: Enable PUD level huge page migration" separate and not fold it into this one, I will assume your ACK on that patch as well unless you disagree.
Re: [PATCH] mm/thp: Correctly differentiate between mapped THP and PMD migration entry
On 10/16/2018 08:01 PM, Zi Yan wrote: > On 15 Oct 2018, at 0:06, Anshuman Khandual wrote: > >> On 10/15/2018 06:23 AM, Zi Yan wrote: >>> On 12 Oct 2018, at 4:00, Anshuman Khandual wrote: >>> >>>> On 10/10/2018 06:13 PM, Zi Yan wrote: >>>>> On 10 Oct 2018, at 0:05, Anshuman Khandual wrote: >>>>> >>>>>> On 10/09/2018 07:28 PM, Zi Yan wrote: >>>>>>> cc: Naoya Horiguchi (who proposed to use !_PAGE_PRESENT && !_PAGE_PSE >>>>>>> for x86 >>>>>>> PMD migration entry check) >>>>>>> >>>>>>> On 8 Oct 2018, at 23:58, Anshuman Khandual wrote: >>>>>>> >>>>>>>> A normal mapped THP page at PMD level should be correctly >>>>>>>> differentiated >>>>>>>> from a PMD migration entry while walking the page table. A mapped THP >>>>>>>> would >>>>>>>> additionally check positive for pmd_present() along with >>>>>>>> pmd_trans_huge() >>>>>>>> as compared to a PMD migration entry. This just adds a new conditional >>>>>>>> test >>>>>>>> differentiating the two while walking the page table. >>>>>>>> >>>>>>>> Fixes: 616b8371539a6 ("mm: thp: enable thp migration in generic path") >>>>>>>> Signed-off-by: Anshuman Khandual >>>>>>>> --- >>>>>>>> On X86, pmd_trans_huge() and is_pmd_migration_entry() are always >>>>>>>> mutually >>>>>>>> exclusive which makes the current conditional block work for both >>>>>>>> mapped >>>>>>>> and migration entries. This is not same with arm64 where >>>>>>>> pmd_trans_huge() >>>>>>> >>>>>>> !pmd_present() && pmd_trans_huge() is used to represent THPs under >>>>>>> splitting, >>>>>> >>>>>> Not really if we just look at code in the conditional blocks. >>>>> >>>>> Yeah, I explained it wrong above. Sorry about that. >>>>> >>>>> In x86, pmd_present() checks (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE), >>>>> thus, it returns true even if the present bit is cleared but PSE bit is >>>>> set. >>>> >>>> Okay. >>>> >>>>> This is done so, because THPs under splitting are regarded as present in >>>>> the kernel >>>>> but not present when a hardware page table walker checks it. >>>> >>>> Okay. >>>> >>>>> >>>>> For PMD migration entry, which should be regarded as not present, if PSE >>>>> bit >>>>> is set, which makes pmd_trans_huge() returns true, like ARM64 does, all >>>>> PMD migration entries will be regarded as present >>>> >>>> Okay to make pmd_present() return false pmd_trans_huge() has to return >>>> false >>>> as well. Is there anything which can be done to get around this problem on >>>> X86 ? pmd_trans_huge() returning true for a migration entry sounds logical. >>>> Otherwise we would revert the condition block order to accommodate both the >>>> implementation for pmd_trans_huge() as suggested by Kirill before or just >>>> consider this patch forward. >>>> >>>> Because I am not really sure yet about the idea of getting pmd_present() >>>> check into pmd_trans_huge() on arm64 just to make it fit into this >>>> semantics >>>> as suggested by Will. If a PMD is trans huge page or not should not depend >>>> on >>>> whether it is present or not. >>> >>> In terms of THPs, we have three cases: a present THP, a THP under splitting, >>> and a THP under migration. pmd_present() and pmd_trans_huge() both return >>> true >>> for a present THP and a THP under splitting, because they discover >>> _PAGE_PSE bit >> >> Then how do we differentiate between a mapped THP and a splitting THP. > > AFAIK, in x86, there is no distinction between a mapped THP and a splitting > THP > using helper functions. 
> > A mapped THP has _PAGE_PRESENT bit and _PAGE_PSE bit set, whereas a splitting > THP > has only _PAGE_PSE bit set. But both pmd_present() and pmd_trans_huge() return > true a
Re: [PATCH] mm/thp: Correctly differentiate between mapped THP and PMD migration entry
On 10/26/2018 12:15 AM, Zi Yan wrote: > On 25 Oct 2018, at 4:10, Anshuman Khandual wrote: > >> On 10/16/2018 08:01 PM, Zi Yan wrote: >>> On 15 Oct 2018, at 0:06, Anshuman Khandual wrote: >>> >>>> On 10/15/2018 06:23 AM, Zi Yan wrote: >>>>> On 12 Oct 2018, at 4:00, Anshuman Khandual wrote: >>>>> >>>>>> On 10/10/2018 06:13 PM, Zi Yan wrote: >>>>>>> On 10 Oct 2018, at 0:05, Anshuman Khandual wrote: >>>>>>> >>>>>>>> On 10/09/2018 07:28 PM, Zi Yan wrote: >>>>>>>>> cc: Naoya Horiguchi (who proposed to use !_PAGE_PRESENT && !_PAGE_PSE >>>>>>>>> for x86 >>>>>>>>> PMD migration entry check) >>>>>>>>> >>>>>>>>> On 8 Oct 2018, at 23:58, Anshuman Khandual wrote: >>>>>>>>> >>>>>>>>>> A normal mapped THP page at PMD level should be correctly >>>>>>>>>> differentiated >>>>>>>>>> from a PMD migration entry while walking the page table. A mapped >>>>>>>>>> THP would >>>>>>>>>> additionally check positive for pmd_present() along with >>>>>>>>>> pmd_trans_huge() >>>>>>>>>> as compared to a PMD migration entry. This just adds a new >>>>>>>>>> conditional test >>>>>>>>>> differentiating the two while walking the page table. >>>>>>>>>> >>>>>>>>>> Fixes: 616b8371539a6 ("mm: thp: enable thp migration in generic >>>>>>>>>> path") >>>>>>>>>> Signed-off-by: Anshuman Khandual >>>>>>>>>> --- >>>>>>>>>> On X86, pmd_trans_huge() and is_pmd_migration_entry() are always >>>>>>>>>> mutually >>>>>>>>>> exclusive which makes the current conditional block work for both >>>>>>>>>> mapped >>>>>>>>>> and migration entries. This is not same with arm64 where >>>>>>>>>> pmd_trans_huge() >>>>>>>>> >>>>>>>>> !pmd_present() && pmd_trans_huge() is used to represent THPs under >>>>>>>>> splitting, >>>>>>>> >>>>>>>> Not really if we just look at code in the conditional blocks. >>>>>>> >>>>>>> Yeah, I explained it wrong above. Sorry about that. >>>>>>> >>>>>>> In x86, pmd_present() checks (_PAGE_PRESENT | _PAGE_PROTNONE | >>>>>>> _PAGE_PSE), >>>>>>> thus, it returns true even if the present bit is cleared but PSE bit is >>>>>>> set. >>>>>> >>>>>> Okay. >>>>>> >>>>>>> This is done so, because THPs under splitting are regarded as present >>>>>>> in the kernel >>>>>>> but not present when a hardware page table walker checks it. >>>>>> >>>>>> Okay. >>>>>> >>>>>>> >>>>>>> For PMD migration entry, which should be regarded as not present, if >>>>>>> PSE bit >>>>>>> is set, which makes pmd_trans_huge() returns true, like ARM64 does, all >>>>>>> PMD migration entries will be regarded as present >>>>>> >>>>>> Okay to make pmd_present() return false pmd_trans_huge() has to return >>>>>> false >>>>>> as well. Is there anything which can be done to get around this problem >>>>>> on >>>>>> X86 ? pmd_trans_huge() returning true for a migration entry sounds >>>>>> logical. >>>>>> Otherwise we would revert the condition block order to accommodate both >>>>>> the >>>>>> implementation for pmd_trans_huge() as suggested by Kirill before or just >>>>>> consider this patch forward. >>>>>> >>>>>> Because I am not really sure yet about the idea of getting pmd_present() >>>>>> check into pmd_trans_huge() on arm64 just to make it fit into this >>>>>> semantics >>>>>> as suggested by Will. If a PMD is
Re: [PATCH V2 2/5] mm/hugetlb: Distinguish between migratability and movability
On 10/19/2018 07:29 AM, Naoya Horiguchi wrote: > On Fri, Oct 12, 2018 at 09:29:56AM +0530, Anshuman Khandual wrote: >> During huge page allocation it's migratability is checked to determine if >> it should be placed under movable zones with GFP_HIGHUSER_MOVABLE. But the >> movability aspect of the huge page could depend on other factors than just >> migratability. Movability in itself is a distinct property which should not >> be tied with migratability alone. >> >> This differentiates these two and implements an enhanced movability check >> which also considers huge page size to determine if it is feasible to be >> placed under a movable zone. At present it just checks for gigantic pages >> but going forward it can incorporate other enhanced checks. > > (nitpicking...) > The following code just checks hugepage_migration_supported(), so maybe > s/Movability/Migratability/ is expected in the comment? > > static int unmap_and_move_huge_page(...) > { > ... > /* >* Movability of hugepages depends on architectures and hugepage > size. >* This check is necessary because some callers of hugepage > migration >* like soft offline and memory hotremove don't walk through page >* tables or check whether the hugepage is pmd-based or not before >* kicking migration. >*/ > if (!hugepage_migration_supported(page_hstate(hpage))) { > Sure, will update this patch only unless other changes are suggested.
Re: [PATCH V2 1/5] mm/hugetlb: Enable PUD level huge page migration
On 10/19/2018 01:39 PM, Michal Hocko wrote: > I planed to get to review this earlier but been busy. Anyway, this patch > should be applied only after movability one to prevent from > (theoretical) bisectability issues. Sure, I can change the order there. > > I would probably fold it into the one which defines arch specific hook. Hmm, but that would mean making two functional changes in a single patch: adding one more conditional check and, at the same time, wrapping it under a new name that is part of a new scheme. I would suggest keeping the patches separate.
Re: [PATCH RFC v3 01/35] mm: page_alloc: Add gfp_flags parameter to arch_alloc_page()
On 1/25/24 22:12, Alexandru Elisei wrote: > Extend the usefulness of arch_alloc_page() by adding the gfp_flags > parameter. Although the change here is harmless in itself, it will definitely benefit from some additional context explaining the rationale, taking into account why-how arch_alloc_page() got added particularly for s390 platform and how it's going to be used in the present proposal. > > Signed-off-by: Alexandru Elisei > --- > > Changes since rfc v2: > > * New patch. > > arch/s390/include/asm/page.h | 2 +- > arch/s390/mm/page-states.c | 2 +- > include/linux/gfp.h | 2 +- > mm/page_alloc.c | 2 +- > 4 files changed, 4 insertions(+), 4 deletions(-) > > diff --git a/arch/s390/include/asm/page.h b/arch/s390/include/asm/page.h > index 73b9c3bf377f..859f0958c574 100644 > --- a/arch/s390/include/asm/page.h > +++ b/arch/s390/include/asm/page.h > @@ -163,7 +163,7 @@ static inline int page_reset_referenced(unsigned long > addr) > > struct page; > void arch_free_page(struct page *page, int order); > -void arch_alloc_page(struct page *page, int order); > +void arch_alloc_page(struct page *page, int order, gfp_t gfp_flags); > > static inline int devmem_is_allowed(unsigned long pfn) > { > diff --git a/arch/s390/mm/page-states.c b/arch/s390/mm/page-states.c > index 01f9b39e65f5..b986c8b158e3 100644 > --- a/arch/s390/mm/page-states.c > +++ b/arch/s390/mm/page-states.c > @@ -21,7 +21,7 @@ void arch_free_page(struct page *page, int order) > __set_page_unused(page_to_virt(page), 1UL << order); > } > > -void arch_alloc_page(struct page *page, int order) > +void arch_alloc_page(struct page *page, int order, gfp_t gfp_flags) > { > if (!cmma_flag) > return; > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index de292a007138..9e8aa3d144db 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -172,7 +172,7 @@ static inline struct zonelist *node_zonelist(int nid, > gfp_t flags) > static inline void arch_free_page(struct page *page, int order) { } > #endif > #ifndef HAVE_ARCH_ALLOC_PAGE > -static inline void arch_alloc_page(struct page *page, int order) { } > +static inline void arch_alloc_page(struct page *page, int order, gfp_t > gfp_flags) { } > #endif > > struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid, > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 150d4f23b010..2c140abe5ee6 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1485,7 +1485,7 @@ inline void post_alloc_hook(struct page *page, unsigned > int order, > set_page_private(page, 0); > set_page_refcounted(page); > > - arch_alloc_page(page, order); > + arch_alloc_page(page, order, gfp_flags); > debug_pagealloc_map_pages(page, 1 << order); > > /* Otherwise LGTM.
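For context, here is a rough sketch of how an architecture could consume the new parameter. This is hypothetical arm64-side code, not part of this patch: it assumes the __GFP_TAGGED flag introduced later in the series, and reserve_tag_storage() is a helper name invented for the example.

/* Hypothetical sketch only. */
void arch_alloc_page(struct page *page, int order, gfp_t gfp_flags)
{
        if (!system_supports_mte())
                return;

        /*
         * Only allocations that asked for a tagged page need tag storage
         * reserved; everything else is left untouched.
         */
        if (gfp_flags & __GFP_TAGGED)
                reserve_tag_storage(page, order, gfp_flags);    /* invented helper */
}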
Re: [PATCH RFC v3 02/35] mm: page_alloc: Add an arch hook early in free_pages_prepare()
On 1/25/24 22:12, Alexandru Elisei wrote: > The arm64 MTE code uses the PG_arch_2 page flag, which it renames to > PG_mte_tagged, to track if a page has been mapped with tagging enabled. > That flag is cleared by free_pages_prepare() by doing: > > page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP; > > When tag storage management is added, tag storage will be reserved for a > page if and only if the page is mapped as tagged (the page flag > PG_mte_tagged is set). When a page is freed, likewise, the code will have > to look at the the page flags to determine if the page has tag storage > reserved, which should also be freed. > > For this purpose, add an arch_free_pages_prepare() hook that is called > before that page flags are cleared. The function arch_free_page() has also > been considered for this purpose, but it is called after the flags are > cleared. arch_free_pages_prepare() makes sense as a prologue to arch_free_page(). s/arch_free_pages_prepare/arch_free_page_prepare to match similar functions. > > Signed-off-by: Alexandru Elisei > --- > > Changes since rfc v2: > > * Expanded commit message (David Hildenbrand). > > include/linux/pgtable.h | 4 > mm/page_alloc.c | 1 + > 2 files changed, 5 insertions(+) > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > index f6d0e3513948..6d98d5fdd697 100644 > --- a/include/linux/pgtable.h > +++ b/include/linux/pgtable.h > @@ -901,6 +901,10 @@ static inline void arch_do_swap_page(struct mm_struct > *mm, > } > #endif > > +#ifndef __HAVE_ARCH_FREE_PAGES_PREPARE I guess new __HAVE_ARCH_ constructs are not being added lately. Instead something like '#ifndef arch_free_pages_prepare' might be better suited. > +static inline void arch_free_pages_prepare(struct page *page, int order) { } > +#endif > + > #ifndef __HAVE_ARCH_UNMAP_ONE > /* > * Some architectures support metadata associated with a page. When a > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 2c140abe5ee6..27282a1c82fe 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1092,6 +1092,7 @@ static __always_inline bool free_pages_prepare(struct > page *page, > > trace_mm_page_free(page, order); > kmsan_free_page(page, order); > + arch_free_pages_prepare(page, order); > > if (memcg_kmem_online() && PageMemcgKmem(page)) > __memcg_kmem_uncharge_page(page, order);
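Purely as an illustrative counterpart (hypothetical, not taken from this series): an arm64 hook placed here could inspect the MTE page flag before free_pages_prepare() clears it and release the associated tag storage.

/* Hypothetical sketch only; free_tag_storage() is a placeholder name. */
void arch_free_pages_prepare(struct page *page, int order)
{
        /*
         * PG_mte_tagged is still intact at this point because the hook runs
         * before free_pages_prepare() wipes PAGE_FLAGS_CHECK_AT_PREP.
         */
        if (system_supports_mte() && page_mte_tagged(page))
                free_tag_storage(page, order);  /* placeholder helper */
}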
Re: [PATCH RFC v3 03/35] mm: page_alloc: Add an arch hook to filter MIGRATE_CMA allocations
On 1/25/24 22:12, Alexandru Elisei wrote: > As an architecture might have specific requirements around the allocation > of CMA pages, add an arch hook that can disable allocations from > MIGRATE_CMA, if the allocation was otherwise allowed. > > This will be used by arm64, which will put tag storage pages on the > MIGRATE_CMA list, and tag storage pages cannot be tagged. The filter will > be used to deny using MIGRATE_CMA for __GFP_TAGGED allocations. Just wondering how allocation requests would be blocked for direct alloc_contig_range() requests ? > > Signed-off-by: Alexandru Elisei > --- > include/linux/pgtable.h | 7 +++ > mm/page_alloc.c | 3 ++- > 2 files changed, 9 insertions(+), 1 deletion(-) > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > index 6d98d5fdd697..c5ddec6b5305 100644 > --- a/include/linux/pgtable.h > +++ b/include/linux/pgtable.h > @@ -905,6 +905,13 @@ static inline void arch_do_swap_page(struct mm_struct > *mm, > static inline void arch_free_pages_prepare(struct page *page, int order) { } > #endif > > +#ifndef __HAVE_ARCH_ALLOC_CMA Same as last patch i.e __HAVE_ARCH_ALLOC_CMA could be avoided via a direct check on #ifndef arch_alloc_cma instead. > +static inline bool arch_alloc_cma(gfp_t gfp) > +{ > + return true; > +} > +#endif > + > #ifndef __HAVE_ARCH_UNMAP_ONE > /* > * Some architectures support metadata associated with a page. When a > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 27282a1c82fe..a96d47a6393e 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -3157,7 +3157,8 @@ static inline unsigned int gfp_to_alloc_flags_cma(gfp_t > gfp_mask, > unsigned int alloc_flags) > { > #ifdef CONFIG_CMA > - if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE) > + if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE && > + arch_alloc_cma(gfp_mask)) > alloc_flags |= ALLOC_CMA; > #endif > return alloc_flags;
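For reference, the arm64 override implied by the commit message would presumably be along these lines (sketch only, assuming the __GFP_TAGGED flag from later in the series):

/*
 * Sketch of the filter described above: tagged allocations must not be
 * satisfied from MIGRATE_CMA, because tag storage pages cannot themselves
 * be tagged.
 */
static inline bool arch_alloc_cma(gfp_t gfp_mask)
{
        if (gfp_mask & __GFP_TAGGED)
                return false;
        return true;
}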
Re: [PATCH RFC v3 04/35] mm: page_alloc: Partially revert "mm: page_alloc: remove stale CMA guard code"
On 1/25/24 22:12, Alexandru Elisei wrote: > The patch f945116e4e19 ("mm: page_alloc: remove stale CMA guard code") > removed the CMA filter when allocating from the MIGRATE_MOVABLE pcp list > because CMA is always allowed when __GFP_MOVABLE is set. > > With the introduction of the arch_alloc_cma() function, the above is not > true anymore, so bring back the filter. This makes sense as arch_alloc_cma() now might prevent ALLOC_CMA being assigned to alloc_flags in gfp_to_alloc_flags_cma(). > > This is a partially revert because the stale comment remains removed. > > Signed-off-by: Alexandru Elisei > --- > mm/page_alloc.c | 15 +++ > 1 file changed, 11 insertions(+), 4 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index a96d47a6393e..0fa34bcfb1af 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -2897,10 +2897,17 @@ struct page *rmqueue(struct zone *preferred_zone, > WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); > > if (likely(pcp_allowed_order(order))) { > - page = rmqueue_pcplist(preferred_zone, zone, order, > -migratetype, alloc_flags); > - if (likely(page)) > - goto out; > + /* > + * MIGRATE_MOVABLE pcplist could have the pages on CMA area and > + * we need to skip it when CMA area isn't allowed. > + */ > + if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA || > + migratetype != MIGRATE_MOVABLE) { > + page = rmqueue_pcplist(preferred_zone, zone, order, > + migratetype, alloc_flags); > + if (likely(page)) > + goto out; > + } > } > > page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
Re: [PATCH RFC v3 05/35] mm: cma: Don't append newline when generating CMA area name
On 1/25/24 22:12, Alexandru Elisei wrote: > cma->name is displayed in several CMA messages. When the name is generated > by the CMA code, don't append a newline to avoid breaking the text across > two lines. An example of such mis-formatted CMA output from dmesg could be added here in the commit message to demonstrate the problem better. > > Signed-off-by: Alexandru Elisei > --- Regardless, LGTM. Reviewed-by: Anshuman Khandual > > Changes since rfc v2: > > * New patch. This is a fix, and can be merged independently of the other > patches. Right, need not be part of this series. Hence please send it separately to the MM list. > > mm/cma.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/mm/cma.c b/mm/cma.c > index 7c09c47e530b..f49c95f8ee37 100644 > --- a/mm/cma.c > +++ b/mm/cma.c > @@ -204,7 +204,7 @@ int __init cma_init_reserved_mem(phys_addr_t base, > phys_addr_t size, > if (name) > snprintf(cma->name, CMA_MAX_NAME, name); > else > - snprintf(cma->name, CMA_MAX_NAME, "cma%d\n", cma_area_count); > + snprintf(cma->name, CMA_MAX_NAME, "cma%d", cma_area_count); > > cma->base_pfn = PFN_DOWN(base); > cma->count = size >> PAGE_SHIFT;
Re: [PATCH RFC v3 06/35] mm: cma: Make CMA_ALLOC_SUCCESS/FAIL count the number of pages
On 1/25/24 22:12, Alexandru Elisei wrote: > The CMA_ALLOC_SUCCESS, respectively CMA_ALLOC_FAIL, are increased by one > after each cma_alloc() function call. This is done even though cma_alloc() > can allocate an arbitrary number of CMA pages. When looking at > /proc/vmstat, the number of successful (or failed) cma_alloc() calls > doesn't tell much with regards to how many CMA pages were allocated via > cma_alloc() versus via the page allocator (regular allocation request or > PCP lists refill). > > This can also be rather confusing to a user who isn't familiar with the > code, since the unit of measurement for nr_free_cma is the number of pages, > but cma_alloc_success and cma_alloc_fail count the number of cma_alloc() > function calls. > > Let's make this consistent, and arguably more useful, by having > CMA_ALLOC_SUCCESS count the number of successfully allocated CMA pages, and > CMA_ALLOC_FAIL count the number of pages the cma_alloc() failed to > allocate. > > For users that wish to track the number of cma_alloc() calls, there are > tracepoints for that already implemented. > > Signed-off-by: Alexandru Elisei > --- > mm/cma.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/mm/cma.c b/mm/cma.c > index f49c95f8ee37..dbf7fe8cb1bd 100644 > --- a/mm/cma.c > +++ b/mm/cma.c > @@ -517,10 +517,10 @@ struct page *cma_alloc(struct cma *cma, unsigned long > count, > pr_debug("%s(): returned %p\n", __func__, page); > out: > if (page) { > - count_vm_event(CMA_ALLOC_SUCCESS); > + count_vm_events(CMA_ALLOC_SUCCESS, count); > cma_sysfs_account_success_pages(cma, count); > } else { > - count_vm_event(CMA_ALLOC_FAIL); > + count_vm_events(CMA_ALLOC_FAIL, count); > if (cma) > cma_sysfs_account_fail_pages(cma, count); > } Without getting into the merits of this patch - which is actually trying to do semantics change to /proc/vmstat, wondering how is this even related to this particular series ? If required this could be debated on it's on separately.
Re: [PATCH RFC v3 07/35] mm: cma: Add CMA_RELEASE_{SUCCESS,FAIL} events
On 1/25/24 22:12, Alexandru Elisei wrote: > Similar to the two events that relate to CMA allocations, add the > CMA_RELEASE_SUCCESS and CMA_RELEASE_FAIL events that count when CMA pages > are freed. How is this is going to be beneficial towards analyzing CMA alloc/release behaviour - particularly with respect to this series. OR just adding this from parity perspective with CMA alloc side counters ? Regardless this CMA change too could be discussed separately. > > Signed-off-by: Alexandru Elisei > --- > > Changes since rfc v2: > > * New patch. > > include/linux/vm_event_item.h | 2 ++ > mm/cma.c | 6 +- > mm/vmstat.c | 2 ++ > 3 files changed, 9 insertions(+), 1 deletion(-) > > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h > index 747943bc8cc2..aba5c5bf8127 100644 > --- a/include/linux/vm_event_item.h > +++ b/include/linux/vm_event_item.h > @@ -83,6 +83,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, > #ifdef CONFIG_CMA > CMA_ALLOC_SUCCESS, > CMA_ALLOC_FAIL, > + CMA_RELEASE_SUCCESS, > + CMA_RELEASE_FAIL, > #endif > UNEVICTABLE_PGCULLED, /* culled to noreclaim list */ > UNEVICTABLE_PGSCANNED, /* scanned for reclaimability */ > diff --git a/mm/cma.c b/mm/cma.c > index dbf7fe8cb1bd..543bb6b3be8e 100644 > --- a/mm/cma.c > +++ b/mm/cma.c > @@ -562,8 +562,10 @@ bool cma_release(struct cma *cma, const struct page > *pages, > { > unsigned long pfn; > > - if (!cma_pages_valid(cma, pages, count)) > + if (!cma_pages_valid(cma, pages, count)) { > + count_vm_events(CMA_RELEASE_FAIL, count); > return false; > + } > > pr_debug("%s(page %p, count %lu)\n", __func__, (void *)pages, count); > > @@ -575,6 +577,8 @@ bool cma_release(struct cma *cma, const struct page > *pages, > cma_clear_bitmap(cma, pfn, count); > trace_cma_release(cma->name, pfn, pages, count); > > + count_vm_events(CMA_RELEASE_SUCCESS, count); > + > return true; > } > > diff --git a/mm/vmstat.c b/mm/vmstat.c > index db79935e4a54..eebfd5c6c723 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -1340,6 +1340,8 @@ const char * const vmstat_text[] = { > #ifdef CONFIG_CMA > "cma_alloc_success", > "cma_alloc_fail", > + "cma_release_success", > + "cma_release_fail", > #endif > "unevictable_pgs_culled", > "unevictable_pgs_scanned",
Re: [PATCH RFC v3 01/35] mm: page_alloc: Add gfp_flags parameter to arch_alloc_page()
On 1/29/24 17:11, Alexandru Elisei wrote: > Hi, > > On Mon, Jan 29, 2024 at 11:18:59AM +0530, Anshuman Khandual wrote: >> On 1/25/24 22:12, Alexandru Elisei wrote: >>> Extend the usefulness of arch_alloc_page() by adding the gfp_flags >>> parameter. >> Although the change here is harmless in itself, it will definitely benefit >> from some additional context explaining the rationale, taking into account >> why-how arch_alloc_page() got added particularly for s390 platform and how >> it's going to be used in the present proposal. > arm64 will use it to reserve tag storage if the caller requested a tagged > page. Right now that means that __GFP_ZEROTAGS is set in the gfp mask, but > I'll rename it to __GFP_TAGGED in patch #18 ("arm64: mte: Rename > __GFP_ZEROTAGS to __GFP_TAGGED") [1]. > > [1] > https://lore.kernel.org/lkml/20240125164256.4147-19-alexandru.eli...@arm.com/ Makes sense, but please do update the commit message explaining how the new gfp mask argument will be used to detect tagged page allocation requests, which in turn require tag storage to be allocated.
Re: [PATCH RFC v3 04/35] mm: page_alloc: Partially revert "mm: page_alloc: remove stale CMA guard code"
On 1/29/24 17:16, Alexandru Elisei wrote: > Hi, > > On Mon, Jan 29, 2024 at 02:31:23PM +0530, Anshuman Khandual wrote: >> >> >> On 1/25/24 22:12, Alexandru Elisei wrote: >>> The patch f945116e4e19 ("mm: page_alloc: remove stale CMA guard code") >>> removed the CMA filter when allocating from the MIGRATE_MOVABLE pcp list >>> because CMA is always allowed when __GFP_MOVABLE is set. >>> >>> With the introduction of the arch_alloc_cma() function, the above is not >>> true anymore, so bring back the filter. >> >> This makes sense as arch_alloc_cma() now might prevent ALLOC_CMA being >> assigned to alloc_flags in gfp_to_alloc_flags_cma(). > > Can I add your Reviewed-by tag then? I think all these changes need to be reviewed in their entirety, even though some patches do look good on their own. For example, this patch depends on whether [PATCH 03/35] is acceptable or not. I would suggest separating out the CMA patches, which could be debated and merged independently of this series.
Re: [PATCH RFC v3 06/35] mm: cma: Make CMA_ALLOC_SUCCESS/FAIL count the number of pages
On 1/29/24 17:21, Alexandru Elisei wrote: > Hi, > > On Mon, Jan 29, 2024 at 02:54:20PM +0530, Anshuman Khandual wrote: >> >> >> On 1/25/24 22:12, Alexandru Elisei wrote: >>> The CMA_ALLOC_SUCCESS, respectively CMA_ALLOC_FAIL, are increased by one >>> after each cma_alloc() function call. This is done even though cma_alloc() >>> can allocate an arbitrary number of CMA pages. When looking at >>> /proc/vmstat, the number of successful (or failed) cma_alloc() calls >>> doesn't tell much with regards to how many CMA pages were allocated via >>> cma_alloc() versus via the page allocator (regular allocation request or >>> PCP lists refill). >>> >>> This can also be rather confusing to a user who isn't familiar with the >>> code, since the unit of measurement for nr_free_cma is the number of pages, >>> but cma_alloc_success and cma_alloc_fail count the number of cma_alloc() >>> function calls. >>> >>> Let's make this consistent, and arguably more useful, by having >>> CMA_ALLOC_SUCCESS count the number of successfully allocated CMA pages, and >>> CMA_ALLOC_FAIL count the number of pages the cma_alloc() failed to >>> allocate. >>> >>> For users that wish to track the number of cma_alloc() calls, there are >>> tracepoints for that already implemented. >>> >>> Signed-off-by: Alexandru Elisei >>> --- >>> mm/cma.c | 4 ++-- >>> 1 file changed, 2 insertions(+), 2 deletions(-) >>> >>> diff --git a/mm/cma.c b/mm/cma.c >>> index f49c95f8ee37..dbf7fe8cb1bd 100644 >>> --- a/mm/cma.c >>> +++ b/mm/cma.c >>> @@ -517,10 +517,10 @@ struct page *cma_alloc(struct cma *cma, unsigned long >>> count, >>> pr_debug("%s(): returned %p\n", __func__, page); >>> out: >>> if (page) { >>> - count_vm_event(CMA_ALLOC_SUCCESS); >>> + count_vm_events(CMA_ALLOC_SUCCESS, count); >>> cma_sysfs_account_success_pages(cma, count); >>> } else { >>> - count_vm_event(CMA_ALLOC_FAIL); >>> + count_vm_events(CMA_ALLOC_FAIL, count); >>> if (cma) >>> cma_sysfs_account_fail_pages(cma, count); >>> } >> >> Without getting into the merits of this patch - which is actually trying to >> do >> semantics change to /proc/vmstat, wondering how is this even related to this >> particular series ? If required this could be debated on it's on separately. > > Having the number of CMA pages allocated and the number of CMA pages freed > allows someone to infer how many tagged pages are in use at a given time: That should not be done in CMA which is a generic multi purpose allocator. > (allocated CMA pages - CMA pages allocated by drivers* - CMA pages > released) * 32. That is valuable information for software and hardware > designers. > > Besides that, for every iteration of the series, this has proven invaluable > for discovering bugs with freeing and/or reserving tag storage pages. I am afraid that might not be enough justification for getting something merged mainline. > > *that would require userspace reading cma_alloc_success and > cma_release_success before any tagged allocations are performed. While assuming that no other non-memory-tagged CMA based allocation amd free call happens in the meantime ? That would be on real thin ice. I suppose arm64 tagged memory specific allocation or free related counters need to be created on the caller side, including arch_free_pages_prepare().
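For concreteness, the arithmetic in the quoted proposal (taking the *32 multiplier quoted above at face value) works out as:

  tagged pages in use ~= (cma_alloc_success - driver CMA pages - cma_release_success) * 32

so, for example, 256 tag storage pages currently held and none attributable to drivers would correspond to 256 * 32 = 8192 tagged pages, i.e. about 32 MB of tagged memory with 4K pages.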
Re: [PATCH RFC v3 08/35] mm: cma: Introduce cma_alloc_range()
On 1/25/24 22:12, Alexandru Elisei wrote: > Today, cma_alloc() is used to allocate a contiguous memory region. The > function allows the caller to specify the number of pages to allocate, but > not the starting address. cma_alloc() will walk over the entire CMA region > trying to allocate the first available range of the specified size. > > Introduce cma_alloc_range(), which makes CMA more versatile by allowing the > caller to specify a particular range in the CMA region, defined by the > start pfn and the size. > > arm64 will make use of this function when tag storage management will be > implemented: cma_alloc_range() will be used to reserve the tag storage > associated with a tagged page. Basically, you would like to pass on a preferred start address and the allocation could just fail if a contig range is not available from such a starting address ? Then why not just change cma_alloc() to take a new argument 'start_pfn'. Why create a new but almost similar allocator ? But then I am wondering why this could not be done in the arm64 platform code itself operating on a CMA area reserved just for tag storage. Unless this new allocator has other usage beyond MTE, this could be implemented in the platform itself. > > Signed-off-by: Alexandru Elisei > --- > > Changes since rfc v2: > > * New patch. > > include/linux/cma.h| 2 + > include/trace/events/cma.h | 59 ++ > mm/cma.c | 86 ++ > 3 files changed, 147 insertions(+) > > diff --git a/include/linux/cma.h b/include/linux/cma.h > index 63873b93deaa..e32559da6942 100644 > --- a/include/linux/cma.h > +++ b/include/linux/cma.h > @@ -50,6 +50,8 @@ extern int cma_init_reserved_mem(phys_addr_t base, > phys_addr_t size, > struct cma **res_cma); > extern struct page *cma_alloc(struct cma *cma, unsigned long count, unsigned > int align, > bool no_warn); > +extern int cma_alloc_range(struct cma *cma, unsigned long start, unsigned > long count, > +unsigned tries, gfp_t gfp); > extern bool cma_pages_valid(struct cma *cma, const struct page *pages, > unsigned long count); > extern bool cma_release(struct cma *cma, const struct page *pages, unsigned > long count); > > diff --git a/include/trace/events/cma.h b/include/trace/events/cma.h > index 25103e67737c..a89af313a572 100644 > --- a/include/trace/events/cma.h > +++ b/include/trace/events/cma.h > @@ -36,6 +36,65 @@ TRACE_EVENT(cma_release, > __entry->count) > ); > > +TRACE_EVENT(cma_alloc_range_start, > + > + TP_PROTO(const char *name, unsigned long start, unsigned long count, > + unsigned tries), > + > + TP_ARGS(name, start, count, tries), > + > + TP_STRUCT__entry( > + __string(name, name) > + __field(unsigned long, start) > + __field(unsigned long, count) > + __field(unsigned, tries) > + ), > + > + TP_fast_assign( > + __assign_str(name, name); > + __entry->start = start; > + __entry->count = count; > + __entry->tries = tries; > + ), > + > + TP_printk("name=%s start=%lx count=%lu tries=%u", > + __get_str(name), > + __entry->start, > + __entry->count, > + __entry->tries) > +); > + > +TRACE_EVENT(cma_alloc_range_finish, > + > + TP_PROTO(const char *name, unsigned long start, unsigned long count, > + unsigned attempts, int err), > + > + TP_ARGS(name, start, count, attempts, err), > + > + TP_STRUCT__entry( > + __string(name, name) > + __field(unsigned long, start) > + __field(unsigned long, count) > + __field(unsigned, attempts) > + __field(int, err) > + ), > + > + TP_fast_assign( > + __assign_str(name, name); > + __entry->start = start; > + __entry->count = count; > + __entry->attempts = attempts; > + 
__entry->err = err; > + ), > + > + TP_printk("name=%s start=%lx count=%lu attempts=%u err=%d", > + __get_str(name), > + __entry->start, > + __entry->count, > + __entry->attempts, > + __entry->err) > +); > + > TRACE_EVENT(cma_alloc_start, > > TP_PROTO(const char *name, unsigned long count, unsigned int align), > diff --git a/mm/cma.c b/mm/cma.c > index 543bb6b3be8e..4a0f68b9443b 100644 > --- a/mm/cma.c > +++ b/mm/cma.c > @@ -416,6 +416,92 @@ static void cma_debug_show_areas(struct cma *cma) > static inline void cma_debug_show_areas(struct cma *cma) { } > #endif > > +/** > + * cma_alloc_range() - allocate pages in a specific range > + * @cma: Contiguous memory region for which the allocation is performed. > + * @start: Starting pfn of the allocation. > + * @count: Requested number
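For context, this is how I read the intended arm64 usage of the new allocator; a minimal sketch, where the helper name, the 'tag_cma' area and the retry count are made up for illustration and only cma_alloc_range() itself comes from this patch:

/*
 * Sketch: reserve the tag storage block backing a tagged page by asking for
 * a fixed placement inside a dedicated tag storage CMA area.
 */
static int reserve_tag_block(struct cma *tag_cma, unsigned long tag_pfn,
			     unsigned long nr_pages, gfp_t gfp)
{
	/* returns 0 on success, an errno if that exact range cannot be taken */
	return cma_alloc_range(tag_cma, tag_pfn, nr_pages, 5, gfp);
}

If this really is the only consumer, it strengthens the argument for keeping such a helper in the arm64 platform code.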
Re: [PATCH RFC v3 09/35] mm: cma: Introduce cma_remove_mem()
On 1/25/24 22:12, Alexandru Elisei wrote: > Memory is added to CMA with cma_declare_contiguous_nid() and > cma_init_reserved_mem(). This memory is then put on the MIGRATE_CMA list in > cma_init_reserved_areas(), where the page allocator can make use of it. cma_declare_contiguous_nid() reserves memory in memblock and marks the for subsequent CMA usage, where as cma_init_reserved_areas() activates these memory areas through init_cma_reserved_pageblock(). Standard page allocator only receives these memory via free_reserved_page() - only if the page block activation fails. > > If a device manages multiple CMA areas, and there's an error when one of > the areas is added to CMA, there is no mechanism for the device to prevent What kind of error ? init_cma_reserved_pageblock() fails ? But that will not happen until cma_init_reserved_areas(). > the rest of the areas, which were added before the error occured, from > being later added to the MIGRATE_CMA list. Why is this mechanism required ? cma_init_reserved_areas() scans over all CMA areas and try and activate each of them sequentially. Why is not this sufficient ? > > Add cma_remove_mem() which allows a previously reserved CMA area to be > removed and thus it cannot be used by the page allocator. Successfully activated CMA areas do not get used by the buddy allocator. > > Signed-off-by: Alexandru Elisei > --- > > Changes since rfc v2: > > * New patch. > > include/linux/cma.h | 1 + > mm/cma.c| 30 +- > 2 files changed, 30 insertions(+), 1 deletion(-) > > diff --git a/include/linux/cma.h b/include/linux/cma.h > index e32559da6942..787cbec1702e 100644 > --- a/include/linux/cma.h > +++ b/include/linux/cma.h > @@ -48,6 +48,7 @@ extern int cma_init_reserved_mem(phys_addr_t base, > phys_addr_t size, > unsigned int order_per_bit, > const char *name, > struct cma **res_cma); > +extern void cma_remove_mem(struct cma **res_cma); > extern struct page *cma_alloc(struct cma *cma, unsigned long count, unsigned > int align, > bool no_warn); > extern int cma_alloc_range(struct cma *cma, unsigned long start, unsigned > long count, > diff --git a/mm/cma.c b/mm/cma.c > index 4a0f68b9443b..2881bab12b01 100644 > --- a/mm/cma.c > +++ b/mm/cma.c > @@ -147,8 +147,12 @@ static int __init cma_init_reserved_areas(void) > { > int i; > > - for (i = 0; i < cma_area_count; i++) > + for (i = 0; i < cma_area_count; i++) { > + /* Region was removed. */ > + if (!cma_areas[i].count) > + continue; Skip previously added CMA area (now zeroed out) ? > cma_activate_area(&cma_areas[i]); > + } > > return 0; > } cma_init_reserved_areas() gets called via core_initcall(). Some how platform/device needs to call cma_remove_mem() before core_initcall() gets called ? This might be time sensitive. > @@ -216,6 +220,30 @@ int __init cma_init_reserved_mem(phys_addr_t base, > phys_addr_t size, > return 0; > } > > +/** > + * cma_remove_mem() - remove cma area > + * @res_cma: Pointer to the cma region. > + * > + * This function removes a cma region created with cma_init_reserved_mem(). > The > + * ->count is set to 0. 
> + */ > +void __init cma_remove_mem(struct cma **res_cma) > +{ > + struct cma *cma; > + > + if (WARN_ON_ONCE(!res_cma || !(*res_cma))) > + return; > + > + cma = *res_cma; > + if (WARN_ON_ONCE(!cma->count)) > + return; > + > + totalcma_pages -= cma->count; > + cma->count = 0; > + > + *res_cma = NULL; > +} > + > /** > * cma_declare_contiguous_nid() - reserve custom contiguous area > * @base: Base address of the reserved area optional, use 0 for any But first please do explain what errors the device or platform might see on a previously marked CMA area, such that removing it along the way becomes necessary to prevent its activation via cma_init_reserved_areas().
Re: [PATCH RFC v3 10/35] mm: cma: Fast track allocating memory when the pages are free
On 1/25/24 22:12, Alexandru Elisei wrote: > If the pages to be allocated are free, take them directly off the buddy > allocator, instead of going through alloc_contig_range() and avoiding > costly calls to lru_cache_disable(). > > Only allocations of the same size as the CMA region order are considered, > to avoid taking the zone spinlock for too long. > > Signed-off-by: Alexandru Elisei This patch seems to be improving standard cma_alloc() as well as the previously added new allocator i.e cma_alloc_range() - via a new helper cma_alloc_pages_fastpath(). Should not any standard cma_alloc() improvement be discussed as an independent patch separately irrespective of this series. OR it is some how related to this series which I might be missing ? > --- > > Changes since rfc v2: > > * New patch. Reworked from the rfc v2 patch #26 ("arm64: mte: Fast track > reserving tag storage when the block is free") (David Hildenbrand). > > include/linux/page-flags.h | 15 -- > mm/Kconfig | 5 + > mm/cma.c | 42 ++ > mm/memory-failure.c| 8 > mm/page_alloc.c| 23 - > 5 files changed, 73 insertions(+), 20 deletions(-) > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > index 735cddc13d20..b7237bce7446 100644 > --- a/include/linux/page-flags.h > +++ b/include/linux/page-flags.h > @@ -575,11 +575,22 @@ TESTSCFLAG(HWPoison, hwpoison, PF_ANY) > #define MAGIC_HWPOISON 0x48575053U /* HWPS */ > extern void SetPageHWPoisonTakenOff(struct page *page); > extern void ClearPageHWPoisonTakenOff(struct page *page); > -extern bool take_page_off_buddy(struct page *page); > -extern bool put_page_back_buddy(struct page *page); > +extern bool PageHWPoisonTakenOff(struct page *page); > #else > PAGEFLAG_FALSE(HWPoison, hwpoison) > +TESTSCFLAG_FALSE(HWPoison, hwpoison) > #define __PG_HWPOISON 0 > +static inline void SetPageHWPoisonTakenOff(struct page *page) { } > +static inline void ClearPageHWPoisonTakenOff(struct page *page) { } > +static inline bool PageHWPoisonTakenOff(struct page *page) > +{ > + return false; > +} > +#endif > + > +#ifdef CONFIG_WANTS_TAKE_PAGE_OFF_BUDDY > +extern bool take_page_off_buddy(struct page *page, bool poison); > +extern bool put_page_back_buddy(struct page *page, bool unpoison); > #endif > > #if defined(CONFIG_PAGE_IDLE_FLAG) && defined(CONFIG_64BIT) > diff --git a/mm/Kconfig b/mm/Kconfig > index ffc3a2ba3a8c..341cf53898db 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -745,12 +745,16 @@ config DEFAULT_MMAP_MIN_ADDR > config ARCH_SUPPORTS_MEMORY_FAILURE > bool > > +config WANTS_TAKE_PAGE_OFF_BUDDY > + bool> + > config MEMORY_FAILURE > depends on MMU > depends on ARCH_SUPPORTS_MEMORY_FAILURE > bool "Enable recovery from hardware memory errors" > select MEMORY_ISOLATION > select RAS > + select WANTS_TAKE_PAGE_OFF_BUDDY > help > Enables code to recover from some memory failures on systems > with MCA recovery. This allows a system to continue running > @@ -891,6 +895,7 @@ config CMA > depends on MMU > select MIGRATION > select MEMORY_ISOLATION > + select WANTS_TAKE_PAGE_OFF_BUDDY > help > This enables the Contiguous Memory Allocator which allows other > subsystems to allocate big physically-contiguous blocks of memory. > diff --git a/mm/cma.c b/mm/cma.c > index 2881bab12b01..15663f95d77b 100644 > --- a/mm/cma.c > +++ b/mm/cma.c > @@ -444,6 +444,34 @@ static void cma_debug_show_areas(struct cma *cma) > static inline void cma_debug_show_areas(struct cma *cma) { } > #endif > > +/* Called with the cma mutex held. 
*/ > +static int cma_alloc_pages_fastpath(struct cma *cma, unsigned long start, > + unsigned long end) > +{ > + bool success = false; > + unsigned long i, j; > + > + /* Avoid contention on the zone lock. */ > + if (start - end != 1 << cma->order_per_bit) > + return -EINVAL; > + > + for (i = start; i < end; i++) { > + if (!is_free_buddy_page(pfn_to_page(i))) > + break; > + success = take_page_off_buddy(pfn_to_page(i), false); > + if (!success) > + break; > + } > + > + if (success) > + return 0; > + > + for (j = start; j < i; j++) > + put_page_back_buddy(pfn_to_page(j), false); > + > + return -EBUSY; > +} > + > /** > * cma_alloc_range() - allocate pages in a specific range > * @cma: Contiguous memory region for which the allocation is performed. > @@ -493,7 +521,11 @@ int cma_alloc_range(struct cma *cma, unsigned long > start, unsigned long count, > > for (i = 0; i < tries; i++) { > mutex_lock(&cma_mutex); > - err = alloc_contig_range(start, start + count, MIGRATE_CMA, > gfp); > +
Re: [PATCH RFC v3 11/35] mm: Allow an arch to hook into folio allocation when VMA is known
On 1/25/24 22:12, Alexandru Elisei wrote: > arm64 uses VM_HIGH_ARCH_0 and VM_HIGH_ARCH_1 for enabling MTE for a VMA. > When VM_HIGH_ARCH_0, which arm64 renames to VM_MTE, is set for a VMA, and > the gfp flag __GFP_ZERO is present, the __GFP_ZEROTAGS gfp flag also gets > set in vma_alloc_zeroed_movable_folio(). > > Expand this to be more generic by adding an arch hook that modifes the gfp > flags for an allocation when the VMA is known. > > Note that __GFP_ZEROTAGS is ignored by the page allocator unless __GFP_ZERO > is also set; from that point of view, the current behaviour is unchanged, > even though the arm64 flag is set in more places. When arm64 will have > support to reuse the tag storage for data allocation, the uses of the > __GFP_ZEROTAGS flag will be expanded to instruct the page allocator to try > to reserve the corresponding tag storage for the pages being allocated. Right but how will pushing __GFP_ZEROTAGS addition into gfp_t flags further down via a new arch call back i.e arch_calc_vma_gfp() while still maintaining (vma->vm_flags & VM_MTE) conditionality improve the current scenario. Because the page allocator could have still analyzed alloc flags for __GFP_ZEROTAGS for any additional stuff. OR this just adds some new core MM paths to get __GFP_ZEROTAGS which was not the case earlier via this call back. > > The flags returned by arch_calc_vma_gfp() are or'ed with the flags set by > the caller; this has been done to keep an architecture from modifying the > flags already set by the core memory management code; this is similar to > how do_mmap() -> calc_vm_flag_bits() -> arch_calc_vm_flag_bits() has been > implemented. This can be revisited in the future if there's a need to do > so. > > Signed-off-by: Alexandru Elisei > --- > arch/arm64/include/asm/page.h| 5 ++--- > arch/arm64/include/asm/pgtable.h | 3 +++ > arch/arm64/mm/fault.c| 19 ++- > include/linux/pgtable.h | 7 +++ > mm/mempolicy.c | 1 + > mm/shmem.c | 5 - > 6 files changed, 23 insertions(+), 17 deletions(-) > > diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h > index 2312e6ee595f..88bab032a493 100644 > --- a/arch/arm64/include/asm/page.h > +++ b/arch/arm64/include/asm/page.h > @@ -29,9 +29,8 @@ void copy_user_highpage(struct page *to, struct page *from, > void copy_highpage(struct page *to, struct page *from); > #define __HAVE_ARCH_COPY_HIGHPAGE > > -struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma, > - unsigned long vaddr); > -#define vma_alloc_zeroed_movable_folio vma_alloc_zeroed_movable_folio > +#define vma_alloc_zeroed_movable_folio(vma, vaddr) \ > + vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false) > > void tag_clear_highpage(struct page *to); > #define __HAVE_ARCH_TAG_CLEAR_HIGHPAGE > diff --git a/arch/arm64/include/asm/pgtable.h > b/arch/arm64/include/asm/pgtable.h > index 79ce70fbb751..08f0904dbfc2 100644 > --- a/arch/arm64/include/asm/pgtable.h > +++ b/arch/arm64/include/asm/pgtable.h > @@ -1071,6 +1071,9 @@ static inline void arch_swap_restore(swp_entry_t entry, > struct folio *folio) > > #endif /* CONFIG_ARM64_MTE */ > > +#define __HAVE_ARCH_CALC_VMA_GFP > +gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp); > + > /* > * On AArch64, the cache coherency is handled via the set_pte_at() function. 
> */ > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c > index 55f6455a8284..4d3f0a870ad8 100644 > --- a/arch/arm64/mm/fault.c > +++ b/arch/arm64/mm/fault.c > @@ -937,22 +937,15 @@ void do_debug_exception(unsigned long > addr_if_watchpoint, unsigned long esr, > NOKPROBE_SYMBOL(do_debug_exception); > > /* > - * Used during anonymous page fault handling. > + * If this is called during anonymous page fault handling, and the page is > + * mapped with PROT_MTE, initialise the tags at the point of tag zeroing as > this > + * is usually faster than separate DC ZVA and STGM. > */ > -struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma, > - unsigned long vaddr) > +gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp) > { > - gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO; > - > - /* > - * If the page is mapped with PROT_MTE, initialise the tags at the > - * point of allocation and page zeroing as this is usually faster than > - * separate DC ZVA and STGM. > - */ > if (vma->vm_flags & VM_MTE) > - flags |= __GFP_ZEROTAGS; > - > - return vma_alloc_folio(flags, 0, vma, vaddr, false); > + return __GFP_ZEROTAGS; > + return 0; > } > > void tag_clear_highpage(struct page *page) > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > index c5ddec6b5305..98f81ca08cbe 100644 > --- a/include/linux/pgtable.h > +++ b/include/
Re: [PATCH RFC v3 04/35] mm: page_alloc: Partially revert "mm: page_alloc: remove stale CMA guard code"
On 1/30/24 17:27, Alexandru Elisei wrote: > Hi, > > On Tue, Jan 30, 2024 at 10:04:02AM +0530, Anshuman Khandual wrote: >> >> >> On 1/29/24 17:16, Alexandru Elisei wrote: >>> Hi, >>> >>> On Mon, Jan 29, 2024 at 02:31:23PM +0530, Anshuman Khandual wrote: >>>> >>>> >>>> On 1/25/24 22:12, Alexandru Elisei wrote: >>>>> The patch f945116e4e19 ("mm: page_alloc: remove stale CMA guard code") >>>>> removed the CMA filter when allocating from the MIGRATE_MOVABLE pcp list >>>>> because CMA is always allowed when __GFP_MOVABLE is set. >>>>> >>>>> With the introduction of the arch_alloc_cma() function, the above is not >>>>> true anymore, so bring back the filter. >>>> >>>> This makes sense as arch_alloc_cma() now might prevent ALLOC_CMA being >>>> assigned to alloc_flags in gfp_to_alloc_flags_cma(). >>> >>> Can I add your Reviewed-by tag then? >> >> I think all these changes need to be reviewed in their entirety >> even though some patches do look good on their own. For example >> this patch depends on whether [PATCH 03/35] is acceptable or not. >> >> I would suggest separating out CMA patches which could be debated >> and merged regardless of this series. > > Ah, I see, makes sense. Since basically all the core mm changes are there > to enable dynamic tag storage for arm64, I'll hold on until the series > stabilises before separating the core mm from the arm64 patches. Fair enough but at least could you please separate out this particular patch right away and send across. mm: cma: Don't append newline when generating CMA area name
Re: [PATCH RFC v3 06/35] mm: cma: Make CMA_ALLOC_SUCCESS/FAIL count the number of pages
On 1/30/24 17:28, Alexandru Elisei wrote: > Hi, > > On Tue, Jan 30, 2024 at 10:22:11AM +0530, Anshuman Khandual wrote: >> >> On 1/29/24 17:21, Alexandru Elisei wrote: >>> Hi, >>> >>> On Mon, Jan 29, 2024 at 02:54:20PM +0530, Anshuman Khandual wrote: >>>> >>>> On 1/25/24 22:12, Alexandru Elisei wrote: >>>>> The CMA_ALLOC_SUCCESS, respectively CMA_ALLOC_FAIL, are increased by one >>>>> after each cma_alloc() function call. This is done even though cma_alloc() >>>>> can allocate an arbitrary number of CMA pages. When looking at >>>>> /proc/vmstat, the number of successful (or failed) cma_alloc() calls >>>>> doesn't tell much with regards to how many CMA pages were allocated via >>>>> cma_alloc() versus via the page allocator (regular allocation request or >>>>> PCP lists refill). >>>>> >>>>> This can also be rather confusing to a user who isn't familiar with the >>>>> code, since the unit of measurement for nr_free_cma is the number of >>>>> pages, >>>>> but cma_alloc_success and cma_alloc_fail count the number of cma_alloc() >>>>> function calls. >>>>> >>>>> Let's make this consistent, and arguably more useful, by having >>>>> CMA_ALLOC_SUCCESS count the number of successfully allocated CMA pages, >>>>> and >>>>> CMA_ALLOC_FAIL count the number of pages the cma_alloc() failed to >>>>> allocate. >>>>> >>>>> For users that wish to track the number of cma_alloc() calls, there are >>>>> tracepoints for that already implemented. >>>>> >>>>> Signed-off-by: Alexandru Elisei >>>>> --- >>>>> mm/cma.c | 4 ++-- >>>>> 1 file changed, 2 insertions(+), 2 deletions(-) >>>>> >>>>> diff --git a/mm/cma.c b/mm/cma.c >>>>> index f49c95f8ee37..dbf7fe8cb1bd 100644 >>>>> --- a/mm/cma.c >>>>> +++ b/mm/cma.c >>>>> @@ -517,10 +517,10 @@ struct page *cma_alloc(struct cma *cma, unsigned >>>>> long count, >>>>> pr_debug("%s(): returned %p\n", __func__, page); >>>>> out: >>>>> if (page) { >>>>> - count_vm_event(CMA_ALLOC_SUCCESS); >>>>> + count_vm_events(CMA_ALLOC_SUCCESS, count); >>>>> cma_sysfs_account_success_pages(cma, count); >>>>> } else { >>>>> - count_vm_event(CMA_ALLOC_FAIL); >>>>> + count_vm_events(CMA_ALLOC_FAIL, count); >>>>> if (cma) >>>>> cma_sysfs_account_fail_pages(cma, count); >>>>> } >>>> Without getting into the merits of this patch - which is actually trying >>>> to do >>>> semantics change to /proc/vmstat, wondering how is this even related to >>>> this >>>> particular series ? If required this could be debated on it's on >>>> separately. >>> Having the number of CMA pages allocated and the number of CMA pages freed >>> allows someone to infer how many tagged pages are in use at a given time: >> That should not be done in CMA which is a generic multi purpose allocator. > Ah, ok. Let me rephrase that: Having the number of CMA pages allocated, the > number of failed CMA page allocations and the number of freed CMA pages > allows someone to infer how many CMA pages are in use at a given time. > That's valuable information for software designers and system > administrators, as it allows them to tune the number of CMA pages available > in a system. > > Or put another way: what would you consider to be more useful? Knowing the > number of cma_alloc()/cma_release() calls, or knowing the number of pages > that cma_alloc()/cma_release() allocated or freed? There is still value in knowing how many times cma_alloc() succeeded or failed regardless of the cumulative number pages involved over the time. Actually the count helps to understand how cma_alloc() performed overall as an allocator. 
But on the cma_release() path there is no chance of failure apart from the caller itself providing a wrong input. So there are no corresponding CMA_RELEASE_SUCCESS/CMA_RELEASE_FAIL vmstat counters there, for a reason! Coming back to CMA based pages being allocated and freed, there is already an interface via sysfs (CONFIG_CMA_SYSFS) which gets updated on the cma_alloc() path via cma_sysfs_account_success_pages() and cma_sysfs_account_fail_pages(). #ls /sys/kernel/mm/cma/ alloc_pages_fail alloc_pages_success Why could these counters not meet your requirements? Also 'struct cma' can be updated to add an element 'nr_pages_freed' to be tracked in cma_release(), providing a freed pages count as well. There are additional debugfs based elements (CONFIG_CMA_DEBUGFS) available. #ls /sys/kernel/debug/cma/ alloc base_pfn bitmap count free maxchunk order_per_bit used
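To make the 'nr_pages_freed' suggestion concrete, a rough sketch; the new field and helper names are made up here and do not exist in mm/cma.h today:

/* mm/cma.h (sketch): one more CONFIG_CMA_SYSFS counter in struct cma */
	atomic64_t nr_pages_freed;	/* pages returned via cma_release() */

/* mm/cma_sysfs.c (sketch): mirrors the existing alloc side helpers */
void cma_sysfs_account_freed_pages(struct cma *cma, unsigned long nr_pages)
{
	atomic64_add(nr_pages, &cma->nr_pages_freed);
}

/* mm/cma.c (sketch): bump it on the cma_release() success path only */
	cma_sysfs_account_freed_pages(cma, count);

That would give page granularity on the release side through the same sysfs interface, without touching /proc/vmstat.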
Re: [PATCH RFC v3 07/35] mm: cma: Add CMA_RELEASE_{SUCCESS,FAIL} events
On 1/29/24 17:23, Alexandru Elisei wrote: > Hi, > > On Mon, Jan 29, 2024 at 03:01:24PM +0530, Anshuman Khandual wrote: >> >> On 1/25/24 22:12, Alexandru Elisei wrote: >>> Similar to the two events that relate to CMA allocations, add the >>> CMA_RELEASE_SUCCESS and CMA_RELEASE_FAIL events that count when CMA pages >>> are freed. >> How is this is going to be beneficial towards analyzing CMA alloc/release >> behaviour - particularly with respect to this series. OR just adding this >> from parity perspective with CMA alloc side counters ? Regardless this >> CMA change too could be discussed separately. > Added for parity and because it's useful for this series (see my reply to > the previous patch where I discuss how I've used the counters). As mentioned earlier, a new CONFIG_CMA_SYSFS element 'cma->nr_freed_pages' could be instrumented in cma_release()'s success path for this purpose. But again the failure path is not of much value as it could only happen when there is an invalid input from the caller i.e when cma_pages_valid() check fails.
Re: [PATCH RFC v3 08/35] mm: cma: Introduce cma_alloc_range()
On 1/30/24 17:05, Alexandru Elisei wrote: > Hi, > > On Tue, Jan 30, 2024 at 10:50:00AM +0530, Anshuman Khandual wrote: >> >> On 1/25/24 22:12, Alexandru Elisei wrote: >>> Today, cma_alloc() is used to allocate a contiguous memory region. The >>> function allows the caller to specify the number of pages to allocate, but >>> not the starting address. cma_alloc() will walk over the entire CMA region >>> trying to allocate the first available range of the specified size. >>> >>> Introduce cma_alloc_range(), which makes CMA more versatile by allowing the >>> caller to specify a particular range in the CMA region, defined by the >>> start pfn and the size. >>> >>> arm64 will make use of this function when tag storage management will be >>> implemented: cma_alloc_range() will be used to reserve the tag storage >>> associated with a tagged page. >> Basically, you would like to pass on a preferred start address and the >> allocation could just fail if a contig range is not available from such >> a starting address ? >> >> Then why not just change cma_alloc() to take a new argument 'start_pfn'. >> Why create a new but almost similar allocator ? > I tried doing that, and I gave up because: > > - It made cma_alloc() even more complex and hard to follow. > > - What value should 'start_pfn' be to tell cma_alloc() that it should be > ignored? Or, to put it another way, what pfn number is invalid on **all** > platforms that Linux supports? > > I can give it another go if we can come up with an invalid value for > 'start_pfn'. Something negative might work. How about -1/-1UL ? A quick search gives some instances such as ... git grep "pfn == -1" mm/mm_init.c: if (*start_pfn == -1UL) mm/vmscan.c:if (pfn == -1) mm/vmscan.c:if (pfn == -1) mm/vmscan.c:if (pfn == -1) tools/testing/selftests/mm/hugepage-vmemmap.c: if (pfn == -1UL) { Could not -1UL be abstracted as common macro MM_INVALID_PFN to be used in such scenarios including here ? > >> But then I am wondering why this could not be done in the arm64 platform >> code itself operating on a CMA area reserved just for tag storage. Unless >> this new allocator has other usage beyond MTE, this could be implemented >> in the platform itself. > I had the same idea in the previous iteration, David Hildenbrand suggested > this approach [1]. > > [1] > https://lore.kernel.org/linux-fsdevel/2aafd53f-af1f-45f3-a08c-d11962254...@redhat.com/ There are two different cma_alloc() proposals here - including the next patch i.e mm: cma: Fast track allocating memory when the pages are free 1) Augment cma_alloc() or add cma_alloc_range() with start_pfn parameter 2) Speed up cma_alloc() for small allocation requests when pages are free The second one if separated out from this series could be considered on its own as it will help all existing cma_alloc() callers. The first one definitely needs an use case as provided in this series.
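For the record, the kind of thing I have in mind looks roughly like the sketch below; MM_INVALID_PFN and the extended cma_alloc() prototype are hypothetical, not existing code:

/* include/linux/mm.h (sketch): a common sentinel for "no pfn" */
#define MM_INVALID_PFN	(~0UL)

/*
 * include/linux/cma.h (sketch): cma_alloc() grows an optional placement hint.
 * Passing MM_INVALID_PFN keeps today's "search the whole area" behaviour.
 */
struct page *cma_alloc(struct cma *cma, unsigned long count, unsigned int align,
		       unsigned long start_pfn, bool no_warn);

/* existing callers that do not care about placement */
page = cma_alloc(cma, count, align, MM_INVALID_PFN, false);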
Re: [PATCH RFC v3 11/35] mm: Allow an arch to hook into folio allocation when VMA is known
On 1/30/24 17:04, Alexandru Elisei wrote: > Hi, > > On Tue, Jan 30, 2024 at 03:25:20PM +0530, Anshuman Khandual wrote: >> >> On 1/25/24 22:12, Alexandru Elisei wrote: >>> arm64 uses VM_HIGH_ARCH_0 and VM_HIGH_ARCH_1 for enabling MTE for a VMA. >>> When VM_HIGH_ARCH_0, which arm64 renames to VM_MTE, is set for a VMA, and >>> the gfp flag __GFP_ZERO is present, the __GFP_ZEROTAGS gfp flag also gets >>> set in vma_alloc_zeroed_movable_folio(). >>> >>> Expand this to be more generic by adding an arch hook that modifes the gfp >>> flags for an allocation when the VMA is known. >>> >>> Note that __GFP_ZEROTAGS is ignored by the page allocator unless __GFP_ZERO >>> is also set; from that point of view, the current behaviour is unchanged, >>> even though the arm64 flag is set in more places. When arm64 will have >>> support to reuse the tag storage for data allocation, the uses of the >>> __GFP_ZEROTAGS flag will be expanded to instruct the page allocator to try >>> to reserve the corresponding tag storage for the pages being allocated. >> Right but how will pushing __GFP_ZEROTAGS addition into gfp_t flags further >> down via a new arch call back i.e arch_calc_vma_gfp() while still maintaining >> (vma->vm_flags & VM_MTE) conditionality improve the current scenario. Because > I'm afraid I don't follow you. I was just asking whether the overall scope of __GFP_ZEROTAGS flag is being increased to cover more core MM paths through this patch. I think you have already answered that below. > >> the page allocator could have still analyzed alloc flags for __GFP_ZEROTAGS >> for any additional stuff. >> >> OR this just adds some new core MM paths to get __GFP_ZEROTAGS which was not >> the case earlier via this call back. > Before this patch: vma_alloc_zeroed_movable_folio() sets __GFP_ZEROTAGS. > After this patch: vma_alloc_folio() sets __GFP_ZEROTAGS. Understood. > > This patch is about adding __GFP_ZEROTAGS for more callers. Right, I guess that is the real motivation for this patch. But just wondering does this cover all possible anon fault paths for converting given vma_flag's VM_MTE flag into page alloc flag __GFP_ZEROTAGS ? Aren't there any other file besides (mm/shmem.c) which needs to be changed to include arch_calc_vma_gfp() ?
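Just to confirm my reading of the caller side after this patch, a sketch based on the hunks above (vmf->address stands in for whatever address the real call site uses):

	gfp_t gfp = GFP_HIGHUSER_MOVABLE | __GFP_ZERO;

	/*
	 * The arch hook can only add flags (e.g. __GFP_ZEROTAGS for a VM_MTE
	 * vma); it cannot clear anything core MM has already decided.
	 */
	gfp |= arch_calc_vma_gfp(vma, gfp);
	folio = vma_alloc_folio(gfp, 0, vma, vmf->address, false);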
Re: [PATCH RFC v3 09/35] mm: cma: Introduce cma_remove_mem()
On 1/30/24 17:03, Alexandru Elisei wrote: > Hi, > > I really appreciate the feedback you have given me so far. I believe the > commit message isn't clear enough and there has been a confusion. > > A CMA user adds a CMA area to the cma_areas array with > cma_declare_contiguous_nid() or cma_init_reserved_mem(). > init_cma_reserved_pageblock() then iterates over the array and activates > all cma areas. Agreed. > > The function cma_remove_mem() is intended to be used to remove a cma area > from the cma_areas array **before** the area has been activated. Understood. > > Usecase: a driver (in this case, the arm64 dynamic tag storage code) > manages several cma areas. The driver successfully adds the first area to > the cma_areas array. When the driver tries to adds the second area, the > function fails. Without cma_remove_mem(), the driver has no way to prevent > the first area from being freed to the page allocator. cma_remove_mem() is > about providing a means to do cleanup in case of error. > > Does that make more sense now? How to ensure that cma_remove_mem() should get called by the driver before core_initcall()---> cma_init_reserved_areas()---> cma_activate_area() chain happens. Else cma_remove_mem() will miss out to clear cma->count and given area will proceed to get activated like always. > > Ok Tue, Jan 30, 2024 at 11:20:56AM +0530, Anshuman Khandual wrote: >> >> >> On 1/25/24 22:12, Alexandru Elisei wrote: >>> Memory is added to CMA with cma_declare_contiguous_nid() and >>> cma_init_reserved_mem(). This memory is then put on the MIGRATE_CMA list in >>> cma_init_reserved_areas(), where the page allocator can make use of it. >> >> cma_declare_contiguous_nid() reserves memory in memblock and marks the > > You forgot about about cma_init_reserved_mem() which does the same thing, > but yes, you are right. Agreed, missed that. There are some direct cma_init_reserved_mem() calls as well. > >> for subsequent CMA usage, where as cma_init_reserved_areas() activates >> these memory areas through init_cma_reserved_pageblock(). Standard page >> allocator only receives these memory via free_reserved_page() - only if > > I don't think that's correct. init_cma_reserved_pageblock() clears the > PG_reserved page flag, sets the migratetype to MIGRATE_CMA and then frees > the page. After that, the page is available to the standard page allocator > to use for allocation. Otherwise, what would be the point of the > MIGRATE_CMA migratetype? Understood and agreed. > >> the page block activation fails. > > For the sake of having a complete picture, I'll add that that only happens > if cma->reserve_pages_on_error is false. If the CMA user sets the field to > 'true' (with cma_reserve_pages_on_error()), then the pages in the CMA > region are kept PG_reserved if activation fails. Why cannot you use cma_reserve_pages_on_error() ? > >> >>> >>> If a device manages multiple CMA areas, and there's an error when one of >>> the areas is added to CMA, there is no mechanism for the device to prevent >> >> What kind of error ? init_cma_reserved_pageblock() fails ? But that will >> not happen until cma_init_reserved_areas(). > > I think I haven't been clear enough. When I say that "an area is added > to CMA", I mean that the memory region is added to cma_areas array, via > cma_declare_contiguous_nid() or cma_init_reserved_mem(). There are several > ways in which either function can fail. Okay. > >> >>> the rest of the areas, which were added before the error occured, from >>> being later added to the MIGRATE_CMA list. 
>> >> Why is this mechanism required ? cma_init_reserved_areas() scans over all >> CMA areas and try and activate each of them sequentially. Why is not this >> sufficient ? > > This patch is about removing a struct cma from the cma_areas array after it > has been added to the array, with cma_declare_contiguous_nid() or > cma_init_reserved_mem(), to prevent the area from being activated in > cma_init_reserved_areas(). Sorry for the confusion. > > I'll add a check in cma_remove_mem() to fail if the cma area has been > activated, and a comment to the function to explain its usage. That will be a good check. > >> >>> >>> Add cma_remove_mem() which allows a previously reserved CMA area to be >>> removed and thus it cannot be used by the page allocator. >> >> Successfully activated CMA areas do not get used by the buddy allocator. > > I don't believe that is correct, see above. Apologies, it's my bad. > >> >>
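Something along these lines would be good enough from my perspective; a sketch on top of this patch, assuming the bitmap is only allocated once the area gets activated:

void __init cma_remove_mem(struct cma **res_cma)
{
	struct cma *cma;

	if (WARN_ON_ONCE(!res_cma || !(*res_cma)))
		return;

	cma = *res_cma;

	/* Refuse to remove an area that has already been activated. */
	if (WARN_ON_ONCE(cma->bitmap))
		return;

	if (WARN_ON_ONCE(!cma->count))
		return;

	totalcma_pages -= cma->count;
	cma->count = 0;
	*res_cma = NULL;
}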
Re: [PATCH RFC v3 12/35] mm: Call arch_swap_prepare_to_restore() before arch_swap_restore()
On 1/25/24 22:12, Alexandru Elisei wrote: > arm64 uses arch_swap_restore() to restore saved tags before the page is > swapped in and it's called in atomic context (with the ptl lock held). > > Introduce arch_swap_prepare_to_restore() that will allow an architecture to > perform extra work during swap in and outside of a critical section. > This will be used by arm64 to allocate a buffer in memory where to > temporarily save tags if tag storage is not available for the page being > swapped in. Just wondering if tag storage will always be unavailable for tagged pages being swapped in? OR are there cases where an allocation might not even be required? Does this prepare phase need to be outside the critical section only because there might be memory allocations?
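In other words, is the intended split in do_swap_page() something like the below? This is illustrative only; the exact signature of arch_swap_prepare_to_restore() is an assumption on my part.

	/* outside the page table lock: may sleep, may allocate memory */
	ret = arch_swap_prepare_to_restore(entry, folio);
	if (ret)
		return ret;

	/* ... take vmf->ptl ... */

	/* under the ptl: atomic context, no allocations allowed */
	arch_swap_restore(entry, folio);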
Re: [PATCH RFC v3 13/35] mm: memory: Introduce fault-on-access mechanism for pages
On 1/25/24 22:12, Alexandru Elisei wrote: > Introduce a mechanism that allows an architecture to trigger a page fault, > and add the infrastructure to handle that fault accordingly. To use make> use > of this, an arch is expected to mark the table entry as PAGE_NONE (which > will cause a fault next time it is accessed) and to implement an > arch-specific method (like a software bit) for recognizing that the fault > needs to be handled by the arch code. > > arm64 will use of this approach to reserve tag storage for pages which are > mapped in an MTE enabled VMA, but the storage needed to store tags isn't > reserved (for example, because of an mprotect(PROT_MTE) call on a VMA with > existing pages). Just to summerize - So platform will create NUMA balancing like page faults - via marking existing mappings with PAGE_NONE permission, when the subsequent fault happens identify such cases via a software bit in the page table entry and then route the fault to the platform code itself for special purpose page fault handling where page might come from some reserved areas instead. Some questions - How often PAGE_NONE is to be marked for applicable MTE VMA based mappings - Is it periodic like NUMA balancing or just one time for tag storage - How this is going to interact with NUMA balancing given both use PAGE_NONE - How to differentiate these mappings from standard pte_protnone() > > Signed-off-by: Alexandru Elisei > --- > > Changes since rfc v2: > > * New patch. Split from patch #19 ("mm: mprotect: Introduce > PAGE_FAULT_ON_ACCESS > for mprotect(PROT_MTE)") (David Hildenbrand). > > include/linux/huge_mm.h | 4 ++-- > include/linux/pgtable.h | 47 +++-- > mm/Kconfig | 3 +++ > mm/huge_memory.c| 36 + > mm/memory.c | 51 ++--- > 5 files changed, 109 insertions(+), 32 deletions(-) > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > index 5adb86af35fc..4678a0a5e6a8 100644 > --- a/include/linux/huge_mm.h > +++ b/include/linux/huge_mm.h > @@ -346,7 +346,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct > *vma, unsigned long addr, > struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long > addr, > pud_t *pud, int flags, struct dev_pagemap **pgmap); > > -vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf); > +vm_fault_t handle_huge_pmd_protnone(struct vm_fault *vmf); > > extern struct page *huge_zero_page; > extern unsigned long huge_zero_pfn; > @@ -476,7 +476,7 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t *pud, > return NULL; > } > > -static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf) > +static inline vm_fault_t handle_huge_pmd_protnone(struct vm_fault *vmf) > { > return 0; > } > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > index 2d0f04042f62..81a21be855a2 100644 > --- a/include/linux/pgtable.h > +++ b/include/linux/pgtable.h > @@ -1455,7 +1455,7 @@ static inline int pud_trans_unstable(pud_t *pud) > return 0; > } > > -#ifndef CONFIG_NUMA_BALANCING > +#if !defined(CONFIG_NUMA_BALANCING) && > !defined(CONFIG_ARCH_HAS_FAULT_ON_ACCESS) > /* > * In an inaccessible (PROT_NONE) VMA, pte_protnone() may indicate "yes". 
It > is > * perfectly valid to indicate "no" in that case, which is why our default > @@ -1477,7 +1477,50 @@ static inline int pmd_protnone(pmd_t pmd) > { > return 0; > } > -#endif /* CONFIG_NUMA_BALANCING */ > +#endif /* !CONFIG_NUMA_BALANCING && !CONFIG_ARCH_HAS_FAULT_ON_ACCESS */ > + > +#ifndef CONFIG_ARCH_HAS_FAULT_ON_ACCESS > +static inline bool arch_fault_on_access_pte(pte_t pte) > +{ > + return false; > +} > + > +static inline bool arch_fault_on_access_pmd(pmd_t pmd) > +{ > + return false; > +} > + > +/* > + * The function is called with the fault lock held and an elevated reference > on > + * the folio. > + * > + * Rules that an arch implementation of the function must follow: > + * > + * 1. The function must return with the elevated reference dropped. > + * > + * 2. If the return value contains VM_FAULT_RETRY or VM_FAULT_COMPLETED then: > + * > + * - if FAULT_FLAG_RETRY_NOWAIT is not set, the function must return with the > + * correct fault lock released, which can be accomplished with > + * release_fault_lock(vmf). Note that release_fault_lock() doesn't check if > + * FAULT_FLAG_RETRY_NOWAIT is set before releasing the mmap_lock. > + * > + * - if FAULT_FLAG_RETRY_NOWAIT is set, then the function must not release > the > + * mmap_lock. The flag should be set only if the mmap_lock is held. > + * > + * 3. If the return value contains neither of the above, the function must > not > + * release the fault lock; the generic fault handler will take care of > releasing > + * the correct lock. > + */ > +static inline vm_fault_t arch_handle_folio_fault_on_access(struct folio > *
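To be clear about the last question, I am thinking of the differentiation in the generic fault path, roughly as below. This is an illustrative sketch only, not necessarily how the patch wires it up, and handle_fault_on_access() is a made up wrapper name:

	/*
	 * An arch that opts in claims the fault first, typically by testing a
	 * software bit in the pte, so a PAGE_NONE entry is treated as a NUMA
	 * hinting fault only when the arch does not recognise it.
	 */
	if (arch_fault_on_access_pte(vmf->orig_pte))
		return handle_fault_on_access(vmf);

	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
		return do_numa_page(vmf);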
Re: [PATCH RFC v3 31/35] khugepaged: arm64: Don't collapse MTE enabled VMAs
On 1/25/24 22:12, Alexandru Elisei wrote: > copy_user_highpage() will do memory allocation if there are saved tags for > the destination page, and the page is missing tag storage. > > After commit a349d72fd9ef ("mm/pgtable: add rcu_read_lock() and > rcu_read_unlock()s"), collapse_huge_page() calls > __collapse_huge_page_copy() -> .. -> copy_user_highpage() with the RCU lock > held, which means that copy_user_highpage() can only allocate memory using > GFP_ATOMIC or equivalent. > > Get around this by refusing to collapse pages into a transparent huge page > if the VMA is MTE-enabled. Makes sense when copy_user_highpage() will allocate memory for tag storage. > > Signed-off-by: Alexandru Elisei > --- > > Changes since rfc v2: > > * New patch. I think an agreement on whether copy*_user_highpage() should be > always allowed to sleep, or should not be allowed, would be useful. This is a good question ! Even after preventing the collapse of MTE VMA here, there still might be more paths where a sleeping (i.e memory allocating) copy*_user_highpage() becomes problematic ? > > arch/arm64/include/asm/pgtable.h| 3 +++ > arch/arm64/kernel/mte_tag_storage.c | 5 + > include/linux/khugepaged.h | 5 + > mm/khugepaged.c | 4 > 4 files changed, 17 insertions(+) > > diff --git a/arch/arm64/include/asm/pgtable.h > b/arch/arm64/include/asm/pgtable.h > index 87ae59436162..d0473538c926 100644 > --- a/arch/arm64/include/asm/pgtable.h > +++ b/arch/arm64/include/asm/pgtable.h > @@ -1120,6 +1120,9 @@ static inline bool arch_alloc_cma(gfp_t gfp_mask) > return true; > } > > +bool arch_hugepage_vma_revalidate(struct vm_area_struct *vma, unsigned long > address); > +#define arch_hugepage_vma_revalidate arch_hugepage_vma_revalidate > + > #endif /* CONFIG_ARM64_MTE_TAG_STORAGE */ > #endif /* CONFIG_ARM64_MTE */ > > diff --git a/arch/arm64/kernel/mte_tag_storage.c > b/arch/arm64/kernel/mte_tag_storage.c > index ac7b9c9c585c..a99959b70573 100644 > --- a/arch/arm64/kernel/mte_tag_storage.c > +++ b/arch/arm64/kernel/mte_tag_storage.c > @@ -636,3 +636,8 @@ void arch_alloc_page(struct page *page, int order, gfp_t > gfp) > if (tag_storage_enabled() && alloc_requires_tag_storage(gfp)) > reserve_tag_storage(page, order, gfp); > } > + > +bool arch_hugepage_vma_revalidate(struct vm_area_struct *vma, unsigned long > address) > +{ > + return !(vma->vm_flags & VM_MTE); > +} > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h > index f68865e19b0b..461e4322dff2 100644 > --- a/include/linux/khugepaged.h > +++ b/include/linux/khugepaged.h > @@ -38,6 +38,11 @@ static inline void khugepaged_exit(struct mm_struct *mm) > if (test_bit(MMF_VM_HUGEPAGE, &mm->flags)) > __khugepaged_exit(mm); > } > + > +#ifndef arch_hugepage_vma_revalidate > +#define arch_hugepage_vma_revalidate(vma, address) 1 Please replace s/1/true as arch_hugepage_vma_revalidate() returns bool ? > +#endif Right, above construct is much better than __HAVE_ARCH_ based one. 
> + > #else /* CONFIG_TRANSPARENT_HUGEPAGE */ > static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct > *oldmm) > { > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index 2b219acb528e..cb9a9ddb4d86 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -935,6 +935,10 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, > unsigned long address, >*/ > if (expect_anon && (!(*vmap)->anon_vma || !vma_is_anonymous(*vmap))) > return SCAN_PAGE_ANON; > + > + if (!arch_hugepage_vma_revalidate(vma, address)) > + return SCAN_VMA_CHECK; > + > return SCAN_SUCCEED; > } > Otherwise this LGTM.
Re: [PATCH RFC v3 30/35] arm64: mte: ptrace: Handle pages with missing tag storage
On 1/25/24 22:12, Alexandru Elisei wrote: > A page can end up mapped in a MTE enabled VMA without the corresponding tag > storage block reserved. Tag accesses made by ptrace in this case can lead > to the wrong tags being read or memory corruption for the process that is > using the tag storage memory as data. > > Reserve tag storage by treating ptrace accesses like a fault. > > Signed-off-by: Alexandru Elisei > --- > > Changes since rfc v2: > > * New patch, issue reported by Peter Collingbourne. > > arch/arm64/kernel/mte.c | 26 -- > 1 file changed, 24 insertions(+), 2 deletions(-) > > diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c > index faf09da3400a..b1fa02dad4fd 100644 > --- a/arch/arm64/kernel/mte.c > +++ b/arch/arm64/kernel/mte.c > @@ -412,10 +412,13 @@ static int __access_remote_tags(struct mm_struct *mm, > unsigned long addr, > while (len) { > struct vm_area_struct *vma; > unsigned long tags, offset; > + unsigned int fault_flags; > + struct page *page; > + vm_fault_t ret; > void *maddr; > - struct page *page = get_user_page_vma_remote(mm, addr, > - gup_flags, &vma); > > +get_page: > + page = get_user_page_vma_remote(mm, addr, gup_flags, &vma); But if there is valid page returned here in the first GUP attempt, will there still be a subsequent handle_mm_fault() on the same vma and addr ? > if (IS_ERR(page)) { > err = PTR_ERR(page); > break; > @@ -433,6 +436,25 @@ static int __access_remote_tags(struct mm_struct *mm, > unsigned long addr, > put_page(page); > break; > } > + > + if (tag_storage_enabled() && !page_tag_storage_reserved(page)) { Should not '!page' be checked here as well ? > + fault_flags = FAULT_FLAG_DEFAULT | \ > + FAULT_FLAG_USER | \ > + FAULT_FLAG_REMOTE | \ > + FAULT_FLAG_ALLOW_RETRY | \ > + FAULT_FLAG_RETRY_NOWAIT; > + if (write) > + fault_flags |= FAULT_FLAG_WRITE; > + > + put_page(page); > + ret = handle_mm_fault(vma, addr, fault_flags, NULL); > + if (ret & VM_FAULT_ERROR) { > + err = -EFAULT; > + break; > + } > + goto get_page; > + } > + > WARN_ON_ONCE(!page_mte_tagged(page)); > > /* limit access to the end of the page */
Re: [PATCH] arm64: mm: correct the start of physical address in linear map
Hello Pavel, On 2/13/21 6:53 AM, Pavel Tatashin wrote: > Memory hotplug may fail on systems with CONFIG_RANDOMIZE_BASE because the > linear map range is not checked correctly. > > The start physical address that linear map covers can be actually at the > end of the range because of randmomization. Check that and if so reduce it > to 0. Looking at the code, this seems possible if memstart_addr which is a signed value becomes large (after falling below 0) during arm64_memblock_init(). > > This can be verified on QEMU with setting kaslr-seed to ~0ul: > > memstart_offset_seed = 0x > START: __pa(_PAGE_OFFSET(vabits_actual)) = 9000c000 > END: __pa(PAGE_END - 1) = 1000bfff > > Signed-off-by: Pavel Tatashin > Fixes: 58284a901b42 ("arm64/mm: Validate hotplug range before creating linear > mapping") > --- > arch/arm64/mm/mmu.c | 15 +-- > 1 file changed, 13 insertions(+), 2 deletions(-) > > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c > index ae0c3d023824..6057ecaea897 100644 > --- a/arch/arm64/mm/mmu.c > +++ b/arch/arm64/mm/mmu.c > @@ -1444,14 +1444,25 @@ static void __remove_pgd_mapping(pgd_t *pgdir, > unsigned long start, u64 size) > > static bool inside_linear_region(u64 start, u64 size) > { > + u64 start_linear_pa = __pa(_PAGE_OFFSET(vabits_actual)); > + u64 end_linear_pa = __pa(PAGE_END - 1); > + > + /* > + * Check for a wrap, it is possible because of randomized linear mapping > + * the start physical address is actually bigger than the end physical > + * address. In this case set start to zero because [0, end_linear_pa] > + * range must still be able to cover all addressable physical addresses. > + */ If this is possible only with randomized linear mapping, could you please add IS_ENABLED(CONFIG_RANDOMIZED_BASE) during the switch over. Wondering if WARN_ON(start_linear_pa > end_linear_pa) should be added otherwise i.e when linear mapping randomization is not enabled. > + if (start_linear_pa > end_linear_pa) > + start_linear_pa = 0; This looks okay but will double check and give it some more testing. > + > /* >* Linear mapping region is the range [PAGE_OFFSET..(PAGE_END - 1)] >* accommodating both its ends but excluding PAGE_END. Max physical >* range which can be mapped inside this linear mapping range, must >* also be derived from its end points. >*/ > - return start >= __pa(_PAGE_OFFSET(vabits_actual)) && > -(start + size - 1) <= __pa(PAGE_END - 1); > + return start >= start_linear_pa && (start + size - 1) <= end_linear_pa; > } > > int arch_add_memory(int nid, u64 start, u64 size, > - Anshuman
Re: [PATCH V3 11/14] coresight: sink: Add TRBE driver
On 2/12/21 10:27 PM, Mathieu Poirier wrote: > [...] > >>> >>> + if (nr_pages < 2) + return NULL; + + buf = kzalloc_node(sizeof(*buf), GFP_KERNEL, trbe_alloc_node(event)); + if (IS_ERR(buf)) + return ERR_PTR(-ENOMEM); + + pglist = kcalloc(nr_pages, sizeof(*pglist), GFP_KERNEL); + if (IS_ERR(pglist)) { + kfree(buf); + return ERR_PTR(-ENOMEM); + } + + for (i = 0; i < nr_pages; i++) + pglist[i] = virt_to_page(pages[i]); + + buf->trbe_base = (unsigned long) vmap(pglist, nr_pages, VM_MAP, PAGE_KERNEL); + if (IS_ERR((void *)buf->trbe_base)) { >>> >>> Why not simply make buf->trbe_base a void * instead of having to do all this >> >> There are many arithmetic and comparison operations involving trbe_base >> element. Hence it might be better to keep it as unsigned long, also to >> keeps it consistent with other pointers i.e trbe_write, trbe_limit. > > That is a fair point. Please add a comment to explain your design choice and > make sure the sparse checker is happy with all of it. Added a comment. > >> >> Snippet from $cat drivers/hwtracing/coresight/coresight-trbe.c | grep >> "trbe_base" >> There are just two places type casting trbe_base back to (void *). >> >> memset((void *)buf->trbe_base + head, ETE_IGNORE_PACKET, len); >> return buf->trbe_base + offset; >> WARN_ON(buf->trbe_write < buf->trbe_base); >> set_trbe_base_pointer(buf->trbe_base); >> buf->trbe_base = (unsigned long)vmap(pglist, nr_pages, VM_MAP, >> PAGE_KERNEL); >> if (IS_ERR((void *)buf->trbe_base)) { >> return ERR_PTR(buf->trbe_base); >> buf->trbe_limit = buf->trbe_base + nr_pages * PAGE_SIZE; >> buf->trbe_write = buf->trbe_base; >> vunmap((void *)buf->trbe_base); >> base = get_trbe_base_pointer(); >> buf->trbe_write = buf->trbe_base + PERF_IDX2OFF(handle->head, buf); >> if (buf->trbe_limit == buf->trbe_base) { >> buf->trbe_write = buf->trbe_base + PERF_IDX2OFF(handle->head, buf); >> if (buf->trbe_limit == buf->trbe_base) { >> offset = get_trbe_limit_pointer() - get_trbe_base_pointer(); >> buf->trbe_write = buf->trbe_base + PERF_IDX2OFF(handle->head, buf); >> if (buf->trbe_limit == buf->trbe_base) { >> WARN_ON(buf->trbe_base != get_trbe_base_pointer()); >> if (get_trbe_write_pointer() == get_trbe_base_pointer()) >> >>> casting? And IS_ERR() doesn't work with vmap(). >> >> Sure, will drop IS_ERR() here. >> > > [...] > > >>> + +static ssize_t dbm_show(struct device *dev, struct device_attribute *attr, char *buf) +{ + struct trbe_cpudata *cpudata = dev_get_drvdata(dev); + + return sprintf(buf, "%d\n", cpudata->trbe_dbm); +} +static DEVICE_ATTR_RO(dbm); >>> >>> What does "dbm" stand for? Looking at the documentation for TRBIDR_EL1.F, I >>> don't see what "dbm" relates to. >> >> I made it up to refer TRBIDR_EL1.F as "Dirty (and Access Flag) Bit >> Management". >> Could change it as "afdbm" to be more specific or if it is preferred. >> > > I don't see "afdbm" being a better solution - why not simply "flag"? Replaced all reference for "dbm" with "flag".
Re: [PATCH V3 11/14] coresight: sink: Add TRBE driver
On 2/13/21 1:56 AM, Mathieu Poirier wrote: > On Wed, Jan 27, 2021 at 02:25:35PM +0530, Anshuman Khandual wrote: >> Trace Buffer Extension (TRBE) implements a trace buffer per CPU which is >> accessible via the system registers. The TRBE supports different addressing >> modes including CPU virtual address and buffer modes including the circular >> buffer mode. The TRBE buffer is addressed by a base pointer (TRBBASER_EL1), >> an write pointer (TRBPTR_EL1) and a limit pointer (TRBLIMITR_EL1). But the >> access to the trace buffer could be prohibited by a higher exception level >> (EL3 or EL2), indicated by TRBIDR_EL1.P. The TRBE can also generate a CPU >> private interrupt (PPI) on address translation errors and when the buffer >> is full. Overall implementation here is inspired from the Arm SPE driver. >> >> Cc: Mathieu Poirier >> Cc: Mike Leach >> Cc: Suzuki K Poulose >> Signed-off-by: Anshuman Khandual >> --- >> Changes in V3: >> >> - Added new DT bindings document TRBE.yaml >> - Changed TRBLIMITR_TRIG_MODE_SHIFT from 2 to 3 >> - Dropped isb() from trbe_reset_local() >> - Dropped gap between (void *) and buf->trbe_base >> - Changed 'int' to 'unsigned int' in is_trbe_available() >> - Dropped unused function set_trbe_running(), set_trbe_virtual_mode(), >> set_trbe_enabled() and set_trbe_limit_pointer() >> - Changed get_trbe_flag_update(), is_trbe_programmable() and >> get_trbe_address_align() to accept TRBIDR value >> - Changed is_trbe_running(), is_trbe_abort(), is_trbe_wrap(), is_trbe_trg(), >> is_trbe_irq(), get_trbe_bsc() and get_trbe_ec() to accept TRBSR value >> - Dropped snapshot mode condition in arm_trbe_alloc_buffer() >> - Exit arm_trbe_init() when arm64_kernel_unmapped_at_el0() is enabled >> - Compute trbe_limit before trbe_write to get the updated handle >> - Added trbe_stop_and_truncate_event() >> - Dropped trbe_handle_fatal() >> >> Documentation/trace/coresight/coresight-trbe.rst | 39 + >> arch/arm64/include/asm/sysreg.h |1 + >> drivers/hwtracing/coresight/Kconfig | 11 + >> drivers/hwtracing/coresight/Makefile |1 + >> drivers/hwtracing/coresight/coresight-trbe.c | 1023 >> ++ >> drivers/hwtracing/coresight/coresight-trbe.h | 160 >> 6 files changed, 1235 insertions(+) >> create mode 100644 Documentation/trace/coresight/coresight-trbe.rst >> create mode 100644 drivers/hwtracing/coresight/coresight-trbe.c >> create mode 100644 drivers/hwtracing/coresight/coresight-trbe.h >> >> diff --git a/Documentation/trace/coresight/coresight-trbe.rst >> b/Documentation/trace/coresight/coresight-trbe.rst >> new file mode 100644 >> index 000..1cbb819 >> --- /dev/null >> +++ b/Documentation/trace/coresight/coresight-trbe.rst >> @@ -0,0 +1,39 @@ >> +.. SPDX-License-Identifier: GPL-2.0 >> + >> +== >> +Trace Buffer Extension (TRBE). >> +== >> + >> +:Author: Anshuman Khandual >> +:Date: November 2020 >> + >> +Hardware Description >> + >> + >> +Trace Buffer Extension (TRBE) is a percpu hardware which captures in system >> +memory, CPU traces generated from a corresponding percpu tracing unit. This >> +gets plugged in as a coresight sink device because the corresponding trace >> +genarators (ETE), are plugged in as source device. >> + >> +The TRBE is not compliant to CoreSight architecture specifications, but is >> +driven via the CoreSight driver framework to support the ETE (which is >> +CoreSight compliant) integration. 
>> + >> +Sysfs files and directories >> +--- >> + >> +The TRBE devices appear on the existing coresight bus alongside the other >> +coresight devices:: >> + >> +>$ ls /sys/bus/coresight/devices >> +trbe0 trbe1 trbe2 trbe3 >> + >> +The ``trbe`` named TRBEs are associated with a CPU.:: >> + >> +>$ ls /sys/bus/coresight/devices/trbe0/ >> +align dbm >> + >> +*Key file items are:-* >> + * ``align``: TRBE write pointer alignment >> + * ``dbm``: TRBE updates memory with access and dirty flags >> + >> diff --git a/arch/arm64/include/asm/sysreg.h >> b/arch/arm64/include/asm/sysreg.h >> index 85ae4db..9e2e9b7 100644 >> --- a/arch/arm64/include/asm/sysreg.h >> +++ b/arch/arm64/include/asm/sysreg.h >> @@ -97,6 +97,7 @@ >> #define SET_PSTATE_
Re: [PATCH v2 1/1] arm64: mm: correct the inside linear map boundaries during hotplug check
On 2/16/21 12:57 AM, Ard Biesheuvel wrote: > On Mon, 15 Feb 2021 at 20:22, Pavel Tatashin > wrote: >> >> Memory hotplug may fail on systems with CONFIG_RANDOMIZE_BASE because the >> linear map range is not checked correctly. >> >> The start physical address that linear map covers can be actually at the >> end of the range because of randomization. Check that and if so reduce it >> to 0. >> >> This can be verified on QEMU with setting kaslr-seed to ~0ul: >> >> memstart_offset_seed = 0x >> START: __pa(_PAGE_OFFSET(vabits_actual)) = 9000c000 >> END: __pa(PAGE_END - 1) = 1000bfff >> >> Signed-off-by: Pavel Tatashin >> Fixes: 58284a901b42 ("arm64/mm: Validate hotplug range before creating >> linear mapping") >> Tested-by: Tyler Hicks > >> --- >> arch/arm64/mm/mmu.c | 20 ++-- >> 1 file changed, 18 insertions(+), 2 deletions(-) >> >> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c >> index ae0c3d023824..cc16443ea67f 100644 >> --- a/arch/arm64/mm/mmu.c >> +++ b/arch/arm64/mm/mmu.c >> @@ -1444,14 +1444,30 @@ static void __remove_pgd_mapping(pgd_t *pgdir, >> unsigned long start, u64 size) >> >> static bool inside_linear_region(u64 start, u64 size) >> { >> + u64 start_linear_pa = __pa(_PAGE_OFFSET(vabits_actual)); >> + u64 end_linear_pa = __pa(PAGE_END - 1); >> + >> + if (IS_ENABLED(CONFIG_RANDOMIZE_BASE)) { >> + /* >> +* Check for a wrap, it is possible because of randomized >> linear >> +* mapping the start physical address is actually bigger than >> +* the end physical address. In this case set start to zero >> +* because [0, end_linear_pa] range must still be able to >> cover >> +* all addressable physical addresses. >> +*/ >> + if (start_linear_pa > end_linear_pa) >> + start_linear_pa = 0; >> + } >> + >> + WARN_ON(start_linear_pa > end_linear_pa); >> + >> /* >> * Linear mapping region is the range [PAGE_OFFSET..(PAGE_END - 1)] >> * accommodating both its ends but excluding PAGE_END. Max physical >> * range which can be mapped inside this linear mapping range, must >> * also be derived from its end points. >> */ >> - return start >= __pa(_PAGE_OFFSET(vabits_actual)) && >> - (start + size - 1) <= __pa(PAGE_END - 1); > > Can't we simply use signed arithmetic here? This expression works fine > if the quantities are all interpreted as s64 instead of u64 There is a new generic framework which expects the platform to provide two distinct range points (low and high) for hotplug address comparison. Those range points can be different depending on whether address randomization is enabled and the flip occurs. But this comparison here in the platform code is going away. This patch needs to rebased on the new framework which is part of linux-next. https://patchwork.kernel.org/project/linux-mm/list/?series=425051
Re: [PATCH v2 1/1] arm64: mm: correct the inside linear map boundaries during hotplug check
On 2/16/21 1:21 AM, Pavel Tatashin wrote: > On Mon, Feb 15, 2021 at 2:34 PM Ard Biesheuvel wrote: >> >> On Mon, 15 Feb 2021 at 20:30, Pavel Tatashin >> wrote: >>> Can't we simply use signed arithmetic here? This expression works fine if the quantities are all interpreted as s64 instead of u64 >>> >>> I was thinking about that, but I do not like the idea of using sign >>> arithmetics for physical addresses. Also, I am worried that someone in >>> the future will unknowingly change it to unsigns or to phys_addr_t. It >>> is safer to have start explicitly set to 0 in case of wrap. >> >> memstart_addr is already a s64 for this exact reason. > > memstart_addr is basically an offset and it must be negative. For > example, this would not work if it was not signed: > #define vmemmap ((struct page *)VMEMMAP_START - (memstart_addr >> PAGE_SHIFT)) > > However, on powerpc it is phys_addr_t type. > >> >> Btw, the KASLR check is incorrect: memstart_addr could also be >> negative when running the 52-bit VA kernel on hardware that is only >> 48-bit VA capable. > > Good point! > > if (IS_ENABLED(CONFIG_ARM64_VA_BITS_52) && (vabits_actual != 52)) > memstart_addr -= _PAGE_OFFSET(48) - _PAGE_OFFSET(52); > > So, I will remove IS_ENABLED(CONFIG_RANDOMIZE_BASE) again. > > I am OK to change start_linear_pa, end_linear_pa to signed, but IMO > what I have now is actually safer to make sure that does not break > again in the future. An explicit check for the flip over and providing two different start addresses points would be required in order to use the new framework.
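With that framework, the arm64 hook would look roughly like the sketch below, based on the hotplug range series mentioned earlier, so treat it as illustrative rather than final code:

struct range arch_get_mappable_range(void)
{
	u64 start_linear_pa = __pa(_PAGE_OFFSET(vabits_actual));
	u64 end_linear_pa = __pa(PAGE_END - 1);

	if (IS_ENABLED(CONFIG_RANDOMIZE_BASE)) {
		/* with KASLR the linear map start can wrap past its end */
		if (start_linear_pa > end_linear_pa)
			start_linear_pa = 0;
	}
	WARN_ON(start_linear_pa > end_linear_pa);

	return (struct range) {
		.start = start_linear_pa,
		.end = end_linear_pa,
	};
}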
Re: [PATCH 0/3] mm/page_alloc: Fix pageblock_order with HUGETLB_PAGE_SIZE_VARIABLE
On 2/12/21 3:09 PM, David Hildenbrand wrote: > On 12.02.21 08:02, Anshuman Khandual wrote: >> >> On 2/11/21 2:07 PM, David Hildenbrand wrote: >>> On 11.02.21 07:22, Anshuman Khandual wrote: >>>> The following warning gets triggered while trying to boot a 64K page size >>>> without THP config kernel on arm64 platform. >>>> >>>> WARNING: CPU: 5 PID: 124 at mm/vmstat.c:1080 >>>> __fragmentation_index+0xa4/0xc0 >>>> Modules linked in: >>>> CPU: 5 PID: 124 Comm: kswapd0 Not tainted 5.11.0-rc6-4-ga0ea7d62002 >>>> #159 >>>> Hardware name: linux,dummy-virt (DT) >>>> [ 8.810673] pstate: 2045 (nzCv daif +PAN -UAO -TCO BTYPE=--) >>>> [ 8.811732] pc : __fragmentation_index+0xa4/0xc0 >>>> [ 8.812555] lr : fragmentation_index+0xf8/0x138 >>>> [ 8.813360] sp : 864079b0 >>>> [ 8.813958] x29: 864079b0 x28: 0372 >>>> [ 8.814901] x27: 7682 x26: 8000135b3948 >>>> [ 8.815847] x25: 1fffe00010c80f48 x24: >>>> [ 8.816805] x23: x22: 000d >>>> [ 8.817764] x21: 0030 x20: 0005ffcb4d58 >>>> [ 8.818712] x19: 000b x18: >>>> [ 8.819656] x17: x16: >>>> [ 8.820613] x15: x14: 8000114c6258 >>>> [ 8.821560] x13: 6000bff969ba x12: 1fffe000bff969b9 >>>> [ 8.822514] x11: 1fffe000bff969b9 x10: 6000bff969b9 >>>> [ 8.823461] x9 : dfff8000 x8 : 0005ffcb4dcf >>>> [ 8.824415] x7 : 0001 x6 : 41b58ab3 >>>> [ 8.825359] x5 : 600010c80f48 x4 : dfff8000 >>>> [ 8.826313] x3 : 8000102be670 x2 : 0007 >>>> [ 8.827259] x1 : 86407a60 x0 : 000d >>>> [ 8.828218] Call trace: >>>> [ 8.828667] __fragmentation_index+0xa4/0xc0 >>>> [ 8.829436] fragmentation_index+0xf8/0x138 >>>> [ 8.830194] compaction_suitable+0x98/0xb8 >>>> [ 8.830934] wakeup_kcompactd+0xdc/0x128 >>>> [ 8.831640] balance_pgdat+0x71c/0x7a0 >>>> [ 8.832327] kswapd+0x31c/0x520 >>>> [ 8.832902] kthread+0x224/0x230 >>>> [ 8.833491] ret_from_fork+0x10/0x30 >>>> [ 8.834150] ---[ end trace 472836f79c15516b ]--- >>>> >>>> This warning comes from __fragmentation_index() when the requested order >>>> is greater than MAX_ORDER. >>>> >>>> static int __fragmentation_index(unsigned int order, >>>> struct contig_page_info *info) >>>> { >>>> unsigned long requested = 1UL << order; >>>> >>>> if (WARN_ON_ONCE(order >= MAX_ORDER)) <= Triggered here >>>> return 0; >>>> >>>> Digging it further reveals that pageblock_order has been assigned a value >>>> which is greater than MAX_ORDER failing the above check. But why this >>>> happened ? Because HUGETLB_PAGE_ORDER for the given config on arm64 is >>>> greater than MAX_ORDER. >>>> >>>> The solution involves enabling HUGETLB_PAGE_SIZE_VARIABLE which would make >>>> pageblock_order a variable instead of constant HUGETLB_PAGE_ORDER. But that >>>> change alone also did not really work as pageblock_order still got assigned >>>> as HUGETLB_PAGE_ORDER in set_pageblock_order(). HUGETLB_PAGE_ORDER needs to >>>> be less than MAX_ORDER for its appropriateness as pageblock_order otherwise >>>> just fallback to MAX_ORDER - 1 as before. While here it also fixes a build >>>> problem via type casting MAX_ORDER in rmem_cma_setup(). >>> >>> I'm wondering, is there any real value in allowing FORCE_MAX_ZONEORDER to >>> be "11" with ARM64_64K_PAGES/ARM64_16K_PAGES? >> >> MAX_ORDER should be as high as would be required for the current config. >> Unless THP is enabled, there is no need for it to be any higher than 11. >> But I might be missing historical reasons around this as well. Probably >> others from arm64 could help here. > > Theoretically yes, practically no. If nobody cares about a configuration, no > need to make the code more complicated for that configuration. 
> >> >>> >>> Meaning: are there any real use cases that actually build a kernel without >>> T
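As a rough sketch of the fallback described in the quoted cover letter, HUGETLB_PAGE_ORDER would only be accepted as pageblock_order when it stays below MAX_ORDER, otherwise the old MAX_ORDER - 1 default applies. This is illustrative, not the posted patch:

void __init set_pageblock_order(void)
{
	unsigned int order;

	/* Nothing to do if it has already been initialized */
	if (pageblock_order)
		return;

	/* HUGETLB_PAGE_ORDER is only usable if it fits below MAX_ORDER */
	if (HPAGE_SHIFT > PAGE_SHIFT && HUGETLB_PAGE_ORDER < MAX_ORDER)
		order = HUGETLB_PAGE_ORDER;
	else
		order = MAX_ORDER - 1;

	pageblock_order = order;
}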
Re: [PATCH V3 11/14] coresight: sink: Add TRBE driver
Hello Mike, On 2/16/21 2:30 PM, Mike Leach wrote: > Hi Anshuman, > > There have been plenty of detailed comments so I will restrict mine to > a few general issues:- > > 1) Currently there appears to be no sysfs support (I cannot see the > MODE_SYSFS constants running alongside the MODE_PERF ones present in > the other sink drivers). This is present on all other coresight > devices, and must be provided for this device. It is useful for > testing, and there are users out there who will have scripts to use > it. It is not essential it makes it into this set, but should be a > follow up set. Sure, will try and add it in a follow up series. > > 2) Using FILL mode for TRBE means that the trace will by definition be > lossy. Fill mode will halt collection without cleanly stopping and > flushing the source. This will result in the sink missing the last of > the data from the source as it stops. Even if taking the exception > moves into a prohibited region there is still the possibility the last > trace operations will not be seen. Further it is possible that the > last few bytes of trace will be an incomplete packet, and indeed the > start of the next buffer could contain incomplete packets too. Just wondering why TRBE and ETE would not sync with each other in order for the ETE to possibly resend all the lost trace data, when the TRBE runs out of buffer and wrappers around ? Is this ETE/TRBE behavior same for all implementations in the FILL mode ? Just wondering. > > This operation differs from the other sinks which will only halt after > the sources have stopped and the path has been flushed. This ensures > that the latest trace is complete. The weakness with the older sinks > is the lack of interrupt meaning buffers were frequently wrapped so > that only the latest trace is available. Right. > > By using TRBE WRAP mode, with a watermark as described in the TRBE > spec, using the interrupts it is possible to approach lossless trace > in a way that is not possible with earlier ETR/ETB. This is somethin Using TRBTRG_EL1 as the above mentioned watermark ? > that has been requested by partners since trace became available in > linux systems. (There is still a possibility of loss due to filling > the buffer completely and overflowing the watermark, but that can be > flagged). > > While FILL mode trace is a good start, and suitable for some scenarios > - WRAP mode needs implementing as well. I would like to understand this mechanism more. Besides how the perf interface suppose to choose between FILL and WRAP mode ? via a new event attribute ? > > 3) Padding: To be clear, it is not safe for the decoder to run off the > end of one buffer, into the padding area and continue decoding, or > continue through the padding into the next buffer. However I believe > the buffer start / stop points are demarked by the aux_output_start / > aux_output_end calls? Yes. > > With upcoming perf decode updates this should enable the decoder to > correctly be started and stopped on the buffer boundaries. The padding > is there primarily to ensure that the decoder does not synchronize > with the data stream until a genuine sync point is found. Right. > > 4) TRBE needs to be a loadable module like the rest of coresight. Even though the driver has all the module constructs, the Kconfig was missing a tristate value, which is being fixed for the next version. - Anshuman
Re: [PATCH V3 08/14] coresight: core: Add support for dedicated percpu sinks
On 1/28/21 2:46 PM, Suzuki K Poulose wrote: > On 1/27/21 8:55 AM, Anshuman Khandual wrote: >> Add support for dedicated sinks that are bound to individual CPUs. (e.g, >> TRBE). To allow quicker access to the sink for a given CPU bound source, >> keep a percpu array of the sink devices. Also, add support for building >> a path to the CPU local sink from the ETM. >> >> This adds a new percpu sink type CORESIGHT_DEV_SUBTYPE_SINK_PERCPU_SYSMEM. >> This new sink type is exclusively available and can only work with percpu >> source type device CORESIGHT_DEV_SUBTYPE_SOURCE_PERCPU_PROC. >> >> This defines a percpu structure that accommodates a single coresight_device >> which can be used to store an initialized instance from a sink driver. As >> these sinks are exclusively linked and dependent on corresponding percpu >> sources devices, they should also be the default sink device during a perf >> session. >> >> Outwards device connections are scanned while establishing paths between a >> source and a sink device. But such connections are not present for certain >> percpu source and sink devices which are exclusively linked and dependent. >> Build the path directly and skip connection scanning for such devices. >> >> Cc: Mathieu Poirier >> Cc: Mike Leach >> Cc: Suzuki K Poulose >> Signed-off-by: Anshuman Khandual >> --- >> Changes in V3: >> >> - Updated coresight_find_default_sink() >> >> drivers/hwtracing/coresight/coresight-core.c | 16 ++-- >> include/linux/coresight.h | 12 >> 2 files changed, 26 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/hwtracing/coresight/coresight-core.c >> b/drivers/hwtracing/coresight/coresight-core.c >> index 0062c89..4795e28 100644 >> --- a/drivers/hwtracing/coresight/coresight-core.c >> +++ b/drivers/hwtracing/coresight/coresight-core.c >> @@ -23,6 +23,7 @@ >> #include "coresight-priv.h" >> static DEFINE_MUTEX(coresight_mutex); >> +DEFINE_PER_CPU(struct coresight_device *, csdev_sink); >> /** >> * struct coresight_node - elements of a path, from source to sink >> @@ -784,6 +785,13 @@ static int _coresight_build_path(struct >> coresight_device *csdev, >> if (csdev == sink) >> goto out; >> + if (coresight_is_percpu_source(csdev) && >> coresight_is_percpu_sink(sink) && >> + sink == per_cpu(csdev_sink, source_ops(csdev)->cpu_id(csdev))) { >> + _coresight_build_path(sink, sink, path); >> + found = true; >> + goto out; >> + } >> + >> /* Not a sink - recursively explore each port found on this element */ >> for (i = 0; i < csdev->pdata->nr_outport; i++) { >> struct coresight_device *child_dev; >> @@ -999,8 +1007,12 @@ coresight_find_default_sink(struct coresight_device >> *csdev) >> int depth = 0; >> /* look for a default sink if we have not found for this device */ >> - if (!csdev->def_sink) >> - csdev->def_sink = coresight_find_sink(csdev, &depth); >> + if (!csdev->def_sink) { >> + if (coresight_is_percpu_source(csdev)) >> + csdev->def_sink = per_cpu(csdev_sink, >> source_ops(csdev)->cpu_id(csdev)); >> + if (!csdev->def_sink) >> + csdev->def_sink = coresight_find_sink(csdev, &depth); >> + } >> return csdev->def_sink; >> } >> diff --git a/include/linux/coresight.h b/include/linux/coresight.h >> index 976ec26..bc3a5ca 100644 >> --- a/include/linux/coresight.h >> +++ b/include/linux/coresight.h >> @@ -50,6 +50,7 @@ enum coresight_dev_subtype_sink { >> CORESIGHT_DEV_SUBTYPE_SINK_PORT, >> CORESIGHT_DEV_SUBTYPE_SINK_BUFFER, >> CORESIGHT_DEV_SUBTYPE_SINK_SYSMEM, >> + CORESIGHT_DEV_SUBTYPE_SINK_PERCPU_SYSMEM, >> }; >> enum coresight_dev_subtype_link { >> @@ -428,6 
+429,17 @@ static inline void csdev_access_write64(struct >> csdev_access *csa, u64 val, u32 o >> csa->write(val, offset, false, true); >> } >> +static inline bool coresight_is_percpu_source(struct coresight_device >> *csdev) >> +{ >> + return csdev && (csdev->type == CORESIGHT_DEV_TYPE_SOURCE) && >> + csdev->subtype.source_subtype == >> CORESIGHT_DEV_SUBTYPE_SOURCE_PROC; > > Please add () around the last line. Same below. Okay, will do. > >> +} >> + >> +static inline bool coresight_is_percpu_sink(struct coresight_device *csdev) >> +{ >> + return csdev && (csdev->type == CORESIGHT_DEV_TYPE_SINK) && >> + csdev->subtype.sink_subtype == >> CORESIGHT_DEV_SUBTYPE_SINK_PERCPU_SYSMEM; Okay, will add here as well. >> +} >> #else /* !CONFIG_64BIT */ >> static inline u64 csdev_access_relaxed_read64(struct csdev_access *csa, >> > > With the above : > > Tested-by: Suzuki K Poulose > Reviewed-by: Suzuki K Poulose
Re: [PATCH V3 08/14] coresight: core: Add support for dedicated percpu sinks
On 2/5/21 12:04 AM, Mathieu Poirier wrote: > On Thu, Jan 28, 2021 at 09:16:34AM +, Suzuki K Poulose wrote: >> On 1/27/21 8:55 AM, Anshuman Khandual wrote: >>> Add support for dedicated sinks that are bound to individual CPUs. (e.g, >>> TRBE). To allow quicker access to the sink for a given CPU bound source, >>> keep a percpu array of the sink devices. Also, add support for building >>> a path to the CPU local sink from the ETM. >>> >>> This adds a new percpu sink type CORESIGHT_DEV_SUBTYPE_SINK_PERCPU_SYSMEM. >>> This new sink type is exclusively available and can only work with percpu >>> source type device CORESIGHT_DEV_SUBTYPE_SOURCE_PERCPU_PROC. >>> >>> This defines a percpu structure that accommodates a single coresight_device >>> which can be used to store an initialized instance from a sink driver. As >>> these sinks are exclusively linked and dependent on corresponding percpu >>> sources devices, they should also be the default sink device during a perf >>> session. >>> >>> Outwards device connections are scanned while establishing paths between a >>> source and a sink device. But such connections are not present for certain >>> percpu source and sink devices which are exclusively linked and dependent. >>> Build the path directly and skip connection scanning for such devices. >>> >>> Cc: Mathieu Poirier >>> Cc: Mike Leach >>> Cc: Suzuki K Poulose >>> Signed-off-by: Anshuman Khandual >>> --- >>> Changes in V3: >>> >>> - Updated coresight_find_default_sink() >>> >>> drivers/hwtracing/coresight/coresight-core.c | 16 ++-- >>> include/linux/coresight.h| 12 >>> 2 files changed, 26 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/hwtracing/coresight/coresight-core.c >>> b/drivers/hwtracing/coresight/coresight-core.c >>> index 0062c89..4795e28 100644 >>> --- a/drivers/hwtracing/coresight/coresight-core.c >>> +++ b/drivers/hwtracing/coresight/coresight-core.c >>> @@ -23,6 +23,7 @@ >>> #include "coresight-priv.h" >>> static DEFINE_MUTEX(coresight_mutex); >>> +DEFINE_PER_CPU(struct coresight_device *, csdev_sink); >>> /** >>>* struct coresight_node - elements of a path, from source to sink >>> @@ -784,6 +785,13 @@ static int _coresight_build_path(struct >>> coresight_device *csdev, >>> if (csdev == sink) >>> goto out; >>> + if (coresight_is_percpu_source(csdev) && coresight_is_percpu_sink(sink) >>> && >>> + sink == per_cpu(csdev_sink, source_ops(csdev)->cpu_id(csdev))) { >>> + _coresight_build_path(sink, sink, path); > > The return value for _coresight_build_path() needs to be checked. Otherwise a > failure to allocate a node for the sink will go unoticed and make for a very > hard problem to debug. How about this instead ? 
diff --git a/drivers/hwtracing/coresight/coresight-core.c b/drivers/hwtracing/coresight/coresight-core.c index 4795e28..e93e669 100644 --- a/drivers/hwtracing/coresight/coresight-core.c +++ b/drivers/hwtracing/coresight/coresight-core.c @@ -787,9 +787,10 @@ static int _coresight_build_path(struct coresight_device *csdev, if (coresight_is_percpu_source(csdev) && coresight_is_percpu_sink(sink) && sink == per_cpu(csdev_sink, source_ops(csdev)->cpu_id(csdev))) { - _coresight_build_path(sink, sink, path); - found = true; - goto out; + if (_coresight_build_path(sink, sink, path) == 0) { + found = true; + goto out; + } } /* Not a sink - recursively explore each port found on this element */ > >>> + found = true; >>> + goto out; >>> + } >>> + >>> /* Not a sink - recursively explore each port found on this element */ >>> for (i = 0; i < csdev->pdata->nr_outport; i++) { >>> struct coresight_device *child_dev; >>> @@ -999,8 +1007,12 @@ coresight_find_default_sink(struct coresight_device >>> *csdev) >>> int depth = 0; >>> /* look for a default sink if we have not found for this device */ >>> - if (!csdev->def_sink) >>> - csdev->def_sink = coresight_find_sink(csdev, &depth); >>> + if (!csdev->def_sink) { >>> + if (coresight_is_percpu_source(csdev)) >>> +
Re: [RFC/RFT PATCH 3/3] arm64: drop pfn_valid_within() and simplify pfn_valid()
On 4/7/21 10:56 PM, Mike Rapoport wrote: > From: Mike Rapoport > > The arm64's version of pfn_valid() differs from the generic because of two > reasons: > > * Parts of the memory map are freed during boot. This makes it necessary to > verify that there is actual physical memory that corresponds to a pfn > which is done by querying memblock. > > * There are NOMAP memory regions. These regions are not mapped in the > linear map and until the previous commit the struct pages representing > these areas had default values. > > As the consequence of absence of the special treatment of NOMAP regions in > the memory map it was necessary to use memblock_is_map_memory() in > pfn_valid() and to have pfn_valid_within() aliased to pfn_valid() so that > generic mm functionality would not treat a NOMAP page as a normal page. > > Since the NOMAP regions are now marked as PageReserved(), pfn walkers and > the rest of core mm will treat them as unusable memory and thus > pfn_valid_within() is no longer required at all and can be disabled by > removing CONFIG_HOLES_IN_ZONE on arm64. But what about the memory map that are freed during boot (mentioned above). Would not they still cause CONFIG_HOLES_IN_ZONE to be applicable and hence pfn_valid_within() ? > > pfn_valid() can be slightly simplified by replacing > memblock_is_map_memory() with memblock_is_memory(). Just to understand this better, pfn_valid() will now return true for all MEMBLOCK_NOMAP based memory but that is okay as core MM would still ignore them as unusable memory for being PageReserved(). > > Signed-off-by: Mike Rapoport > --- > arch/arm64/Kconfig | 3 --- > arch/arm64/mm/init.c | 4 ++-- > 2 files changed, 2 insertions(+), 5 deletions(-) > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig > index e4e1b6550115..58e439046d05 100644 > --- a/arch/arm64/Kconfig > +++ b/arch/arm64/Kconfig > @@ -1040,9 +1040,6 @@ config NEED_PER_CPU_EMBED_FIRST_CHUNK > def_bool y > depends on NUMA > > -config HOLES_IN_ZONE > - def_bool y > - > source "kernel/Kconfig.hz" > > config ARCH_SPARSEMEM_ENABLE > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c > index 258b1905ed4a..bb6dd406b1f0 100644 > --- a/arch/arm64/mm/init.c > +++ b/arch/arm64/mm/init.c > @@ -243,7 +243,7 @@ int pfn_valid(unsigned long pfn) > > /* >* ZONE_DEVICE memory does not have the memblock entries. > - * memblock_is_map_memory() check for ZONE_DEVICE based > + * memblock_is_memory() check for ZONE_DEVICE based >* addresses will always fail. Even the normal hotplugged >* memory will never have MEMBLOCK_NOMAP flag set in their >* memblock entries. Skip memblock search for all non early > @@ -254,7 +254,7 @@ int pfn_valid(unsigned long pfn) > return pfn_section_valid(ms, pfn); > } > #endif > - return memblock_is_map_memory(addr); > + return memblock_is_memory(addr); > } > EXPORT_SYMBOL(pfn_valid); > >
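For context, dropping CONFIG_HOLES_IN_ZONE has the claimed effect because of the generic fallback in include/linux/mmzone.h, which (quoted from memory, so treat as illustrative) reads roughly:

#ifdef CONFIG_HOLES_IN_ZONE
#define pfn_valid_within(pfn)	pfn_valid(pfn)
#else
#define pfn_valid_within(pfn)	(1)
#endif

With the config option gone, every pfn_valid_within() call in core mm collapses to a constant 1 and the per-pfn memblock lookups on arm64 disappear from those paths.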
Re: [RFC/RFT PATCH 2/3] arm64: decouple check whether pfn is normal memory from pfn_valid()
On 4/7/21 10:56 PM, Mike Rapoport wrote: > From: Mike Rapoport > > The intended semantics of pfn_valid() is to verify whether there is a > struct page for the pfn in question and nothing else. Should there be a comment affirming this semantics interpretation, above the generic pfn_valid() in include/linux/mmzone.h ? > > Yet, on arm64 it is used to distinguish memory areas that are mapped in the > linear map vs those that require ioremap() to access them. > > Introduce a dedicated pfn_is_memory() to perform such check and use it > where appropriate. > > Signed-off-by: Mike Rapoport > --- > arch/arm64/include/asm/memory.h | 2 +- > arch/arm64/include/asm/page.h | 1 + > arch/arm64/kvm/mmu.c| 2 +- > arch/arm64/mm/init.c| 6 ++ > arch/arm64/mm/ioremap.c | 4 ++-- > arch/arm64/mm/mmu.c | 2 +- > 6 files changed, 12 insertions(+), 5 deletions(-) > > diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h > index 0aabc3be9a75..7e77fdf71b9d 100644 > --- a/arch/arm64/include/asm/memory.h > +++ b/arch/arm64/include/asm/memory.h > @@ -351,7 +351,7 @@ static inline void *phys_to_virt(phys_addr_t x) > > #define virt_addr_valid(addr)({ > \ > __typeof__(addr) __addr = __tag_reset(addr);\ > - __is_lm_address(__addr) && pfn_valid(virt_to_pfn(__addr)); \ > + __is_lm_address(__addr) && pfn_is_memory(virt_to_pfn(__addr)); \ > }) > > void dump_mem_limit(void); > diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h > index 012cffc574e8..32b485bcc6ff 100644 > --- a/arch/arm64/include/asm/page.h > +++ b/arch/arm64/include/asm/page.h > @@ -38,6 +38,7 @@ void copy_highpage(struct page *to, struct page *from); > typedef struct page *pgtable_t; > > extern int pfn_valid(unsigned long); > +extern int pfn_is_memory(unsigned long); > > #include > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c > index 8711894db8c2..ad2ea65a3937 100644 > --- a/arch/arm64/kvm/mmu.c > +++ b/arch/arm64/kvm/mmu.c > @@ -85,7 +85,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm) > > static bool kvm_is_device_pfn(unsigned long pfn) > { > - return !pfn_valid(pfn); > + return !pfn_is_memory(pfn); > } > > /* > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c > index 3685e12aba9b..258b1905ed4a 100644 > --- a/arch/arm64/mm/init.c > +++ b/arch/arm64/mm/init.c > @@ -258,6 +258,12 @@ int pfn_valid(unsigned long pfn) > } > EXPORT_SYMBOL(pfn_valid); > > +int pfn_is_memory(unsigned long pfn) > +{ > + return memblock_is_map_memory(PFN_PHYS(pfn)); > +} > +EXPORT_SYMBOL(pfn_is_memory);> + Should not this be generic though ? There is nothing platform or arm64 specific in here. Wondering as pfn_is_memory() just indicates that the pfn is linear mapped, should not it be renamed as pfn_is_linear_memory() instead ? Regardless, it's fine either way. > static phys_addr_t memory_limit = PHYS_ADDR_MAX; > > /* > diff --git a/arch/arm64/mm/ioremap.c b/arch/arm64/mm/ioremap.c > index b5e83c46b23e..82a369b22ef5 100644 > --- a/arch/arm64/mm/ioremap.c > +++ b/arch/arm64/mm/ioremap.c > @@ -43,7 +43,7 @@ static void __iomem *__ioremap_caller(phys_addr_t > phys_addr, size_t size, > /* >* Don't allow RAM to be mapped. >*/ > - if (WARN_ON(pfn_valid(__phys_to_pfn(phys_addr > + if (WARN_ON(pfn_is_memory(__phys_to_pfn(phys_addr > return NULL; > > area = get_vm_area_caller(size, VM_IOREMAP, caller); > @@ -84,7 +84,7 @@ EXPORT_SYMBOL(iounmap); > void __iomem *ioremap_cache(phys_addr_t phys_addr, size_t size) > { > /* For normal memory we already have a cacheable mapping. 
*/ > - if (pfn_valid(__phys_to_pfn(phys_addr))) > + if (pfn_is_memory(__phys_to_pfn(phys_addr))) > return (void __iomem *)__phys_to_virt(phys_addr); > > return __ioremap_caller(phys_addr, size, __pgprot(PROT_NORMAL), > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c > index 5d9550fdb9cf..038d20fe163f 100644 > --- a/arch/arm64/mm/mmu.c > +++ b/arch/arm64/mm/mmu.c > @@ -81,7 +81,7 @@ void set_swapper_pgd(pgd_t *pgdp, pgd_t pgd) > pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn, > unsigned long size, pgprot_t vma_prot) > { > - if (!pfn_valid(pfn)) > + if (!pfn_is_memory(pfn)) > return pgprot_noncached(vma_prot); > else if (file->f_flags & O_SYNC) > return pgprot_writecombine(vma_prot); >
Re: [RFC/RFT PATCH 1/3] memblock: update initialization of reserved pages
On 4/7/21 10:56 PM, Mike Rapoport wrote: > From: Mike Rapoport > > The struct pages representing a reserved memory region are initialized > using reserve_bootmem_range() function. This function is called for each > reserved region just before the memory is freed from memblock to the buddy > page allocator. > > The struct pages for MEMBLOCK_NOMAP regions are kept with the default > values set by the memory map initialization which makes it necessary to > have a special treatment for such pages in pfn_valid() and > pfn_valid_within(). > > Split out initialization of the reserved pages to a function with a > meaningful name and treat the MEMBLOCK_NOMAP regions the same way as the > reserved regions and mark struct pages for the NOMAP regions as > PageReserved. This would definitely need updating the comment for MEMBLOCK_NOMAP definition in include/linux/memblock.h just to make the semantics is clear, though arm64 is currently the only user for MEMBLOCK_NOMAP. > > Signed-off-by: Mike Rapoport > --- > mm/memblock.c | 23 +-- > 1 file changed, 21 insertions(+), 2 deletions(-) > > diff --git a/mm/memblock.c b/mm/memblock.c > index afaefa8fc6ab..6b7ea9d86310 100644 > --- a/mm/memblock.c > +++ b/mm/memblock.c > @@ -2002,6 +2002,26 @@ static unsigned long __init > __free_memory_core(phys_addr_t start, > return end_pfn - start_pfn; > } > > +static void __init memmap_init_reserved_pages(void) > +{ > + struct memblock_region *region; > + phys_addr_t start, end; > + u64 i; > + > + /* initialize struct pages for the reserved regions */ > + for_each_reserved_mem_range(i, &start, &end) > + reserve_bootmem_region(start, end); > + > + /* and also treat struct pages for the NOMAP regions as PageReserved */ > + for_each_mem_region(region) { > + if (memblock_is_nomap(region)) { > + start = region->base; > + end = start + region->size; > + reserve_bootmem_region(start, end); > + } > + } > +} > + > static unsigned long __init free_low_memory_core_early(void) > { > unsigned long count = 0; > @@ -2010,8 +2030,7 @@ static unsigned long __init > free_low_memory_core_early(void) > > memblock_clear_hotplug(0, -1); > > - for_each_reserved_mem_range(i, &start, &end) > - reserve_bootmem_region(start, end); > + memmap_init_reserved_pages(); > > /* >* We need to use NUMA_NO_NODE instead of NODE_DATA(0)->node_id >
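A possible shape for the MEMBLOCK_NOMAP comment update being asked for here; the wording is invented for illustration and is not proposed text:

/*
 * MEMBLOCK_NOMAP: the region is not added to the kernel linear map.
 * Its struct pages are still initialized by the core and are marked
 * PageReserved(), so generic pfn walkers treat the range as unusable
 * memory instead of relying on pfn_valid_within()/memblock checks.
 */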
Re: [RFC/RFT PATCH 0/3] arm64: drop pfn_valid_within() and simplify pfn_valid()
Adding James here. + James Morse On 4/7/21 10:56 PM, Mike Rapoport wrote: > From: Mike Rapoport > > Hi, > > These patches aim to remove CONFIG_HOLES_IN_ZONE and essentially hardwire > pfn_valid_within() to 1. That would be really great for arm64 platform as it will save CPU cycles on many generic MM paths, given that our pfn_valid() has been expensive. > > The idea is to mark NOMAP pages as reserved in the memory map and restore Though I am not really sure, would that possibly be problematic for UEFI/EFI use cases as it might have just treated them as normal struct pages till now. > the intended semantics of pfn_valid() to designate availability of struct > page for a pfn. Right, that would be better as the current semantics is not ideal. > > With this the core mm will be able to cope with the fact that it cannot use > NOMAP pages and the holes created by NOMAP ranges within MAX_ORDER blocks > will be treated correctly even without the need for pfn_valid_within. > > The patches are only boot tested on qemu-system-aarch64 so I'd really > appreciate memory stress tests on real hardware. Did some preliminary memory stress tests on a guest with portions of memory marked as MEMBLOCK_NOMAP and did not find any obvious problem. But this might require some testing on real UEFI environment with firmware using MEMBLOCK_NOMAP memory to make sure that changing these struct pages to PageReserved() is safe. > > If this actually works we'll be one step closer to drop custom pfn_valid() > on arm64 altogether. Right, planning to rework and respin the RFC originally sent last month. https://patchwork.kernel.org/project/linux-mm/patch/1615174073-10520-1-git-send-email-anshuman.khand...@arm.com/
Re: [PATCH -next v2 2/2] mm/debug_vm_pgtable: Remove redundant pfn_{pmd/pte}() and fix one comment mistake
On 4/6/21 10:19 AM, Shixin Liu wrote: > v1->v2: > Remove redundant pfn_pte() and fold two patch to one. Change log should always be after the '---' below the SOB statement for git am to ignore them. Please avoid adding them in the commit messages. > > Remove redundant pfn_{pmd/pte}() and fix one comment mistake. > > Signed-off-by: Shixin Liu > --- > mm/debug_vm_pgtable.c | 11 ++- > 1 file changed, 6 insertions(+), 5 deletions(-) > > diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c > index d3cf178621d9..e2f35db8ba69 100644 > --- a/mm/debug_vm_pgtable.c > +++ b/mm/debug_vm_pgtable.c > @@ -91,7 +91,7 @@ static void __init pte_advanced_tests(struct mm_struct *mm, > unsigned long pfn, unsigned long vaddr, > pgprot_t prot) > { > - pte_t pte = pfn_pte(pfn, prot); > + pte_t pte; > Right. > /* >* Architectures optimize set_pte_at by avoiding TLB flush. > @@ -185,7 +185,7 @@ static void __init pmd_advanced_tests(struct mm_struct > *mm, > unsigned long pfn, unsigned long vaddr, > pgprot_t prot, pgtable_t pgtable) > { > - pmd_t pmd = pfn_pmd(pfn, prot); > + pmd_t pmd; Right. > > if (!has_transparent_hugepage()) > return; > @@ -300,7 +300,7 @@ static void __init pud_advanced_tests(struct mm_struct > *mm, > unsigned long pfn, unsigned long vaddr, > pgprot_t prot) > { > - pud_t pud = pfn_pud(pfn, prot); > + pud_t pud; > > if (!has_transparent_hugepage()) > return; > @@ -309,6 +309,7 @@ static void __init pud_advanced_tests(struct mm_struct > *mm, > /* Align the address wrt HPAGE_PUD_SIZE */ > vaddr = (vaddr & HPAGE_PUD_MASK) + HPAGE_PUD_SIZE; > > + pud = pfn_pud(pfn, prot); Is this change intended to make pud_advanced_tests() similar other advanced tests ? Please update the commit message as well. > set_pud_at(mm, vaddr, pudp, pud); > pudp_set_wrprotect(mm, vaddr, pudp); > pud = READ_ONCE(*pudp); > @@ -742,12 +743,12 @@ static void __init pmd_swap_soft_dirty_tests(unsigned > long pfn, pgprot_t prot) > WARN_ON(!pmd_swp_soft_dirty(pmd_swp_mksoft_dirty(pmd))); > WARN_ON(pmd_swp_soft_dirty(pmd_swp_clear_soft_dirty(pmd))); > } > -#else /* !CONFIG_ARCH_HAS_PTE_DEVMAP */ > +#else /* !CONFIG_TRANSPARENT_HUGEPAGE */ > static void __init pmd_soft_dirty_tests(unsigned long pfn, pgprot_t prot) { } > static void __init pmd_swap_soft_dirty_tests(unsigned long pfn, pgprot_t > prot) > { > } > -#endif /* CONFIG_ARCH_HAS_PTE_DEVMAP */ > +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > > static void __init pte_swap_tests(unsigned long pfn, pgprot_t prot) > { > With changes to the commit message as suggested earlier. Reviewed-by: Anshuman Khandual
Re: [PATCH -next v2 1/2] mm/debug_vm_pgtable: Move {pmd/pud}_huge_tests out of CONFIG_TRANSPARENT_HUGEPAGE
PMD is not a populated non-leaf entry. > + */ > + WRITE_ONCE(*pmdp, __pmd(0)); > + WARN_ON(!pmd_set_huge(pmdp, __pfn_to_phys(pfn), prot)); > + WARN_ON(!pmd_clear_huge(pmdp)); > + pmd = READ_ONCE(*pmdp); > + WARN_ON(!pmd_none(pmd)); > } > + > static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t > prot) > { > + pud_t pud; > + > + if (!arch_vmap_pud_supported(prot)) > + return; > + > + pr_debug("Validating PUD huge\n"); > + /* > + * X86 defined pud_set_huge() verifies that the given > + * PUD is not a populated non-leaf entry. > + */ > + WRITE_ONCE(*pudp, __pud(0)); > + WARN_ON(!pud_set_huge(pudp, __pfn_to_phys(pfn), prot)); > + WARN_ON(!pud_clear_huge(pudp)); > + pud = READ_ONCE(*pudp); > + WARN_ON(!pud_none(pud)); > } > -static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot) { } > -#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > +#else /* !CONFIG_HAVE_ARCH_HUGE_VMAP */ > +static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t > prot) { } > +static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t > prot) { } > +#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */ > > static void __init p4d_basic_tests(unsigned long pfn, pgprot_t prot) > { > With changes to the commit message as suggested earlier. Reviewed-by: Anshuman Khandual
[PATCH] mm/page_alloc: Ensure that HUGETLB_PAGE_ORDER is less than MAX_ORDER
pageblock_order must always be less than MAX_ORDER, otherwise it might lead to an warning during boot. A similar problem got fixed on arm64 platform with the commit 79cc2ed5a716 ("arm64/mm: Drop THP conditionality from FORCE_MAX_ZONEORDER"). Assert the above condition before HUGETLB_PAGE_ORDER gets assigned as pageblock_order. This will help detect the problem earlier on platforms where HUGETLB_PAGE_SIZE_VARIABLE is enabled. Cc: Andrew Morton Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Anshuman Khandual --- mm/page_alloc.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 604dcd69397b..81b7460e1228 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -7068,10 +7068,17 @@ void __init set_pageblock_order(void) if (pageblock_order) return; - if (HPAGE_SHIFT > PAGE_SHIFT) + if (HPAGE_SHIFT > PAGE_SHIFT) { + /* +* pageblock_order must always be less than +* MAX_ORDER. So does HUGETLB_PAGE_ORDER if +* that is being assigned here. +*/ + WARN_ON(HUGETLB_PAGE_ORDER >= MAX_ORDER); order = HUGETLB_PAGE_ORDER; - else + } else { order = MAX_ORDER - 1; + } /* * Assume the largest contiguous order of interest is a huge page. -- 2.20.1
Re: [PATCH] mm/page_alloc: Ensure that HUGETLB_PAGE_ORDER is less than MAX_ORDER
On 4/9/21 1:54 PM, David Hildenbrand wrote: > On 09.04.21 07:55, Anshuman Khandual wrote: >> pageblock_order must always be less than MAX_ORDER, otherwise it might lead >> to an warning during boot. A similar problem got fixed on arm64 platform >> with the commit 79cc2ed5a716 ("arm64/mm: Drop THP conditionality from >> FORCE_MAX_ZONEORDER"). Assert the above condition before HUGETLB_PAGE_ORDER >> gets assigned as pageblock_order. This will help detect the problem earlier >> on platforms where HUGETLB_PAGE_SIZE_VARIABLE is enabled. >> >> Cc: Andrew Morton >> Cc: linux...@kvack.org >> Cc: linux-kernel@vger.kernel.org >> Signed-off-by: Anshuman Khandual >> --- >> mm/page_alloc.c | 11 +-- >> 1 file changed, 9 insertions(+), 2 deletions(-) >> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >> index 604dcd69397b..81b7460e1228 100644 >> --- a/mm/page_alloc.c >> +++ b/mm/page_alloc.c >> @@ -7068,10 +7068,17 @@ void __init set_pageblock_order(void) >> if (pageblock_order) >> return; >> - if (HPAGE_SHIFT > PAGE_SHIFT) >> + if (HPAGE_SHIFT > PAGE_SHIFT) { >> + /* >> + * pageblock_order must always be less than >> + * MAX_ORDER. So does HUGETLB_PAGE_ORDER if >> + * that is being assigned here. >> + */ >> + WARN_ON(HUGETLB_PAGE_ORDER >= MAX_ORDER); > > Can't that be a BUILD_BUG_ON() ? Yes, it can be. Probably might be appropriate as well, given that both the arguments here are compile time constants. Okay, will change. > >> order = HUGETLB_PAGE_ORDER; >> - else >> + } else { >> order = MAX_ORDER - 1; >> + } >> /* >> * Assume the largest contiguous order of interest is a huge page. >> > >
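A minimal sketch of the BUILD_BUG_ON() variant agreed on above, assuming both HUGETLB_PAGE_ORDER and MAX_ORDER really are compile-time constants for the configurations that reach this path:

	if (HPAGE_SHIFT > PAGE_SHIFT) {
		/*
		 * pageblock_order must always be less than MAX_ORDER,
		 * so catch a bad HUGETLB_PAGE_ORDER at build time.
		 */
		BUILD_BUG_ON(HUGETLB_PAGE_ORDER >= MAX_ORDER);
		order = HUGETLB_PAGE_ORDER;
	} else {
		order = MAX_ORDER - 1;
	}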
Re: [PATCH -next] coresight: core: Make symbol 'csdev_sink' static
On 4/9/21 7:02 AM, Wei Yongjun wrote: > The sparse tool complains as follows: > > drivers/hwtracing/coresight/coresight-core.c:26:1: warning: > symbol '__pcpu_scope_csdev_sink' was not declared. Should it be static? > > This symbol is not used outside of coresight-core.c, so this > commit marks it static. commit ? It is not on the tree yet. s/commit/change instead. > > Fixes: 2cd87a7b293d ("coresight: core: Add support for dedicated percpu > sinks") There is no functional problem that this patch is proposing to fix and hence the "Fixes:" tag is not warranted. Suzuki/Matihieu ? > Reported-by: Hulk Robot > Signed-off-by: Wei Yongjun > --- > drivers/hwtracing/coresight/coresight-core.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/hwtracing/coresight/coresight-core.c > b/drivers/hwtracing/coresight/coresight-core.c > index 3e779e1619ed..6c68d34d956e 100644 > --- a/drivers/hwtracing/coresight/coresight-core.c > +++ b/drivers/hwtracing/coresight/coresight-core.c > @@ -23,7 +23,7 @@ > #include "coresight-priv.h" > > static DEFINE_MUTEX(coresight_mutex); > -DEFINE_PER_CPU(struct coresight_device *, csdev_sink); > +static DEFINE_PER_CPU(struct coresight_device *, csdev_sink); > > /** > * struct coresight_node - elements of a path, from source to sink > Otherwise LGTM. As csdev_sink is not being used outside coresight-core.c file after the introduction of coresight_[set|get]_percpu_sink() helpers.
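The coresight_[set|get]_percpu_sink() helpers referred to above would presumably be thin accessors around the now-static per-cpu variable; a sketch along these lines, with the exact signatures assumed rather than taken from the posted series:

void coresight_set_percpu_sink(int cpu, struct coresight_device *csdev)
{
	per_cpu(csdev_sink, cpu) = csdev;
}
EXPORT_SYMBOL_GPL(coresight_set_percpu_sink);

struct coresight_device *coresight_get_percpu_sink(int cpu)
{
	return per_cpu(csdev_sink, cpu);
}
EXPORT_SYMBOL_GPL(coresight_get_percpu_sink);

A per-CPU sink driver such as TRBE would then publish its device from probe via the setter, and the core would use the getter when building paths and picking the default sink.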
Re: [V6,1/9] elf: Add new powerpc specific core note sections
On 12/20/2014 12:58 AM, Edjunior Barbosa Machado wrote: > On 12/08/2014 08:08 AM, Anshuman Khandual wrote: >> On 12/03/2014 12:18 PM, Anshuman Khandual wrote: >>> On 12/03/2014 10:52 AM, Michael Ellerman wrote: >>>> On Tue, 2014-02-12 at 07:56:45 UTC, Anshuman Khandual wrote: >>>>> This patch adds four new ELF core note sections for powerpc >>>>> transactional memory and one new ELF core note section for >>>>> powerpc general miscellaneous debug registers. These addition >>>>> of new ELF core note sections extends the existing ELF ABI >>>>> without affecting it in any manner. >>>>> >>>>> Acked-by: Andrew Morton >>>>> Signed-off-by: Anshuman Khandual >>>>> --- >>>>> include/uapi/linux/elf.h | 5 + >>>>> 1 file changed, 5 insertions(+) >>>>> >>>>> diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h >>>>> index ea9bf25..2260fc0 100644 >>>>> --- a/include/uapi/linux/elf.h >>>>> +++ b/include/uapi/linux/elf.h >>>>> @@ -379,6 +379,11 @@ typedef struct elf64_shdr { >>>>> #define NT_PPC_VMX 0x100 /* PowerPC Altivec/VMX >>>>> registers */ >>>>> #define NT_PPC_SPE 0x101 /* PowerPC SPE/EVR registers */ >>>>> #define NT_PPC_VSX 0x102 /* PowerPC VSX registers */ >>>>> +#define NT_PPC_TM_SPR0x103 /* PowerPC TM special registers >>>>> */ >>>>> +#define NT_PPC_TM_CGPR 0x104 /* PowerpC TM checkpointed GPR >>>>> */ >>>>> +#define NT_PPC_TM_CFPR 0x105 /* PowerPC TM checkpointed FPR >>>>> */ >>>>> +#define NT_PPC_TM_CVMX 0x106 /* PowerPC TM checkpointed VMX >>>>> */ >>>>> +#define NT_PPC_MISC 0x107 /* PowerPC miscellaneous >>>>> registers */ >>>> >>>> This is a really terrible name, "MISC". >>>> >>>> Having said that, I guess it's accurate. We have a whole bunch of regs that >>>> have accrued over recent years that aren't accessible via ptrace. >>>> >>>> It seems to me if we're adding a misc regset we should be adding >>>> everything we >>>> might want to it that is currenty architected. >>> >>> But I believe they also need to be part of the thread_struct structure to be >>> accessible from ptrace. >> >> Currently we dont context save/restore the PMC count registers (PMC1-PMC6) >> during the process context switch. So the values of PMC1..PMC6 are not >> thread specific in the structure. To be able to access them in ptrace >> when the tracee has stopped, we need to context save these counters >> in the thread struct. Shall we do that ? Then we can add them to the >> MISC regset bucket irrespective of whats the value we get in there when >> we probe through ptrace. >> >> The same goes for MMCRA, CFAR registers as well. >> >>> >>>> >>>> But currently you only include the PPR, TAR & DSCR. >>> >>> Yeah, thats what we started with. >>> >>>> >>>> Looking at Power ISA v2.07, I see the following that could be included: >>>> >>>> MMCR2 >>>> MMCRA >>>> PMC1 >>>> PMC2 >>>> PMC3 >>>> PMC4 >>>> PMC5 >>>> PMC6 >>>> MMCR0 >>>> EBBHR >>>> EBBRR >>>> BESCR >>>> SIAR >>>> SDAR >>>> CFAR? >>> >>> MMCRA, PMC[1..6], EBBHR, BESCR, EBBRR, CFAR are not part of the thread >>> struct. >> >> Sorry. EBBRR, EBBHR, BESCR registers are part of the thread struct. >> >>> >>>> >>>> Those are all new in 2.07 except for CFAR. >>>> >>>> There might be more I missed, that was just a quick scan. >>>> >>>> Some are only accessible when EBB is in use, maybe those could be a >>>> separate >>>> regset. >>> >>> Yeah we can have one more regset for EBB specific registers. >> >> Should the new EBB specific regset include only EBBRR, EBBHR, BESCR registers >> or should it also include SIAR, SDAR, SIER, MMCR0, MMCR2 registers as well. 
I >> was thinking about putting these five registers into the MISC bucket instead. >> But
Re: [V6,1/9] elf: Add new powerpc specific core note sections
On 12/03/2014 12:18 PM, Anshuman Khandual wrote: > On 12/03/2014 10:52 AM, Michael Ellerman wrote: >> On Tue, 2014-02-12 at 07:56:45 UTC, Anshuman Khandual wrote: >>> This patch adds four new ELF core note sections for powerpc >>> transactional memory and one new ELF core note section for >>> powerpc general miscellaneous debug registers. These addition >>> of new ELF core note sections extends the existing ELF ABI >>> without affecting it in any manner. >>> >>> Acked-by: Andrew Morton >>> Signed-off-by: Anshuman Khandual >>> --- >>> include/uapi/linux/elf.h | 5 + >>> 1 file changed, 5 insertions(+) >>> >>> diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h >>> index ea9bf25..2260fc0 100644 >>> --- a/include/uapi/linux/elf.h >>> +++ b/include/uapi/linux/elf.h >>> @@ -379,6 +379,11 @@ typedef struct elf64_shdr { >>> #define NT_PPC_VMX 0x100 /* PowerPC Altivec/VMX registers */ >>> #define NT_PPC_SPE 0x101 /* PowerPC SPE/EVR registers */ >>> #define NT_PPC_VSX 0x102 /* PowerPC VSX registers */ >>> +#define NT_PPC_TM_SPR 0x103 /* PowerPC TM special registers >>> */ >>> +#define NT_PPC_TM_CGPR 0x104 /* PowerpC TM checkpointed GPR >>> */ >>> +#define NT_PPC_TM_CFPR 0x105 /* PowerPC TM checkpointed FPR >>> */ >>> +#define NT_PPC_TM_CVMX 0x106 /* PowerPC TM checkpointed VMX >>> */ >>> +#define NT_PPC_MISC0x107 /* PowerPC miscellaneous >>> registers */ >> >> This is a really terrible name, "MISC". >> >> Having said that, I guess it's accurate. We have a whole bunch of regs that >> have accrued over recent years that aren't accessible via ptrace. >> >> It seems to me if we're adding a misc regset we should be adding everything >> we >> might want to it that is currenty architected. > > But I believe they also need to be part of the thread_struct structure to be > accessible from ptrace. Currently we dont context save/restore the PMC count registers (PMC1-PMC6) during the process context switch. So the values of PMC1..PMC6 are not thread specific in the structure. To be able to access them in ptrace when the tracee has stopped, we need to context save these counters in the thread struct. Shall we do that ? Then we can add them to the MISC regset bucket irrespective of whats the value we get in there when we probe through ptrace. The same goes for MMCRA, CFAR registers as well. > >> >> But currently you only include the PPR, TAR & DSCR. > > Yeah, thats what we started with. > >> >> Looking at Power ISA v2.07, I see the following that could be included: >> >> MMCR2 >> MMCRA >> PMC1 >> PMC2 >> PMC3 >> PMC4 >> PMC5 >> PMC6 >> MMCR0 >> EBBHR >> EBBRR >> BESCR >> SIAR >> SDAR >> CFAR? > > MMCRA, PMC[1..6], EBBHR, BESCR, EBBRR, CFAR are not part of the thread struct. Sorry. EBBRR, EBBHR, BESCR registers are part of the thread struct. > >> >> Those are all new in 2.07 except for CFAR. >> >> There might be more I missed, that was just a quick scan. >> >> Some are only accessible when EBB is in use, maybe those could be a separate >> regset. > > Yeah we can have one more regset for EBB specific registers. Should the new EBB specific regset include only EBBRR, EBBHR, BESCR registers or should it also include SIAR, SDAR, SIER, MMCR0, MMCR2 registers as well. I was thinking about putting these five registers into the MISC bucket instead. But from the perf code, it looks like these five registers are also related to the EBB context as well. Some clarity on these points would really help. 
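If a separate EBB regset is added as discussed, its .active hook could gate on the ISA 2.07 feature and on whether the thread has actually used EBB. The sketch below is illustrative only, and the used_ebb field name is an assumption:

static int ebb_active(struct task_struct *target,
		      const struct user_regset *regset)
{
	if (!cpu_has_feature(CPU_FTR_ARCH_207S))
		return -ENODEV;

	if (target->thread.used_ebb)
		return regset->n;

	return 0;
}

Returning 0 from .active makes the regset invisible in core dumps when EBB has never been used, while still failing cleanly with -ENODEV on pre-2.07 hardware.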
Re: [PATCH] selftests/powerpc: Add .gitignore for powerpc selftests
On 01/14/2015 07:45 AM, Michael Ellerman wrote: > On Tue, 2015-01-13 at 17:16 -0700, Shuah Khan wrote: >> Please add a commit log. > > What does it need to say? > >> On 01/13/2015 04:49 PM, Michael Ellerman wrote: >>> Signed-off-by: Michael Ellerman >>> --- >>> .../testing/selftests/powerpc/copyloops/.gitignore | 4 >>> tools/testing/selftests/powerpc/mm/.gitignore | 1 + >>> tools/testing/selftests/powerpc/pmu/.gitignore | 3 +++ >>> tools/testing/selftests/powerpc/pmu/ebb/.gitignore | 22 >>> ++ >>> .../selftests/powerpc/primitives/.gitignore| 1 + >>> tools/testing/selftests/powerpc/tm/.gitignore | 1 + >>> 6 files changed, 32 insertions(+) >>> create mode 100644 tools/testing/selftests/powerpc/copyloops/.gitignore >>> create mode 100644 tools/testing/selftests/powerpc/mm/.gitignore >>> create mode 100644 tools/testing/selftests/powerpc/pmu/.gitignore >>> create mode 100644 tools/testing/selftests/powerpc/pmu/ebb/.gitignore >>> create mode 100644 tools/testing/selftests/powerpc/primitives/.gitignore >>> create mode 100644 tools/testing/selftests/powerpc/tm/.gitignore >> >> Please create a single .gitignore for all targets right under >> tools/testing/selftests/powerpc instead of multiple .gitignore >> files. > > Why? Having separate files makes it less likely we'll get merge conflicts > between different test subdirectores, it also makes it more likely someone > adding a test will notice they need to update the .gitignore in the same > directory. Hey Michael, I had already posted a similar patch in this regard along with the ptrace patches last month. Thats why I mentioned in the cover letter that gitignore updates for all these tests will come from that patch series instead of from this one. https://lkml.org/lkml/2014/12/2/80 Regards Anshuman -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [V6,1/9] elf: Add new powerpc specific core note sections
On 01/01/2015 01:38 PM, Anshuman Khandual wrote: >> > Also, we've noticed that the 'misc' regset contains registers from >> > different ISA >> > versions (dscr and ppr appear in ISA 2.05, tar is from 2.07). I'm not sure >> > if >> > there is a way to detect presence/validity of such registers, but perhaps >> > it >> > might be a good idea to separate registers from different ISAs in different >> > regsets. > Thats right, will use feature CPU_FTR_ARCH_207S (which checks whether we are > v2.07 > compliant) to detect whether TAR register is available or not. > Need to correct something here. Run time detection of the presence of TAR register through the feature bit CPU_FTR_ARCH_207S as I had mentioned before is not required. Right now we take care of the compile time availability of the individual registers the same way it is present on the thread struct. In the systems which do not have the TAR register, thread.tar is always going to be 0 which is exactly the same value we would get by excluding tar register copy after the run time detection of the feature. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V7 0/8] Add new powerpc specific ELF core notes
(PASSED) TM RN GPR[12]: 3 (PASSED) TM RN GPR[13]: 2 (PASSED) TM RN GPR[14]: 1 (PASSED) TM RN GPR[15]: 0 (PASSED) TM RN GPR[16]: f (PASSED) TM RN GPR[17]: e (PASSED) TM RN GPR[18]: d (PASSED) TM RN GPR[19]: c (PASSED) TM RN GPR[20]: b (PASSED) TM RN GPR[21]: a (PASSED) TM RN GPR[22]: 9 (PASSED) TM RN GPR[23]: 8 (PASSED) TM RN GPR[24]: 7 (PASSED) TM RN GPR[25]: 6 (PASSED) TM RN GPR[26]: 5 (PASSED) TM RN GPR[27]: 4 (PASSED) TM RN GPR[28]: 3 (PASSED) TM RN GPR[29]: 2 (PASSED) TM RN GPR[30]: 1 (PASSED) TM RN GPR[31]: 0 (PASSED) Testing TM running FPR: TM RN FPSCR: 0 TM RN FPR[0]: f (PASSED) TM RN FPR[1]: e (PASSED) TM RN FPR[2]: d (PASSED) TM RN FPR[3]: c (PASSED) TM RN FPR[4]: b (PASSED) TM RN FPR[5]: a (PASSED) TM RN FPR[6]: 9 (PASSED) TM RN FPR[7]: 8 (PASSED) TM RN FPR[8]: 7 (PASSED) TM RN FPR[9]: 6 (PASSED) TM RN FPR[10]: 5 (PASSED) TM RN FPR[11]: 4 (PASSED) TM RN FPR[12]: 3 (PASSED) TM RN FPR[13]: 2 (PASSED) TM RN FPR[14]: 1 (PASSED) TM RN FPR[15]: 0 (PASSED) TM RN FPR[16]: f (PASSED) TM RN FPR[17]: e (PASSED) TM RN FPR[18]: d (PASSED) TM RN FPR[19]: c (PASSED) TM RN FPR[20]: b (PASSED) TM RN FPR[21]: a (PASSED) TM RN FPR[22]: 9 (PASSED) TM RN FPR[23]: 8 (PASSED) TM RN FPR[24]: 7 (PASSED) TM RN FPR[25]: 6 (PASSED) TM RN FPR[26]: 5 (PASSED) TM RN FPR[27]: 4 (PASSED) TM RN FPR[28]: 3 (PASSED) TM RN FPR[29]: 2 (PASSED) TM RN FPR[30]: 1 (PASSED) TM RN FPR[31]: 0 (PASSED) Testing TM running MISC debug registers: TM RN DSCR: 32 (PASSED) TM RN TAR: 3c (PASSED) TM RN PPR: 4 (PASSED) success: tm_ptrace Anshuman Khandual (8): elf: Add new powerpc specifc core note sections powerpc, process: Add the function flush_tmregs_to_thread powerpc, ptrace: Enable fpr_(get/set) for transactional memory powerpc, ptrace: Enable vr_(get/set) for transactional memory powerpc, ptrace: Enable support for transactional memory register sets powerpc, ptrace: Enable support for miscellaneous debug registers selftests, powerpc: Add test case for TM related ptrace interface selftests: Make GIT ignore all binaries in powerpc test suite arch/powerpc/include/asm/switch_to.h |8 + arch/powerpc/include/uapi/asm/elf.h|3 + arch/powerpc/kernel/process.c | 20 + arch/powerpc/kernel/ptrace.c | 1044 +++- include/uapi/linux/elf.h |5 + .../testing/selftests/powerpc/copyloops/.gitignore |4 + tools/testing/selftests/powerpc/mm/.gitignore |1 + tools/testing/selftests/powerpc/pmu/.gitignore |3 + tools/testing/selftests/powerpc/pmu/ebb/.gitignore | 22 + .../selftests/powerpc/primitives/.gitignore|1 + tools/testing/selftests/powerpc/tm/.gitignore |2 + tools/testing/selftests/powerpc/tm/Makefile|2 +- tools/testing/selftests/powerpc/tm/tm-ptrace.c | 542 ++ 13 files changed, 1632 insertions(+), 25 deletions(-) create mode 100644 tools/testing/selftests/powerpc/copyloops/.gitignore create mode 100644 tools/testing/selftests/powerpc/mm/.gitignore create mode 100644 tools/testing/selftests/powerpc/pmu/.gitignore create mode 100644 tools/testing/selftests/powerpc/pmu/ebb/.gitignore create mode 100644 tools/testing/selftests/powerpc/primitives/.gitignore create mode 100644 tools/testing/selftests/powerpc/tm/.gitignore create mode 100644 tools/testing/selftests/powerpc/tm/tm-ptrace.c -- 1.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V7 2/8] powerpc, process: Add the function flush_tmregs_to_thread
This patch creates a function flush_tmregs_to_thread which will then be used by subsequent patches in this series. The function checks for self tracing ptrace interface attempts while in the TM context and logs appropriate warning message. Signed-off-by: Anshuman Khandual --- arch/powerpc/include/asm/switch_to.h | 8 arch/powerpc/kernel/process.c| 20 2 files changed, 28 insertions(+) diff --git a/arch/powerpc/include/asm/switch_to.h b/arch/powerpc/include/asm/switch_to.h index 58abeda..23752a9 100644 --- a/arch/powerpc/include/asm/switch_to.h +++ b/arch/powerpc/include/asm/switch_to.h @@ -82,6 +82,14 @@ static inline void flush_spe_to_thread(struct task_struct *t) } #endif +#ifdef CONFIG_PPC_TRANSACTIONAL_MEM +extern void flush_tmregs_to_thread(struct task_struct *); +#else +static inline void flush_tmregs_to_thread(struct task_struct *t) +{ +} +#endif + static inline void clear_task_ebb(struct task_struct *t) { #ifdef CONFIG_PPC_BOOK3S_64 diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c index b4cc7be..a6f7ca5 100644 --- a/arch/powerpc/kernel/process.c +++ b/arch/powerpc/kernel/process.c @@ -745,6 +745,26 @@ void restore_tm_state(struct pt_regs *regs) #define __switch_to_tm(prev) #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */ +#ifdef CONFIG_PPC_TRANSACTIONAL_MEM +void flush_tmregs_to_thread(struct task_struct *tsk) +{ + /* +* Process self tracing is not yet supported through +* ptrace interface. Ptrace generic code should have +* prevented this from happening in the first place. +* Warn once here with the message, if some how it +* is attempted. +*/ + WARN_ONCE(tsk == current, + "Not expecting ptrace on self: TM regs may be incorrect\n"); + + /* +* If task is not current, it should have been flushed +* already to it's thread_struct during __switch_to(). +*/ +} +#endif + struct task_struct *__switch_to(struct task_struct *prev, struct task_struct *new) { -- 1.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V7 5/8] powerpc, ptrace: Enable support for transactional memory register sets
This patch enables get and set of transactional memory related register sets through PTRACE_GETREGSET-PTRACE_SETREGSET interface by implementing four new powerpc specific register sets i.e REGSET_TM_SPR, REGSET_TM_CGPR, REGSET_TM_CFPR, REGSET_CVMX support corresponding to these following new ELF core note types added previously in this regard. (1) NT_PPC_TM_SPR (2) NT_PPC_TM_CGPR (3) NT_PPC_TM_CFPR (4) NT_PPC_TM_CVMX Signed-off-by: Anshuman Khandual --- arch/powerpc/include/uapi/asm/elf.h | 2 + arch/powerpc/kernel/ptrace.c| 714 +++- 2 files changed, 701 insertions(+), 15 deletions(-) diff --git a/arch/powerpc/include/uapi/asm/elf.h b/arch/powerpc/include/uapi/asm/elf.h index 59dad11..fdc8e2f 100644 --- a/arch/powerpc/include/uapi/asm/elf.h +++ b/arch/powerpc/include/uapi/asm/elf.h @@ -91,6 +91,8 @@ #define ELF_NGREG 48 /* includes nip, msr, lr, etc. */ #define ELF_NFPREG 33 /* includes fpscr */ +#define ELF_NVMX 34 /* includes all vector registers */ +#define ELF_NTMSPRREG 7 /* includes TM sprs, org_msr, dscr, tar, ppr */ typedef unsigned long elf_greg_t64; typedef elf_greg_t64 elf_gregset_t64[ELF_NGREG]; diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c index 2bbbd10..b14397c 100644 --- a/arch/powerpc/kernel/ptrace.c +++ b/arch/powerpc/kernel/ptrace.c @@ -63,6 +63,11 @@ struct pt_regs_offset { {.name = STR(gpr##num), .offset = offsetof(struct pt_regs, gpr[num])} #define REG_OFFSET_END {.name = NULL, .offset = 0} +/* Some common structure offsets */ +#define TSO(f) (offsetof(struct thread_struct, f)) +#define TVSO(f)(offsetof(struct thread_vr_state, f)) +#define TFSO(f)(offsetof(struct thread_fp_state, f)) + static const struct pt_regs_offset regoffset_table[] = { GPR_OFFSET_NAME(0), GPR_OFFSET_NAME(1), @@ -792,6 +797,579 @@ static int evr_set(struct task_struct *target, const struct user_regset *regset, } #endif /* CONFIG_SPE */ +#ifdef CONFIG_PPC_TRANSACTIONAL_MEM +/** + * tm_spr_active - get active number of registers in TM SPR + * @target:The target task. + * @regset:The user regset structure. + * + * This function checks the active number of available + * regisers in the transactional memory SPR category. + */ +static int tm_spr_active(struct task_struct *target, +const struct user_regset *regset) +{ + if (!cpu_has_feature(CPU_FTR_TM)) + return -ENODEV; + + if (!MSR_TM_ACTIVE(target->thread.regs->msr)) + return 0; + + return regset->n; +} + +/** + * tm_spr_get - get the TM related SPR registers + * @target:The target task. + * @regset:The user regset structure. + * @pos: The buffer position. + * @count: Number of bytes to copy. + * @kbuf: Kernel buffer to copy from. + * @ubuf: User buffer to copy into. + * + * This function gets transactional memory related SPR registers. + * The userspace interface buffer layout is as follows. 
+ * + * struct { + * u64 tm_tfhar; + * u64 tm_texasr; + * u64 tm_tfiar; + * unsigned long tm_orig_msr; + * unsigned long tm_tar; + * unsigned long tm_ppr; + * unsigned long tm_dscr; + * }; + */ +static int tm_spr_get(struct task_struct *target, + const struct user_regset *regset, + unsigned int pos, unsigned int count, + void *kbuf, void __user *ubuf) +{ + int ret; + + /* Build tests */ + BUILD_BUG_ON(TSO(tm_tfhar) + sizeof(u64) != TSO(tm_texasr)); + BUILD_BUG_ON(TSO(tm_texasr) + sizeof(u64) != TSO(tm_tfiar)); + BUILD_BUG_ON(TSO(tm_tfiar) + sizeof(u64) != TSO(tm_orig_msr)); + BUILD_BUG_ON(TSO(ckpt_regs) + sizeof(struct pt_regs) != TSO(tm_tar)); + BUILD_BUG_ON(TSO(tm_tar) + sizeof(unsigned long) != TSO(tm_ppr)); + BUILD_BUG_ON(TSO(tm_ppr) + sizeof(unsigned long) != TSO(tm_dscr)); + + if (!cpu_has_feature(CPU_FTR_TM)) + return -ENODEV; + + if (!MSR_TM_ACTIVE(target->thread.regs->msr)) + return -ENODATA; + + /* Flush the states */ + flush_fp_to_thread(target); + flush_altivec_to_thread(target); + flush_tmregs_to_thread(target); + + /* TFHAR register */ + ret = user_regset_copyout(&pos, &count, &kbuf, &ubuf, + &target->thread.tm_tfhar, 0, sizeof(u64)); + + /* TEXASR register */ + if (!ret) + ret = user_regset_copyout(&pos, &count, &kbuf, &ubuf, + &target->thread.tm_texasr, sizeof(u64), + 2 * sizeof(u64)); + + /* TFIAR register */ + if (!ret) + ret = user_regset_copyout(&pos, &count, &kbuf, &ubuf, +
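From userspace, the new note types are reached through the regset ptrace calls. A minimal example of reading the TM SPR block with the layout documented above; NT_PPC_TM_SPR is assumed to come from the updated uapi elf.h once this series is applied:

#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/ptrace.h>
#include <sys/uio.h>
#include <linux/elf.h>

/* Dump the TM SPR regset of a stopped tracee; 'child' is its pid */
static void dump_tm_sprs(pid_t child)
{
	struct {
		uint64_t tm_tfhar, tm_texasr, tm_tfiar;
		unsigned long tm_orig_msr, tm_tar, tm_ppr, tm_dscr;
	} tm_spr;
	struct iovec iov = { .iov_base = &tm_spr, .iov_len = sizeof(tm_spr) };

	if (ptrace(PTRACE_GETREGSET, child, NT_PPC_TM_SPR, &iov) == -1) {
		perror("PTRACE_GETREGSET(NT_PPC_TM_SPR)");
		return;
	}

	printf("TFHAR %#llx TEXASR %#llx TFIAR %#llx\n",
	       (unsigned long long)tm_spr.tm_tfhar,
	       (unsigned long long)tm_spr.tm_texasr,
	       (unsigned long long)tm_spr.tm_tfiar);
}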
[PATCH V7 6/8] powerpc, ptrace: Enable support for miscellaneous debug registers
This patch enables get and set of miscellaneous debug registers through ptrace PTRACE_GETREGSET-PTRACE_SETREGSET interface by implementing new powerpc specific register set REGSET_MISC support corresponding to the new ELF core note NT_PPC_MISC added previously in this regard. Signed-off-by: Anshuman Khandual --- arch/powerpc/include/uapi/asm/elf.h | 1 + arch/powerpc/kernel/ptrace.c| 133 2 files changed, 134 insertions(+) diff --git a/arch/powerpc/include/uapi/asm/elf.h b/arch/powerpc/include/uapi/asm/elf.h index fdc8e2f..a41bd98 100644 --- a/arch/powerpc/include/uapi/asm/elf.h +++ b/arch/powerpc/include/uapi/asm/elf.h @@ -93,6 +93,7 @@ #define ELF_NFPREG 33 /* includes fpscr */ #define ELF_NVMX 34 /* includes all vector registers */ #define ELF_NTMSPRREG 7 /* includes TM sprs, org_msr, dscr, tar, ppr */ +#define ELF_NMISCREG 3 /* includes dscr, tar, ppr */ typedef unsigned long elf_greg_t64; typedef elf_greg_t64 elf_gregset_t64[ELF_NGREG]; diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c index b14397c..38a1147 100644 --- a/arch/powerpc/kernel/ptrace.c +++ b/arch/powerpc/kernel/ptrace.c @@ -1371,6 +1371,122 @@ static int tm_cvmx_set(struct task_struct *target, } #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */ +#ifdef CONFIG_PPC64 +/** + * get_misc_dbg() - get MISC debug registers + * @target:The target task. + * @regset:The user regset structure. + * @pos: The buffer position. + * @count: Number of bytes to copy. + * @kbuf: Kernel buffer to copy from. + * @ubuf: User buffer to copy into. + * + * This function gets various miscellaneous debug registers which includes + * DSCR, PPR and TAR. The userspace intarface buffer layout is as follows. + * + * struct { + * unsigned long dscr; + * unsigned long ppr; + * unsigned long tar; + * }; + * + * The data element 'tar' in the structure will be valid only if the kernel + * has CONFIG_PPC_BOOK3S_64 config option enabled. + */ +static int get_misc_dbg(struct task_struct *target, + const struct user_regset *regset, unsigned int pos, + unsigned int count, void *kbuf, void __user *ubuf) +{ + int ret; + + /* Build test */ + BUILD_BUG_ON(TSO(dscr) + 2 * sizeof(unsigned long) != TSO(ppr)); + +#ifdef CONFIG_PPC_BOOK3S_64 + BUILD_BUG_ON(TSO(ppr) + sizeof(unsigned long) != TSO(tar)); +#endif + + /* DSCR register */ + ret = user_regset_copyout(&pos, &count, &kbuf, &ubuf, + &target->thread.dscr, 0, + sizeof(unsigned long)); + + /* PPR register */ + if (!ret) + ret = user_regset_copyout(&pos, &count, &kbuf, &ubuf, + &target->thread.ppr, + sizeof(unsigned long), + 2 * sizeof(unsigned long)); + +#ifdef CONFIG_PPC_BOOK3S_64 + /* TAR register */ + if (!ret) + ret = user_regset_copyout(&pos, &count, &kbuf, &ubuf, + &target->thread.tar, + 2 * sizeof(unsigned long), + 3 * sizeof(unsigned long)); +#endif + return ret; +} + +/** + * set_misc_dbg() - set MISC debug registers + * @target:The target task. + * @regset:The user regset structure. + * @pos: The buffer position. + * @count: Number of bytes to copy. + * @kbuf: Kernel buffer to copy into. + * @ubuf: User buffer to copy from. + * + * This function sets various miscellaneous debug registers which includes + * DSCR, PPR and TAR. The userspace intarface buffer layout is as follows. + * + * struct { + * unsigned long dscr; + * unsigned long ppr; + * unsigned long tar; + * }; + * + * The data element 'tar' in the structure will be valid only if the kernel + * has CONFIG_PPC_BOOK3S_64 config option enabled. 
+ */ +static int set_misc_dbg(struct task_struct *target, + const struct user_regset *regset, unsigned int pos, + unsigned int count, const void *kbuf, + const void __user *ubuf) +{ + int ret; + + /* Build test */ + BUILD_BUG_ON(TSO(dscr) + 2 * sizeof(unsigned long) != TSO(ppr)); + +#ifdef CONFIG_PPC_BOOK3S_64 + BUILD_BUG_ON(TSO(ppr) + sizeof(unsigned long) != TSO(tar)); +#endif + + /* DSCR register */ + ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf, + &target->thread.dscr, 0, + sizeof(unsigned long)); + + /* PPR register */ + if (!ret) +
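A user space consumer of the new REGSET_MISC could look roughly like the sketch below. This is a hypothetical illustration, not code from the series: the note number and the dscr/ppr/tar layout are the ones documented above, while the helper names and the read-modify-write pattern for PTRACE_SETREGSET are just one reasonable way to use them.

#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>

#ifndef NT_PPC_MISC
#define NT_PPC_MISC	0x107		/* PowerPC miscellaneous registers */
#endif

/*
 * Layout documented in get_misc_dbg()/set_misc_dbg() above; 'tar' is
 * only meaningful when the kernel is built with CONFIG_PPC_BOOK3S_64.
 */
struct ppc_misc_regs {
	unsigned long dscr;
	unsigned long ppr;
	unsigned long tar;
};

static int misc_regs_read(pid_t pid, struct ppc_misc_regs *mr)
{
	struct iovec iov = { .iov_base = mr, .iov_len = sizeof(*mr) };

	return ptrace(PTRACE_GETREGSET, pid, (void *)NT_PPC_MISC, &iov) ? -1 : 0;
}

/* Update only the tracee's DSCR, leaving PPR and TAR untouched */
static int misc_regs_set_dscr(pid_t pid, unsigned long dscr)
{
	struct ppc_misc_regs mr;
	struct iovec iov = { .iov_base = &mr, .iov_len = sizeof(mr) };

	if (ptrace(PTRACE_GETREGSET, pid, (void *)NT_PPC_MISC, &iov))
		return -1;
	mr.dscr = dscr;
	return ptrace(PTRACE_SETREGSET, pid, (void *)NT_PPC_MISC, &iov) ? -1 : 0;
}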
[PATCH V7 3/8] powerpc, ptrace: Enable fpr_(get/set) for transactional memory
This patch enables the fpr_get which gets the running value of all the FPR registers and the fpr_set which sets the running value of of all the FPR registers to accommodate in transaction ptrace interface based requests. Signed-off-by: Anshuman Khandual --- arch/powerpc/kernel/ptrace.c | 103 --- 1 file changed, 97 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c index f21897b..fd36b32 100644 --- a/arch/powerpc/kernel/ptrace.c +++ b/arch/powerpc/kernel/ptrace.c @@ -357,6 +357,33 @@ static int gpr_set(struct task_struct *target, const struct user_regset *regset, return ret; } + +/* + * fpr_get + * + * When the transaction is active, 'transact_fp' holds the current running + * value of all FPR registers and 'fp_state' holds the last checkpointed + * value of all FPR registers for the current transaction. When transaction + * is not active 'fp_state' holds the current running state of all the FPR + * registers. So this function which returns the current running values of + * all the FPR registers, needs to know whether any transaction is active + * or not. + * + * Userspace interface buffer layout: + * + * struct data { + * u64 fpr[32]; + * u64 fpscr; + * }; + * + * There are two config options CONFIG_VSX and CONFIG_PPC_TRANSACTIONAL_MEM + * which determines the final code in this function. All the combinations of + * these two config options are possible except the one below as transactional + * memory config pulls in CONFIG_VSX automatically. + * + * !defined(CONFIG_VSX) && defined(CONFIG_PPC_TRANSACTIONAL_MEM) + * + */ static int fpr_get(struct task_struct *target, const struct user_regset *regset, unsigned int pos, unsigned int count, void *kbuf, void __user *ubuf) @@ -367,22 +394,64 @@ static int fpr_get(struct task_struct *target, const struct user_regset *regset, #endif flush_fp_to_thread(target); -#ifdef CONFIG_VSX +#if defined(CONFIG_VSX) && defined(CONFIG_PPC_TRANSACTIONAL_MEM) + /* copy to local buffer then write that out */ + if (MSR_TM_ACTIVE(target->thread.regs->msr)) { + flush_altivec_to_thread(target); + flush_tmregs_to_thread(target); + for (i = 0; i < 32 ; i++) + buf[i] = target->thread.TS_TRANS_FPR(i); + buf[32] = target->thread.transact_fp.fpscr; + } else { + for (i = 0; i < 32 ; i++) + buf[i] = target->thread.TS_FPR(i); + buf[32] = target->thread.fp_state.fpscr; + } + return user_regset_copyout(&pos, &count, &kbuf, &ubuf, buf, 0, -1); +#endif + +#if defined(CONFIG_VSX) && !defined(CONFIG_PPC_TRANSACTIONAL_MEM) /* copy to local buffer then write that out */ for (i = 0; i < 32 ; i++) buf[i] = target->thread.TS_FPR(i); buf[32] = target->thread.fp_state.fpscr; return user_regset_copyout(&pos, &count, &kbuf, &ubuf, buf, 0, -1); +#endif -#else + +#if !defined(CONFIG_VSX) && !defined(CONFIG_PPC_TRANSACTIONAL_MEM) BUILD_BUG_ON(offsetof(struct thread_fp_state, fpscr) != offsetof(struct thread_fp_state, fpr[32][0])); - return user_regset_copyout(&pos, &count, &kbuf, &ubuf, &target->thread.fp_state, 0, -1); #endif } +/* + * fpr_set + * + * When the transaction is active, 'transact_fp' holds the current running + * value of all FPR registers and 'fp_state' holds the last checkpointed + * value of all FPR registers for the current transaction. When transaction + * is not active 'fp_state' holds the current running state of all the FPR + * registers. So this function which setss the current running values of + * all the FPR registers, needs to know whether any transaction is active + * or not. 
+ * + * Userspace interface buffer layout: + * + * struct data { + * u64 fpr[32]; + * u64 fpscr; + * }; + * + * There are two config options CONFIG_VSX and CONFIG_PPC_TRANSACTIONAL_MEM + * which determines the final code in this function. All the combinations of + * these two config options are possible except the one below as transactional + * memory config pulls in CONFIG_VSX automatically. + * + * !defined(CONFIG_VSX) && defined(CONFIG_PPC_TRANSACTIONAL_MEM) + */ static int fpr_set(struct task_struct *target, const struct user_regset *regset, unsigned int pos, unsigned int count, const void *kbuf, const void __user *ubuf) @@ -393,7 +462,27 @@ static int fpr_set(struct task_struct *target, const struct user_regset *regset, #endif flush_fp_to_thread(target); -#ifdef CONFIG_VSX +#if defined(CONFI
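Per the description above, a tracer that stops a thread inside an active transaction and fetches the FPR regset now receives the running (transactional) values. A hedged sketch of that read, using the standard NT_PRFPREG note and the u64 fpr[32] plus fpscr layout from the comment; the struct and helper names are invented here for illustration.

#include <elf.h>			/* NT_PRFPREG */
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Mirrors the 'struct data' layout described in the comment above */
struct ppc_fpr_regs {
	uint64_t fpr[32];
	uint64_t fpscr;
};

static int read_running_fprs(pid_t pid, struct ppc_fpr_regs *fp)
{
	struct iovec iov = { .iov_base = fp, .iov_len = sizeof(*fp) };

	/* With this patch the returned values reflect transact_fp when
	 * the tracee is stopped inside an active transaction. */
	return ptrace(PTRACE_GETREGSET, pid, (void *)NT_PRFPREG, &iov) ? -1 : 0;
}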
[PATCH V7 4/8] powerpc, ptrace: Enable vr_(get/set) for transactional memory
This patch enables the vr_get which gets the running value of all the VMX registers and the vr_set which sets the running value of of all the VMX registers to accommodate in transaction ptrace interface based requests. Signed-off-by: Anshuman Khandual --- arch/powerpc/kernel/ptrace.c | 94 ++-- 1 file changed, 91 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c index fd36b32..2bbbd10 100644 --- a/arch/powerpc/kernel/ptrace.c +++ b/arch/powerpc/kernel/ptrace.c @@ -523,10 +523,30 @@ static int vr_active(struct task_struct *target, return target->thread.used_vr ? regset->n : 0; } +/* + * vr_get + * + * When the transaction is active, 'transact_vr' holds the current running + * value of all the VMX registers and 'vr_state' holds the last checkpointed + * value of all the VMX registers for the current transaction to fall back + * on in case it aborts. When transaction is not active 'vr_state' holds + * the current running state of all the VMX registers. So this function which + * gets the current running values of all the VMX registers, needs to know + * whether any transaction is active or not. + * + * Userspace interface buffer layout: + * + * struct data { + * vector128 vr[32]; + * vector128 vscr; + * vector128 vrsave; + * }; + */ static int vr_get(struct task_struct *target, const struct user_regset *regset, unsigned int pos, unsigned int count, void *kbuf, void __user *ubuf) { + struct thread_vr_state *addr; int ret; flush_altivec_to_thread(target); @@ -534,8 +554,19 @@ static int vr_get(struct task_struct *target, const struct user_regset *regset, BUILD_BUG_ON(offsetof(struct thread_vr_state, vscr) != offsetof(struct thread_vr_state, vr[32])); +#ifdef CONFIG_PPC_TRANSACTIONAL_MEM + if (MSR_TM_ACTIVE(target->thread.regs->msr)) { + flush_fp_to_thread(target); + flush_tmregs_to_thread(target); + addr = &target->thread.transact_vr; + } else { + addr = &target->thread.vr_state; + } +#else + addr = &target->thread.vr_state; +#endif ret = user_regset_copyout(&pos, &count, &kbuf, &ubuf, - &target->thread.vr_state, 0, + addr, 0, 33 * sizeof(vector128)); if (!ret) { /* @@ -546,7 +577,16 @@ static int vr_get(struct task_struct *target, const struct user_regset *regset, u32 word; } vrsave; memset(&vrsave, 0, sizeof(vrsave)); + +#ifdef CONFIG_PPC_TRANSACTIONAL_MEM + if (MSR_TM_ACTIVE(target->thread.regs->msr)) + vrsave.word = target->thread.transact_vrsave; + else + vrsave.word = target->thread.vrsave; +#else vrsave.word = target->thread.vrsave; +#endif + ret = user_regset_copyout(&pos, &count, &kbuf, &ubuf, &vrsave, 33 * sizeof(vector128), -1); } @@ -554,10 +594,30 @@ static int vr_get(struct task_struct *target, const struct user_regset *regset, return ret; } +/* + * vr_set + * + * When the transaction is active, 'transact_vr' holds the current running + * value of all the VMX registers and 'vr_state' holds the last checkpointed + * value of all the VMX registers for the current transaction to fall back + * on in case it aborts. When transaction is not active 'vr_state' holds + * the current running state of all the VMX registers. So this function which + * sets the current running values of all the VMX registers, needs to know + * whether any transaction is active or not. 
+ * + * Userspace interface buffer layout: + * + * struct data { + * vector128 vr[32]; + * vector128 vscr; + * vector128 vrsave; + * }; + */ static int vr_set(struct task_struct *target, const struct user_regset *regset, unsigned int pos, unsigned int count, const void *kbuf, const void __user *ubuf) { + struct thread_vr_state *addr; int ret; flush_altivec_to_thread(target); @@ -565,8 +625,19 @@ static int vr_set(struct task_struct *target, const struct user_regset *regset, BUILD_BUG_ON(offsetof(struct thread_vr_state, vscr) != offsetof(struct thread_vr_state, vr[32])); +#ifdef CONFIG_PPC_TRANSACTIONAL_MEM + if (MSR_TM_ACTIVE(target->thread.regs->msr)) { + flush_fp_to_thread(target); + flush_tmregs_to_thread(target); + addr = &target->thread.transact_vr; +
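The same idea applies to the VMX regset: a tracee stopped inside a transaction reports its running vector state. Below is a minimal, hypothetical read of NT_PPC_VMX using the vr[32]/vscr/vrsave layout from the comment; the 16-byte vec128 container is only a stand-in for the kernel's vector128 type, and the helper name is an assumption.

#include <elf.h>			/* NT_PPC_VMX */
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>

typedef struct {
	uint32_t u[4];
} __attribute__((aligned(16))) vec128;

/* vrsave is a 32 bit register but occupies a full 16 byte slot here,
 * matching the union the kernel copies out at offset 33 * 16. */
struct ppc_vmx_regs {
	vec128 vr[32];
	vec128 vscr;
	vec128 vrsave;
};

static int read_running_vmx(pid_t pid, struct ppc_vmx_regs *vr)
{
	struct iovec iov = { .iov_base = vr, .iov_len = sizeof(*vr) };

	return ptrace(PTRACE_GETREGSET, pid, (void *)NT_PPC_VMX, &iov) ? -1 : 0;
}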
[PATCH V7 8/8] selftests: Make GIT ignore all binaries in powerpc test suite
This patch includes all of the powerpc test binaries into the .gitignore file listing in their respective directories. This will make sure that GIT ignores all of these test binaries while displaying status. Signed-off-by: Anshuman Khandual --- NOTE: As Michael had mentioned (https://lkml.org/lkml/2015/1/13/868) adding binaries into different .gitignore file in their respective test directories under powerpc. .../testing/selftests/powerpc/copyloops/.gitignore | 4 tools/testing/selftests/powerpc/mm/.gitignore | 1 + tools/testing/selftests/powerpc/pmu/.gitignore | 3 +++ tools/testing/selftests/powerpc/pmu/ebb/.gitignore | 22 ++ .../selftests/powerpc/primitives/.gitignore| 1 + tools/testing/selftests/powerpc/tm/.gitignore | 2 ++ 6 files changed, 33 insertions(+) create mode 100644 tools/testing/selftests/powerpc/copyloops/.gitignore create mode 100644 tools/testing/selftests/powerpc/mm/.gitignore create mode 100644 tools/testing/selftests/powerpc/pmu/.gitignore create mode 100644 tools/testing/selftests/powerpc/pmu/ebb/.gitignore create mode 100644 tools/testing/selftests/powerpc/primitives/.gitignore create mode 100644 tools/testing/selftests/powerpc/tm/.gitignore diff --git a/tools/testing/selftests/powerpc/copyloops/.gitignore b/tools/testing/selftests/powerpc/copyloops/.gitignore new file mode 100644 index 000..25a192f --- /dev/null +++ b/tools/testing/selftests/powerpc/copyloops/.gitignore @@ -0,0 +1,4 @@ +copyuser_64 +copyuser_power7 +memcpy_64 +memcpy_power7 diff --git a/tools/testing/selftests/powerpc/mm/.gitignore b/tools/testing/selftests/powerpc/mm/.gitignore new file mode 100644 index 000..156f4e8 --- /dev/null +++ b/tools/testing/selftests/powerpc/mm/.gitignore @@ -0,0 +1 @@ +hugetlb_vs_thp_test diff --git a/tools/testing/selftests/powerpc/pmu/.gitignore b/tools/testing/selftests/powerpc/pmu/.gitignore new file mode 100644 index 000..e748f33 --- /dev/null +++ b/tools/testing/selftests/powerpc/pmu/.gitignore @@ -0,0 +1,3 @@ +count_instructions +l3_bank_test +per_event_excludes diff --git a/tools/testing/selftests/powerpc/pmu/ebb/.gitignore b/tools/testing/selftests/powerpc/pmu/ebb/.gitignore new file mode 100644 index 000..42bddbe --- /dev/null +++ b/tools/testing/selftests/powerpc/pmu/ebb/.gitignore @@ -0,0 +1,22 @@ +reg_access_test +event_attributes_test +cycles_test +cycles_with_freeze_test +pmc56_overflow_test +ebb_vs_cpu_event_test +cpu_event_vs_ebb_test +cpu_event_pinned_vs_ebb_test +task_event_vs_ebb_test +task_event_pinned_vs_ebb_test +multi_ebb_procs_test +multi_counter_test +pmae_handling_test +close_clears_pmcc_test +instruction_count_test +fork_cleanup_test +ebb_on_child_test +ebb_on_willing_child_test +back_to_back_ebbs_test +lost_exception_test +no_handler_test +cycles_with_mmcr2_test diff --git a/tools/testing/selftests/powerpc/primitives/.gitignore b/tools/testing/selftests/powerpc/primitives/.gitignore new file mode 100644 index 000..4cc4e31 --- /dev/null +++ b/tools/testing/selftests/powerpc/primitives/.gitignore @@ -0,0 +1 @@ +load_unaligned_zeropad diff --git a/tools/testing/selftests/powerpc/tm/.gitignore b/tools/testing/selftests/powerpc/tm/.gitignore new file mode 100644 index 000..71f9f9d --- /dev/null +++ b/tools/testing/selftests/powerpc/tm/.gitignore @@ -0,0 +1,2 @@ +tm-ptrace +tm-resched-dscr -- 1.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V7 7/8] selftests, powerpc: Add test case for TM related ptrace interface
This patch adds one more test case called 'tm-ptrace' targeting TM related ptrace interface. This test creates one child process to run some basic TM transactions and the parent process attaches the child to do some ptrace probing using the recently added regset interfaces. The parent process then compares the received values against the expected values to verify whether it has passed the given test or not. Signed-off-by: Anshuman Khandual --- tools/testing/selftests/powerpc/tm/Makefile| 2 +- tools/testing/selftests/powerpc/tm/tm-ptrace.c | 542 + 2 files changed, 543 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/powerpc/tm/tm-ptrace.c diff --git a/tools/testing/selftests/powerpc/tm/Makefile b/tools/testing/selftests/powerpc/tm/Makefile index 2cede23..71d400a 100644 --- a/tools/testing/selftests/powerpc/tm/Makefile +++ b/tools/testing/selftests/powerpc/tm/Makefile @@ -1,4 +1,4 @@ -PROGS := tm-resched-dscr +PROGS := tm-resched-dscr tm-ptrace all: $(PROGS) diff --git a/tools/testing/selftests/powerpc/tm/tm-ptrace.c b/tools/testing/selftests/powerpc/tm/tm-ptrace.c new file mode 100644 index 000..7a6c7d3 --- /dev/null +++ b/tools/testing/selftests/powerpc/tm/tm-ptrace.c @@ -0,0 +1,542 @@ +/* + * Test program for TM ptrace interface + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + * + * Copyright 2014 IBM Corporation + * + * Author: Anshuman Khandual + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "utils.h" + +#define TEST_PASS 0 +#define TEST_FAIL 1 + +#define MAX_STR_LENGTH 100 + +/* ELF core notes */ +#define NT_PPC_TM_SPR 0x103 /* PowerPC TM special registers */ +#define NT_PPC_TM_CGPR 0x104 /* PowerpC TM checkpointed GPR */ +#define NT_PPC_TM_CFPR 0x105 /* PowerPC TM checkpointed FPR */ +#define NT_PPC_TM_CVMX 0x106 /* PowerPC TM checkpointed VMX */ +#define NT_PPC_MISC0x107 /* PowerPC miscellaneous registers */ + +/* TM instructions */ +#define TBEGIN ".long 0x7C00051D ;" +#define TEND ".long 0x7C00055D ;" + +/* SPR number */ +#define SPRN_DSCR 3 +#define SPRN_TAR 815 +#define SPRN_PPR 896 + +#define C_DSCR 10 /* TM checkpointed DSCR */ +#define C_TAR 20 /* TM checkpointed TAR */ +#define C_PPR 0x8 /* TM checkpointed PPR */ + +#define DSCR 50 /* TM running DSCR */ +#define TAR60 /* TM running TAR */ +#define PPR0x4 /* TM running PPR */ + +/* Values for GPR-FPR[0..31] */ +#define VAL0 0 +#define VAL1 1 +#define VAL2 2 +#define VAL3 3 +#define VAL4 4 +#define VAL5 5 +#define VAL6 6 +#define VAL7 7 +#define VAL8 8 +#define VAL9 9 +#define VAL10 10 +#define VAL11 11 +#define VAL12 12 +#define VAL13 13 +#define VAL14 14 +#define VAL15 15 +#define VAL_MAX16 + +/* Standard data types */ +typedef unsigned int u32; +typedef __vector128 vector128; + +/* NT_PPC_TM_SPR buffer layout */ +struct tm_spr_regs { + u64 tm_tfhar; + u64 tm_texasr; + u64 tm_tfiar; + u64 tm_orig_msr; + u64 tm_tar; + u64 tm_ppr; + u64 tm_dscr; +}; + +/* + * NT_PPC_TM_CGPR buffer layout + * + * Same as that of struct pt_regs + */ + +/* NT_PPC_TM_CFPR buffer layout */ +struct tm_cfpr { + u64 fpr[32]; + u64 fpscr; +}; + +/* NT_PPC_TM_CVMX buffer layout */ +struct tm_cvmx { + vector128 vr[32] __attribute__((aligned(16))); + vector128 vscr __attribute__((aligned(16))); + u32 vrsave; +}; + +/* 
NT_PPC_MISC buffer layout */ +struct misc_regs { + u64 dscr; + u64 ppr; + u64 tar; +}; + +/* + * do_tm_transaction + * + * This function sets the values for TAR, DSCR, PPR, GPR[0..31], + * FPR[0..31] registers before starting the transaction which will + * enable the kernel to save them as checkpointed values. Then it + * starts the transaction where it loads a different set of values + * into the same registers again thus enabling the kernel to save + * them off as running values for this transaction. Then the function + * gets stuck forcing the process to loop at one single instruction. + * The transaction never finishes, thus giving the parent process + * the opportunity to trace the running and checkpointed values of + * various registers. + */ +void do_tm_transaction(void) +{ + asm __volatile__( + /* TM checkpointed values */ + + /* SPR */ + "li 0, %[c_tar];" /* TAR */ + "mtspr %[sprn_ta
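The parent side of the test then attaches to the spinning child and pulls the new regsets. The outline below is a hedged sketch of that flow, not the test's actual code: it reuses the hypothetical read_tm_sprs()/misc_regs_read() helpers sketched earlier, and the real test compares the fetched values against the checkpointed (C_*) and running values the child loaded around tbegin.

#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <stddef.h>

static int probe_child(pid_t child)
{
	int status;

	if (ptrace(PTRACE_ATTACH, child, NULL, NULL))
		return 1;
	if (waitpid(child, &status, 0) != child || !WIFSTOPPED(status))
		return 1;

	/* e.g. read_tm_sprs(child, &tm) and misc_regs_read(child, &misc),
	 * then compare the returned DSCR/TAR/PPR images against the
	 * values the child set up before and inside the transaction. */

	ptrace(PTRACE_DETACH, child, NULL, NULL);
	return 0;
}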
[PATCH V7 1/8] elf: Add new powerpc specific core note sections
This patch adds four new ELF core note sections for powerpc transactional memory and one new ELF core note section for powerpc general miscellaneous debug registers. These new ELF core note sections extend the existing ELF ABI without affecting it in any manner. Acked-by: Andrew Morton Signed-off-by: Anshuman Khandual --- include/uapi/linux/elf.h | 5 + 1 file changed, 5 insertions(+) diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h index 71e1d0e..0fd9983 100644 --- a/include/uapi/linux/elf.h +++ b/include/uapi/linux/elf.h @@ -379,6 +379,11 @@ typedef struct elf64_shdr { #define NT_PPC_VMX 0x100 /* PowerPC Altivec/VMX registers */ #define NT_PPC_SPE 0x101 /* PowerPC SPE/EVR registers */ #define NT_PPC_VSX 0x102 /* PowerPC VSX registers */ +#define NT_PPC_TM_SPR 0x103 /* PowerPC TM special registers */ +#define NT_PPC_TM_CGPR 0x104 /* PowerPC TM checkpointed GPR */ +#define NT_PPC_TM_CFPR 0x105 /* PowerPC TM checkpointed FPR */ +#define NT_PPC_TM_CVMX 0x106 /* PowerPC TM checkpointed VMX */ +#define NT_PPC_MISC 0x107 /* PowerPC miscellaneous registers */ #define NT_386_TLS 0x200 /* i386 TLS slots (struct user_desc) */ #define NT_386_IOPERM 0x201 /* x86 io permission bitmap (1=deny) */ #define NT_X86_XSTATE 0x202 /* x86 extended state using xsave */ -- 1.9.3
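Notes with these types can also show up in the PT_NOTE segment of a process core dump. The snippet below is a small, hypothetical scan of such an in-memory note blob, following the standard 4-byte ELF note alignment; nothing here is taken from the kernel's core dump writer, and scan_ppc_notes() is a name invented for illustration.

#include <elf.h>
#include <stdio.h>
#include <stddef.h>

#define NOTE_ALIGN(x)	(((x) + 3) & ~3UL)

/* Walk an in-memory PT_NOTE blob and report any of the new 0x103-0x107
 * PowerPC note types defined above. */
static void scan_ppc_notes(const unsigned char *buf, size_t len)
{
	size_t off = 0;

	while (off + sizeof(Elf64_Nhdr) <= len) {
		const Elf64_Nhdr *nhdr = (const void *)(buf + off);
		size_t desc = off + sizeof(*nhdr) + NOTE_ALIGN(nhdr->n_namesz);

		if (nhdr->n_type >= 0x103 && nhdr->n_type <= 0x107)
			printf("PowerPC note 0x%x, %u byte payload\n",
			       nhdr->n_type, nhdr->n_descsz);

		off = desc + NOTE_ALIGN(nhdr->n_descsz);
	}
}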
[PATCH V2 00/12] POWER DSCR fixes, improvements, docs and tests
This patch series has patches for POWER DSCR fixes, improvements, in code documentaion, kernel support user documentation and selftest based test cases. It has got five test cases which are derived from Anton's DSCR test bucket which can be listed as follows. (1) http://ozlabs.org/~anton/junkcode/dscr_default_test.c (2) http://ozlabs.org/~anton/junkcode/dscr_explicit_test.c (3) http://ozlabs.org/~anton/junkcode/dscr_inherit_exec_test.c (4) http://ozlabs.org/~anton/junkcode/dscr_inherit_test.c (5) http://ozlabs.org/~anton/junkcode/user_dscr_test.c So the derivied test cases have Anton's copyright along with mine but the commit message as of now has only my signed-off-by statement. As Anton mentioned before he would put his signed-off-by after reviewing these modified test cases. NOTE1: Anton's original inherit exec test expected the child to have system default DSCR value instead of the inherited DSCR value from it's parent. But looks like thats not the case when we execute the test, it always inherits it's parent's DSCR value over the exec call as well. So I have changed the test program assuming its correct to have the inherited DSCR value in the fork/execed child program. Please let me know if thats not correct and I am missing something there. NOTE2: The selftests/powerpc/.gitignore will be added and will get updated through a different patch series related to ptrace instead of this one. Changes in V2: - - Updated the thread struct DSCR value inside mtspr facility exception path - Modified the in code documentation to follow the kernel-doc format - Added seven selftest based DSCR related test cases under powerpc Original V1: - Posted here at https://patchwork.ozlabs.org/patch/418583/ Anshuman Khandual (12): powerpc: Fix handling of DSCR related facility unavailable exception powerpc, process: Remove the unused extern dscr_default powerpc, offset: Change PACA_DSCR to PACA_DSCR_DEFAULT powerpc, dscr: Added some in-code documentation documentation, powerpc: Add documentation for DSCR support selftests, powerpc: Add test for system wide DSCR default selftests, powerpc: Add test for explicitly changing DSCR value selftests, powerpc: Add test for DSCR SPR numbers selftests, powerpc: Add test for DSCR value inheritence across fork selftests, powerpc: Add test for DSCR inheritence across fork & exec selftests, powerpc: Add test for all DSCR sysfs interfaces selftests, powerpc: Add thread based stress test for DSCR sysfs interfaces Documentation/powerpc/00-INDEX | 2 + Documentation/powerpc/dscr.txt | 83 ++ arch/powerpc/include/asm/processor.h | 9 ++ arch/powerpc/kernel/asm-offsets.c | 2 +- arch/powerpc/kernel/entry_64.S | 2 +- arch/powerpc/kernel/process.c | 2 - arch/powerpc/kernel/sysfs.c| 38 +++ arch/powerpc/kernel/tm.S | 4 +- arch/powerpc/kernel/traps.c| 45 +++- arch/powerpc/kvm/book3s_hv_rmhandlers.S| 2 +- tools/testing/selftests/powerpc/Makefile | 2 +- tools/testing/selftests/powerpc/dscr/Makefile | 19 tools/testing/selftests/powerpc/dscr/dscr.h| 120 .../selftests/powerpc/dscr/dscr_default_test.c | 121 + .../selftests/powerpc/dscr/dscr_explicit_test.c| 72 .../powerpc/dscr/dscr_inherit_exec_test.c | 118 .../selftests/powerpc/dscr/dscr_inherit_test.c | 96 .../selftests/powerpc/dscr/dscr_sysfs_test.c | 89 +++ .../powerpc/dscr/dscr_sysfs_thread_test.c | 114 +++ .../selftests/powerpc/dscr/dscr_user_test.c| 62 +++ 20 files changed, 989 insertions(+), 13 deletions(-) create mode 100644 Documentation/powerpc/dscr.txt create mode 100644 tools/testing/selftests/powerpc/dscr/Makefile create mode 
100644 tools/testing/selftests/powerpc/dscr/dscr.h create mode 100644 tools/testing/selftests/powerpc/dscr/dscr_default_test.c create mode 100644 tools/testing/selftests/powerpc/dscr/dscr_explicit_test.c create mode 100644 tools/testing/selftests/powerpc/dscr/dscr_inherit_exec_test.c create mode 100644 tools/testing/selftests/powerpc/dscr/dscr_inherit_test.c create mode 100644 tools/testing/selftests/powerpc/dscr/dscr_sysfs_test.c create mode 100644 tools/testing/selftests/powerpc/dscr/dscr_sysfs_thread_test.c create mode 100644 tools/testing/selftests/powerpc/dscr/dscr_user_test.c -- 1.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V2 12/12] selftests, powerpc: Add thread based stress test for DSCR sysfs interfaces
This patch adds a test to update the system wide DSCR value repeatedly and then verifies that any thread on any given CPU on the system must be able to see the same DSCR value whether its is being read through the problem state based SPR or the privilege state based SPR. Signed-off-by: Anshuman Khandual --- tools/testing/selftests/powerpc/dscr/Makefile | 2 +- .../powerpc/dscr/dscr_sysfs_thread_test.c | 114 + 2 files changed, 115 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/powerpc/dscr/dscr_sysfs_thread_test.c diff --git a/tools/testing/selftests/powerpc/dscr/Makefile b/tools/testing/selftests/powerpc/dscr/Makefile index fada526..834ef88 100644 --- a/tools/testing/selftests/powerpc/dscr/Makefile +++ b/tools/testing/selftests/powerpc/dscr/Makefile @@ -1,6 +1,6 @@ PROGS := dscr_default_test dscr_explicit_test dscr_user_test \ dscr_inherit_test dscr_inherit_exec_test \ -dscr_sysfs_test +dscr_sysfs_test dscr_sysfs_thread_test CFLAGS := $(CFLAGS) -lpthread diff --git a/tools/testing/selftests/powerpc/dscr/dscr_sysfs_thread_test.c b/tools/testing/selftests/powerpc/dscr/dscr_sysfs_thread_test.c new file mode 100644 index 000..95a4304 --- /dev/null +++ b/tools/testing/selftests/powerpc/dscr/dscr_sysfs_thread_test.c @@ -0,0 +1,114 @@ +/* + * POWER Data Stream Control Register (DSCR) sysfs thread test + * + * This test updates the system wide DSCR default value through + * sysfs interface which should then update all the CPU specific + * DSCR default values which must also be then visible to threads + * executing on individual CPUs on the system. + * + * Copyright (C) 2015 Anshuman Khandual , IBM + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. 
+ */ +#define _GNU_SOURCE +#include "dscr.h" + +static pthread_mutex_t lock; /* Pthread lock */ +static cpu_set_t cpuset; /* Thread cpu set */ +static int target; /* Thread target cpu */ +static unsigned long *result; /* Thread exit status array */ + +static void *test_thread_dscr(void *in) +{ + unsigned long cur_dscr, cur_dscr_usr; + unsigned long val = (unsigned long)in; + + pthread_mutex_lock(&lock); + target++; + if (target == sysconf(_SC_NPROCESSORS_ONLN)) + target = 0; + CPU_ZERO(&cpuset); + CPU_SET(target, &cpuset); + + if (sched_setaffinity(0, sizeof(cpuset), &cpuset)) { + perror("sched_settarget_cpu() failed\n"); + pthread_mutex_unlock(&lock); + result[target] = 1; + pthread_exit(&result[target]); + } + pthread_mutex_unlock(&lock); + + cur_dscr = get_dscr(); + cur_dscr_usr = get_dscr_usr(); + + if (val != cur_dscr) { + printf("[cpu %d] Kernel DSCR should be %ld but is %ld\n", + sched_getcpu(), val, cur_dscr); + result[target] = 1; + pthread_exit(&result[target]); + } + + if (val != cur_dscr_usr) { + printf("[cpu %d] User DSCR should be %ld but is %ld\n", + sched_getcpu(), val, cur_dscr_usr); + result[target] = 1; + pthread_exit(&result[target]); + } + result[target] = 0; + pthread_exit(&result[target]); +} + +static int check_cpu_dscr_thread(unsigned long val) +{ + pthread_t threads[sysconf(_SC_NPROCESSORS_ONLN)]; + unsigned long *status[sysconf(_SC_NPROCESSORS_ONLN)]; + unsigned long i; + + for (i = 0; i < sysconf(_SC_NPROCESSORS_ONLN); i++) { + if (pthread_create(&threads[i], NULL, + test_thread_dscr, (void *)val)) { + perror("pthread_create() failed\n"); + return 1; + } + } + + for (i = 0; i < sysconf(_SC_NPROCESSORS_ONLN); i++) { + if (pthread_join(threads[i], (void **)&(status[i]))) { + perror("pthread_join() failed\n"); + return 1; + } + + if (*status[i]) { + printf(" %ldth thread join failed with %ld\n", + i, *status[i]); + return 1; + } + } + return 0; +} + +int test_body(void) +{ + int i, j; + + result = malloc(sizeof(unsigned long) * sysconf(_SC_NPROCESSORS_ONLN)); + pthread_mutex_init(&lock, NULL); + target = 0; + for (i = 0; i < COUNT; i++) { +
[PATCH V2 11/12] selftests, powerpc: Add test for all DSCR sysfs interfaces
This test continuously updates the system wide DSCR default value in the sysfs interface and makes sure that the same is reflected across all the sysfs interfaces for each individual CPUs present on the system. Signed-off-by: Anshuman Khandual --- tools/testing/selftests/powerpc/dscr/Makefile | 3 +- .../selftests/powerpc/dscr/dscr_sysfs_test.c | 89 ++ 2 files changed, 91 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/powerpc/dscr/dscr_sysfs_test.c diff --git a/tools/testing/selftests/powerpc/dscr/Makefile b/tools/testing/selftests/powerpc/dscr/Makefile index 4e84309..fada526 100644 --- a/tools/testing/selftests/powerpc/dscr/Makefile +++ b/tools/testing/selftests/powerpc/dscr/Makefile @@ -1,5 +1,6 @@ PROGS := dscr_default_test dscr_explicit_test dscr_user_test \ -dscr_inherit_test dscr_inherit_exec_test +dscr_inherit_test dscr_inherit_exec_test \ +dscr_sysfs_test CFLAGS := $(CFLAGS) -lpthread diff --git a/tools/testing/selftests/powerpc/dscr/dscr_sysfs_test.c b/tools/testing/selftests/powerpc/dscr/dscr_sysfs_test.c new file mode 100644 index 000..3d11439 --- /dev/null +++ b/tools/testing/selftests/powerpc/dscr/dscr_sysfs_test.c @@ -0,0 +1,89 @@ +/* + * POWER Data Stream Control Register (DSCR) sysfs interface test + * + * This test updates to system wide DSCR default through the sysfs interface + * and then verifies that all the CPU specific DSCR defaults are updated as + * well verified from their sysfs interfaces. + * + * Copyright (C) 2015 Anshuman Khandual , IBM + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ +#include "dscr.h" + +static int check_cpu_dscr_default(char *file, unsigned long val) +{ + char buf[10]; + int fd, rc; + + fd = open(file, O_RDWR); + if (fd == -1) { + perror("open() failed\n"); + return 1; + } + + rc = read(fd, buf, sizeof(buf)); + if (rc == -1) { + perror("read() failed\n"); + return 1; + } + close(fd); + + buf[rc] = '\0'; + if (strtol(buf, NULL, 16) != val) { + printf("DSCR match failed: %ld (system) %ld (cpu)\n", + val, strtol(buf, NULL, 16)); + return 1; + } + return 0; +} + +static int check_all_cpu_dscr_defaults(unsigned long val) +{ + DIR *sysfs; + struct dirent *dp; + char file[LEN_MAX]; + + sysfs = opendir(CPU_PATH); + if (!sysfs) { + perror("opendir() failed\n"); + return 1; + } + + while ((dp = readdir(sysfs))) { + if (!(dp->d_type & DT_DIR)) + continue; + if (!strcmp(dp->d_name, "cpuidle")) + continue; + if (!strstr(dp->d_name, "cpu")) + continue; + + sprintf(file, "%s%s/dscr", CPU_PATH, dp->d_name); + if (check_cpu_dscr_default(file, val)) + return 1; + } + closedir(sysfs); + return 0; +} + +int test_body(void) +{ + int i, j; + + for (i = 0; i < COUNT; i++) { + for (j = 0; j < DSCR_MAX; j++) { + set_default_dscr(j); + if (check_all_cpu_dscr_defaults(j)) + return 1; + } + } + return 0; +} + +int main(int argc, char *argv[]) +{ + return test_harness(test_body, "dscr_sysfs_test"); +} -- 1.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V2 04/12] powerpc, dscr: Added some in-code documentation
This patch adds some in-code documentation to the DSCR related code to make it more readable without having any functional change to it. Signed-off-by: Anshuman Khandual --- arch/powerpc/include/asm/processor.h | 9 + arch/powerpc/kernel/sysfs.c | 38 2 files changed, 47 insertions(+) diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h index bf117d8..28ded5d 100644 --- a/arch/powerpc/include/asm/processor.h +++ b/arch/powerpc/include/asm/processor.h @@ -295,6 +295,15 @@ struct thread_struct { #endif #ifdef CONFIG_PPC64 unsigned long dscr; + /* +* This member element dscr_inherit indicates that the process +* has explicitly attempted and changed the DSCR register value +* for itself. Hence kernel wont use the default CPU DSCR value +* contained in the PACA structure anymore during process context +* switch. Once this variable is set, this behaviour will also be +* inherited to all the children of this process from that point +* onwards. +*/ int dscr_inherit; unsigned long ppr;/* used to save/restore SMT priority */ #endif diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c index fa1fd8a..692873b 100644 --- a/arch/powerpc/kernel/sysfs.c +++ b/arch/powerpc/kernel/sysfs.c @@ -496,13 +496,34 @@ static DEVICE_ATTR(spurr, 0400, show_spurr, NULL); static DEVICE_ATTR(purr, 0400, show_purr, store_purr); static DEVICE_ATTR(pir, 0400, show_pir, NULL); +/* + * This is the system wide DSCR register default value. Any + * change to this default value through the sysfs interface + * will update all per cpu DSCR default values across the + * system stored in their respective PACA structures. + */ static unsigned long dscr_default; +/** + * read_dscr() - Fetch the cpu specific DSCR default + * @val: Returned cpu specific DSCR default value + * + * This function returns the per cpu DSCR default value + * for any cpu which is contained in it's PACA structure. + */ static void read_dscr(void *val) { *(unsigned long *)val = get_paca()->dscr_default; } + +/** + * write_dscr() - Update the cpu specific DSCR default + * @val: New cpu specific DSCR default value to update + * + * This function updates the per cpu DSCR default value + * for any cpu which is contained in it's PACA structure. + */ static void write_dscr(void *val) { get_paca()->dscr_default = *(unsigned long *)val; @@ -520,12 +541,29 @@ static void add_write_permission_dev_attr(struct device_attribute *attr) attr->attr.mode |= 0200; } +/** + * show_dscr_default() - Fetch the system wide DSCR default + * @dev: Device structure + * @attr: Device attribute structure + * @buf: Interface buffer + * + * This function returns the system wide DSCR default value. + */ static ssize_t show_dscr_default(struct device *dev, struct device_attribute *attr, char *buf) { return sprintf(buf, "%lx\n", dscr_default); } +/** + * store_dscr_default() - Update the system wide DSCR default + * @dev: Device structure + * @attr: Device attribute structure + * @buf: Interface buffer + * @count: Size of the update + * + * This function updates the system wide DSCR default value. + */ static ssize_t __used store_dscr_default(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) -- 1.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
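For context, the store path that the new comments describe (a sysfs write updating the system wide default and fanning it out to every CPU's PACA through write_dscr()) has roughly the shape sketched below. This is a condensed sketch in the context of arch/powerpc/kernel/sysfs.c, not a verbatim copy of the in-tree store_dscr_default().

static ssize_t __used store_dscr_default(struct device *dev,
		struct device_attribute *attr, const char *buf,
		size_t count)
{
	unsigned long val;

	if (sscanf(buf, "%lx", &val) != 1)
		return -EINVAL;

	dscr_default = val;			/* system wide default */
	on_each_cpu(write_dscr, &val, 1);	/* update every PACA copy */

	return count;
}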
[PATCH V2 10/12] selftests, powerpc: Add test for DSCR inheritance across fork & exec
This patch adds a test case to verify that the changed DSCR value inside any process would be inherited to it's child across the fork and exec system call. Signed-off-by: Anshuman Khandual --- tools/testing/selftests/powerpc/dscr/Makefile | 2 +- .../powerpc/dscr/dscr_inherit_exec_test.c | 118 + 2 files changed, 119 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/powerpc/dscr/dscr_inherit_exec_test.c diff --git a/tools/testing/selftests/powerpc/dscr/Makefile b/tools/testing/selftests/powerpc/dscr/Makefile index 81239e2..4e84309 100644 --- a/tools/testing/selftests/powerpc/dscr/Makefile +++ b/tools/testing/selftests/powerpc/dscr/Makefile @@ -1,5 +1,5 @@ PROGS := dscr_default_test dscr_explicit_test dscr_user_test \ -dscr_inherit_test +dscr_inherit_test dscr_inherit_exec_test CFLAGS := $(CFLAGS) -lpthread diff --git a/tools/testing/selftests/powerpc/dscr/dscr_inherit_exec_test.c b/tools/testing/selftests/powerpc/dscr/dscr_inherit_exec_test.c new file mode 100644 index 000..5ff3849 --- /dev/null +++ b/tools/testing/selftests/powerpc/dscr/dscr_inherit_exec_test.c @@ -0,0 +1,118 @@ +/* + * POWER Data Stream Control Register (DSCR) fork exec test + * + * This testcase modifies the DSCR using mtspr, forks & execs and + * verifies that the child is using the changed DSCR using mfspr. + * + * When using the privilege state SPR, the instructions such as + * mfspr or mtspr are priviledged and the kernel emulates them + * for us. Instructions using problem state SPR can be exuecuted + * directly without any emulation if the HW supports them. Else + * they also get emulated by the kernel. + * + * Copyright (C) 2012 Anton Blanchard , IBM + * Copyright (C) 2015 Anshuman Khandual , IBM + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ +#include "dscr.h" + +static char prog[LEN_MAX]; + +static void do_exec(unsigned long parent_dscr) +{ + unsigned long cur_dscr, cur_dscr_usr; + + cur_dscr = get_dscr(); + cur_dscr_usr = get_dscr_usr(); + + if (cur_dscr != parent_dscr) { + fprintf(stderr, "Parent DSCR %ld was not inherited " + "over exec (kernel value)\n", parent_dscr); + exit(1); + } + + if (cur_dscr_usr != parent_dscr) { + fprintf(stderr, "Parent DSCR %ld was not inherited " + "over exec (user value)\n", parent_dscr); + exit(1); + } + exit(0); +} + +int test_body(void) +{ + unsigned long i, dscr = 0; + pid_t pid; + + for (i = 0; i < COUNT; i++) { + dscr++; + if (dscr > DSCR_MAX) + dscr = 0; + + if (dscr == get_default_dscr()) + continue; + + if (i % 2 == 0) + set_dscr_usr(dscr); + else + set_dscr(dscr); + + /* +* XXX: Force a context switch out so that DSCR +* current value is copied into the thread struct +* which is required for the child to inherit the +* changed value. 
+*/ + sleep(1); + + pid = fork(); + if (pid == -1) { + perror("fork() failed\n"); + exit(1); + } else if (pid) { + int status; + + if (waitpid(pid, &status, 0) == -1) { + perror("waitpid() failed\n"); + exit(1); + } + + if (!WIFEXITED(status)) { + fprintf(stderr, "Child didn't exit cleanly\n"); + exit(1); + } + + if (WEXITSTATUS(status) != 0) { + fprintf(stderr, "Child didn't exit cleanly\n"); + return 1; + } + } else { + char dscr_str[16]; + + sprintf(dscr_str, "%ld", dscr); + execlp(prog, prog, "exec", dscr_str, NULL); + exit(1); + } + } + return 0; +} + +int main(int argc, char *argv[]) +{ + if (argc == 3 && !strcmp(argv[1], "exec")) { + unsigned long parent_dscr; + + parent_dscr = atoi(argv[2]); + do_exec(parent_dscr); + } else if (argc != 1) { +
[PATCH V2 09/12] selftests, powerpc: Add test for DSCR value inheritance across fork
This patch adds a test to verify that the changed DSCR value inside any process would be inherited to it's child process across the fork system call. Signed-off-by: Anshuman Khandual --- tools/testing/selftests/powerpc/dscr/Makefile | 3 +- .../selftests/powerpc/dscr/dscr_inherit_test.c | 96 ++ 2 files changed, 98 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/powerpc/dscr/dscr_inherit_test.c diff --git a/tools/testing/selftests/powerpc/dscr/Makefile b/tools/testing/selftests/powerpc/dscr/Makefile index ae865d8..81239e2 100644 --- a/tools/testing/selftests/powerpc/dscr/Makefile +++ b/tools/testing/selftests/powerpc/dscr/Makefile @@ -1,4 +1,5 @@ -PROGS := dscr_default_test dscr_explicit_test dscr_user_test +PROGS := dscr_default_test dscr_explicit_test dscr_user_test \ +dscr_inherit_test CFLAGS := $(CFLAGS) -lpthread diff --git a/tools/testing/selftests/powerpc/dscr/dscr_inherit_test.c b/tools/testing/selftests/powerpc/dscr/dscr_inherit_test.c new file mode 100644 index 000..f1add77 --- /dev/null +++ b/tools/testing/selftests/powerpc/dscr/dscr_inherit_test.c @@ -0,0 +1,96 @@ +/* + * POWER Data Stream Control Register (DSCR) fork test + * + * This testcase modifies the DSCR using mtspr, forks and then + * verifies that the child process has the correct changed DSCR + * value using mfspr. + * + * When using the privilege state SPR, the instructions such as + * mfspr or mtspr are priviledged and the kernel emulates them + * for us. Instructions using problem state SPR can be exuecuted + * directly without any emulation if the HW supports them. Else + * they also get emulated by the kernel. + * + * Copyright (C) 2012 Anton Blanchard , IBM + * Copyright (C) 2015 Anshuman Khandual , IBM + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ +#include "dscr.h" + +int test_body(void) +{ + unsigned long i, dscr = 0; + pid_t pid; + + srand(getpid()); + set_dscr(dscr); + + for (i = 0; i < COUNT; i++) { + unsigned long cur_dscr, cur_dscr_usr; + + dscr++; + if (dscr > DSCR_MAX) + dscr = 0; + + if (i % 2 == 0) + set_dscr_usr(dscr); + else + set_dscr(dscr); + + /* +* XXX: Force a context switch out so that DSCR +* current value is copied into the thread struct +* which is required for the child to inherit the +* changed value. 
+*/ + sleep(1); + + pid = fork(); + if (pid == -1) { + perror("fork() failed\n"); + exit(1); + } else if (pid) { + int status; + + if (waitpid(pid, &status, 0) == -1) { + perror("waitpid() failed\n"); + exit(1); + } + + if (!WIFEXITED(status)) { + fprintf(stderr, "Child didn't exit cleanly\n"); + exit(1); + } + + if (WEXITSTATUS(status) != 0) { + fprintf(stderr, "Child didn't exit cleanly\n"); + return 1; + } + } else { + cur_dscr = get_dscr(); + if (cur_dscr != dscr) { + fprintf(stderr, "Kernel DSCR should be %ld " + "but is %ld\n", dscr, cur_dscr); + exit(1); + } + + cur_dscr_usr = get_dscr_usr(); + if (cur_dscr_usr != dscr) { + fprintf(stderr, "User DSCR should be %ld " + "but is %ld\n", dscr, cur_dscr_usr); + exit(1); + } + exit(0); + } + } + return 0; +} + +int main(int argc, char *argv[]) +{ + return test_harness(test_body, "dscr_inherit_test"); +} -- 1.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V2 06/12] selftests, powerpc: Add test for system wide DSCR default
This patch adds a test case for the system wide DSCR default value, which when changed through it's sysfs interface must be visible to all threads reading DSCR either through the privilege state SPR or the problem state SPR. The DSCR value change should be immediate as well. Signed-off-by: Anshuman Khandual --- tools/testing/selftests/powerpc/Makefile | 2 +- tools/testing/selftests/powerpc/dscr/Makefile | 17 +++ tools/testing/selftests/powerpc/dscr/dscr.h| 120 .../selftests/powerpc/dscr/dscr_default_test.c | 121 + 4 files changed, 259 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/powerpc/dscr/Makefile create mode 100644 tools/testing/selftests/powerpc/dscr/dscr.h create mode 100644 tools/testing/selftests/powerpc/dscr/dscr_default_test.c diff --git a/tools/testing/selftests/powerpc/Makefile b/tools/testing/selftests/powerpc/Makefile index f6ff90a..1318883 100644 --- a/tools/testing/selftests/powerpc/Makefile +++ b/tools/testing/selftests/powerpc/Makefile @@ -13,7 +13,7 @@ CFLAGS := -Wall -O2 -flto -Wall -Werror -DGIT_VERSION='"$(GIT_VERSION)"' -I$(CUR export CC CFLAGS -TARGETS = pmu copyloops mm tm primitives +TARGETS = pmu copyloops mm tm primitives dscr endif diff --git a/tools/testing/selftests/powerpc/dscr/Makefile b/tools/testing/selftests/powerpc/dscr/Makefile new file mode 100644 index 000..0aa90ab --- /dev/null +++ b/tools/testing/selftests/powerpc/dscr/Makefile @@ -0,0 +1,17 @@ +PROGS := dscr_default_test + +CFLAGS := $(CFLAGS) -lpthread + +all: $(PROGS) + +$(PROGS): ../harness.c + +run_tests: all + @-for PROG in $(PROGS); do \ + ./$$PROG; \ + done; + +clean: + rm -f $(PROGS) *.o + +.PHONY: all run_tests clean diff --git a/tools/testing/selftests/powerpc/dscr/dscr.h b/tools/testing/selftests/powerpc/dscr/dscr.h new file mode 100644 index 000..2e6535b --- /dev/null +++ b/tools/testing/selftests/powerpc/dscr/dscr.h @@ -0,0 +1,120 @@ +/* + * POWER Data Stream Control Register (DSCR) + * + * This header file contains helper functions and macros + * required for all the DSCR related test cases. + * + * Copyright (C) 2012 Anton Blanchard , IBM + * Copyright (C) 2015 Anshuman Khandual , IBM + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. 
+ */ +#ifndef _SELFTESTS_POWERPC_DSCR_DSCR_H +#define _SELFTESTS_POWERPC_DSCR_DSCR_H + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "utils.h" + +#define SPRN_DSCR 0x11/* Privilege state SPR */ +#define SPRN_DSCR_USR 0x03/* Problem state SPR */ +#define THREADS100 /* Max threads */ +#define COUNT 100 /* Max iterations */ +#define DSCR_MAX 16 /* Max DSCR value */ +#define LEN_MAX100 /* Max name length */ + +#define DSCR_DEFAULT "/sys/devices/system/cpu/dscr_default" +#define CPU_PATH "/sys/devices/system/cpu/" + +#define rmb() asm volatile("lwsync":::"memory") +#define wmb() asm volatile("lwsync":::"memory") + +#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x)) + +/* Prilvilege state DSCR access */ +inline unsigned long get_dscr(void) +{ + unsigned long ret; + + asm volatile("mfspr %0,%1" : "=r" (ret): "i" (SPRN_DSCR)); + + return ret; +} + +inline void set_dscr(unsigned long val) +{ + asm volatile("mtspr %1,%0" : : "r" (val), "i" (SPRN_DSCR)); +} + +/* Problem state DSCR access */ +inline unsigned long get_dscr_usr(void) +{ + unsigned long ret; + + asm volatile("mfspr %0,%1" : "=r" (ret): "i" (SPRN_DSCR_USR)); + + return ret; +} + +inline void set_dscr_usr(unsigned long val) +{ + asm volatile("mtspr %1,%0" : : "r" (val), "i" (SPRN_DSCR_USR)); +} + +/* Default DSCR access */ +unsigned long get_default_dscr(void) +{ + int fd = -1; + char buf[16]; + unsigned long val; + + if (fd == -1) { + fd = open(DSCR_DEFAULT, O_RDONLY); + if (fd == -1) { + perror("open() failed\n"); + exit(1); + } + } + memset(buf, 0, sizeof(buf)); + lseek(fd, 0, SEEK_SET); + read(fd, buf, sizeof(buf)); + sscanf(buf, "%lx", &val); + close(fd); + return val; +} + +void set_default_dscr(unsigned long val) +{ + int fd = -1; + char buf[16]; + + if (fd == -1) { + fd = open(DSCR_DEFAULT, O_RDWR); +
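As a usage illustration only, a minimal test body built on the dscr.h helpers above might look like the sketch below. This is not the dscr_default_test.c added by this patch (which, per the commit message, also has to prove that the new default is visible to all threads through both SPR numbers); it assumes the headers dscr.h already pulls in and needs root to write the sysfs default.

#include "dscr.h"

int test_body(void)
{
	unsigned long dscr;

	for (dscr = 0; dscr < DSCR_MAX; dscr++) {
		/* Write the system wide default through sysfs, then check
		 * that this thread sees it through both SPR numbers. */
		set_default_dscr(dscr);

		if (get_dscr() != dscr || get_dscr_usr() != dscr) {
			fprintf(stderr, "DSCR default %lu not visible\n", dscr);
			return 1;
		}
	}
	return 0;
}

int main(int argc, char *argv[])
{
	return test_harness(test_body, "dscr_default_sketch");
}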
[PATCH V2 07/12] selftests, powerpc: Add test for explicitly changing DSCR value
This patch adds a test which modifies the DSCR using mtspr instruction and verifies the change using mfspr instruction. It uses both the privilege state SPR as well as the problem state SPR for the purpose. Signed-off-by: Anshuman Khandual --- tools/testing/selftests/powerpc/dscr/Makefile | 2 +- .../selftests/powerpc/dscr/dscr_explicit_test.c| 72 ++ 2 files changed, 73 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/powerpc/dscr/dscr_explicit_test.c diff --git a/tools/testing/selftests/powerpc/dscr/Makefile b/tools/testing/selftests/powerpc/dscr/Makefile index 0aa90ab..aede453 100644 --- a/tools/testing/selftests/powerpc/dscr/Makefile +++ b/tools/testing/selftests/powerpc/dscr/Makefile @@ -1,4 +1,4 @@ -PROGS := dscr_default_test +PROGS := dscr_default_test dscr_explicit_test CFLAGS := $(CFLAGS) -lpthread diff --git a/tools/testing/selftests/powerpc/dscr/dscr_explicit_test.c b/tools/testing/selftests/powerpc/dscr/dscr_explicit_test.c new file mode 100644 index 000..5c0b7c1 --- /dev/null +++ b/tools/testing/selftests/powerpc/dscr/dscr_explicit_test.c @@ -0,0 +1,72 @@ +/* + * POWER Data Stream Control Register (DSCR) explicit test + * + * This test modifies the DSCR value using mtspr instruction and + * verifies the change with mfspr instruction. It uses both the + * privilege state SPR and the problem state SPR for this purpose. + * + * When using the privilege state SPR, the instructions such as + * mfspr or mtspr are priviledged and the kernel emulates them + * for us. Instructions using problem state SPR can be exuecuted + * directly without any emulation if the HW supports them. Else + * they also get emulated by the kernel. + * + * Copyright (C) 2012 Anton Blanchard , IBM + * Copyright (C) 2015 Anshuman Khandual , IBM + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ +#include "dscr.h" + +int test_body(void) +{ + unsigned long i, dscr = 0; + + srand(getpid()); + set_dscr(dscr); + + for (i = 0; i < COUNT; i++) { + unsigned long cur_dscr, cur_dscr_usr; + double ret = uniform_deviate(rand()); + + if (ret < 0.001) { + dscr++; + if (dscr > DSCR_MAX) + dscr = 0; + + set_dscr(dscr); + } + + cur_dscr = get_dscr(); + if (cur_dscr != dscr) { + fprintf(stderr, "Kernel DSCR should be %ld but " + "is %ld\n", dscr, cur_dscr); + return 1; + } + + ret = uniform_deviate(rand()); + if (ret < 0.001) { + dscr++; + if (dscr > DSCR_MAX) + dscr = 0; + + set_dscr_usr(dscr); + } + + cur_dscr_usr = get_dscr_usr(); + if (cur_dscr_usr != dscr) { + fprintf(stderr, "User DSCR should be %ld but " + "is %ld\n", dscr, cur_dscr_usr); + return 1; + } + } + return 0; +} + +int main(int argc, char *argv[]) +{ + return test_harness(test_body, "dscr_explicit_test"); +} -- 1.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V2 08/12] selftests, powerpc: Add test for DSCR SPR numbers
This patch adds a test which verifies that the DSCR privilege and problem state SPR read & write accesses while making sure that the results are always the same irrespective of which SPR number is being used. Signed-off-by: Anshuman Khandual --- tools/testing/selftests/powerpc/dscr/Makefile | 2 +- .../selftests/powerpc/dscr/dscr_user_test.c| 62 ++ 2 files changed, 63 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/powerpc/dscr/dscr_user_test.c diff --git a/tools/testing/selftests/powerpc/dscr/Makefile b/tools/testing/selftests/powerpc/dscr/Makefile index aede453..ae865d8 100644 --- a/tools/testing/selftests/powerpc/dscr/Makefile +++ b/tools/testing/selftests/powerpc/dscr/Makefile @@ -1,4 +1,4 @@ -PROGS := dscr_default_test dscr_explicit_test +PROGS := dscr_default_test dscr_explicit_test dscr_user_test CFLAGS := $(CFLAGS) -lpthread diff --git a/tools/testing/selftests/powerpc/dscr/dscr_user_test.c b/tools/testing/selftests/powerpc/dscr/dscr_user_test.c new file mode 100644 index 000..f25d68e --- /dev/null +++ b/tools/testing/selftests/powerpc/dscr/dscr_user_test.c @@ -0,0 +1,62 @@ +/* + * POWER Data Stream Control Register (DSCR) SPR test + * + * This test modifies the DSCR value through both the SPR number + * based mtspr instruction and then makes sure that the same is + * reflected through mfspr instruction using either of the SPR + * numbers. + * + * When using the privilege state SPR, the instructions such as + * mfspr or mtspr are priviledged and the kernel emulates them + * for us. Instructions using problem state SPR can be exuecuted + * directly without any emulation if the HW supports them. Else + * they also get emulated by the kernel. + * + * Copyright (C) 2013 Anton Blanchard , IBM + * Copyright (C) 2015 Anshuman Khandual , IBM + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ +#include "dscr.h" + +static int check_dscr(char *str) +{ + unsigned long cur_dscr, cur_dscr_usr; + + cur_dscr = get_dscr(); + cur_dscr_usr = get_dscr_usr(); + if (cur_dscr != cur_dscr_usr) { + printf("%s set, kernel get %lx != user get %lx\n", + str, cur_dscr, cur_dscr_usr); + return 1; + } + return 0; +} + +int test_body(void) +{ + int i; + + check_dscr(""); + + for (i = 0; i < COUNT; i++) { + set_dscr(i); + if (check_dscr("kernel")) + return 1; + } + + for (i = 0; i < COUNT; i++) { + set_dscr_usr(i); + if (check_dscr("user")) + return 1; + } + return 0; +} + +int main(int argc, char *argv[]) +{ + return test_harness(test_body, "dscr_user_test"); +} -- 1.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V2 05/12] documentation, powerpc: Add documentation for DSCR support
This patch adds a new documentation file explaining the DSCR support on powerpc platforms. This explains DSCR related data structure, code paths and also available user interfaces. Any further functional changes to the DSCR support in the kernel should definitely update the documentation here. Signed-off-by: Anshuman Khandual --- Documentation/powerpc/00-INDEX | 2 + Documentation/powerpc/dscr.txt | 83 ++ 2 files changed, 85 insertions(+) create mode 100644 Documentation/powerpc/dscr.txt diff --git a/Documentation/powerpc/00-INDEX b/Documentation/powerpc/00-INDEX index 6fd0e8b..9dc845c 100644 --- a/Documentation/powerpc/00-INDEX +++ b/Documentation/powerpc/00-INDEX @@ -30,3 +30,5 @@ ptrace.txt - Information on the ptrace interfaces for hardware debug registers. transactional_memory.txt - Overview of the Power8 transactional memory support. +dscr.txt + - Overview DSCR (Data Stream Control Register) support. diff --git a/Documentation/powerpc/dscr.txt b/Documentation/powerpc/dscr.txt new file mode 100644 index 000..1ff4400 --- /dev/null +++ b/Documentation/powerpc/dscr.txt @@ -0,0 +1,83 @@ + DSCR (Data Stream Control Register) + + +DSCR register in powerpc allows user to have some control of prefetch of data +stream in the processor. Please refer to the ISA documents or related manual +for more detailed information regarding how to use this DSCR to attain this +control of the pefetches . This document here provides an overview of kernel +support for DSCR, related kernel objects, it's functionalities and exported +user interface. + +(A) Data Structures: + + (1) thread_struct: + dscr/* Thread DSCR value */ + dscr_inherit/* Thread has changed default DSCR */ + + (2) PACA: + dscr_default/* per-CPU DSCR default value */ + + (3) sysfs.c: + dscr_default/* System DSCR default value */ + +(B) Scheduler Changes: + + Scheduler will write the per-CPU DSCR default which is stored in the + CPU's PACA value into the register if the thread has dscr_inherit value + cleared which means that it has not changed the default DSCR till now. + If the dscr_inherit value is set which means that it has changed the + default DSCR value, scheduler will write the changed value which will + now be contained in thread struct's dscr into the register instead of + the per-CPU default PACA based DSCR value. + + NOTE: Please note here that the system wide global DSCR value never + gets used directly in the scheduler process context switch at all. + +(C) SYSFS Interface: + + Global DSCR default:/sys/devices/system/cpu/dscr_default + CPU specific DSCR default: /sys/devices/system/cpu/cpuN/dscr + + Changing the global DSCR default in the sysfs will change all the CPU + specific DSCR defaults immediately in their PACA structures. Again if + the current process has the dscr_inherit clear, it also writes the new + value into every CPU's DSCR register right away and updates the current + thread's DSCR value as well. + + Changing the CPU specif DSCR default value in the sysfs does exactly + the same thing as above but unlike the global one above, it just changes + stuff for that particular CPU instead for all the CPUs on the system. + +(D) User Space Instructions: + + The DSCR register can be accessed in the user space using any of these + two SPR numbers available for that purpose. 
+ + (1) Problem state SPR: 0x03 (Un-privileged, POWER8 only) + (2) Privileged state SPR: 0x11 (Privileged) + + Accessing DSCR through privileged SPR number (0x11) from user space + works, as it is emulated following an illegal instruction exception + inside the kernel. Both mfspr and mtspr instructions are emulated. + + Accessing DSCR through user level SPR (0x03) from user space will first + create a facility unavailable exception. Inside this exception handler + all mfspr instruction based read attempts will get emulated and returned + whereas the first mtspr instruction based write attempt will enable + the DSCR facility for the next time around (both for read and write) by + setting the DSCR facility bit in the FSCR register. + +(E) Specifics about 'dscr_inherit': + + The thread struct element 'dscr_inherit' represents whether the thread + in question has attempted and changed the DSCR itself using any of the + following methods. This element signifies whether the thread wants to + use the CPU default DSCR value or its own changed DSCR value in the + kernel. + + (1) mtspr instruction (SPR number 0x03)