Re: [PATCH] locking/qrwlock: Fix ordering in queued_write_lock_slowpath

2021-04-15 Thread Steve Capper



Hiya,

On 15/04/2021 17:45, Will Deacon wrote:

On Thu, Apr 15, 2021 at 04:26:46PM +, Ali Saidi wrote:


On Thu, 15 Apr 2021 16:02:29 +0100, Will Deacon wrote:

On Thu, Apr 15, 2021 at 02:25:52PM +, Ali Saidi wrote:

While this code is executed with the wait_lock held, a reader can
acquire the lock without holding wait_lock.  The writer side loops
checking the value with the atomic_cond_read_acquire(), but only truly
acquires the lock when the compare-and-exchange completes
successfully, which isn’t ordered. The other atomic operations from this
point are release-ordered and thus reads after the lock acquisition can
be completed before the lock is truly acquired which violates the
guarantees the lock should be making.


I think it would be worth spelling this out with an example. The issue
appears to be a concurrent reader in interrupt context taking and releasing
the lock in the window where the writer has returned from the
atomic_cond_read_acquire() but has not yet performed the cmpxchg(). Loads
can be speculated during this time, but the A-B-A of the lock word
from _QW_WAITING to (_QW_WAITING | _QR_BIAS) and back to _QW_WAITING allows
the atomic_cmpxchg_relaxed() to succeed. Is that right?


You're right. What we're seeing is an A-B-A problem: atomic_cond_read_acquire()
succeeds, and before the cmpxchg completes a reader performs an A-B-A on the
lock word, which allows the core to observe a read that follows the cmpxchg
ahead of the cmpxchg succeeding.

We've seen a problem in epoll where the reader does an xchg while
holding the read lock, but the writer can see a value change out from under it.

Writer                                 | Reader 2
                                       |
ep_scan_ready_list()                   |
|- write_lock_irq()                    |
   |- queued_write_lock_slowpath()     |
      |- atomic_cond_read_acquire()    |
                                       | read_lock_irqsave(&ep->lock, flags);
                                       | chain_epi_lockless()
                                       |     epi->next = xchg(&ep->ovflist, epi);
                                       | read_unlock_irqrestore(&ep->lock, flags);
                                       |
      atomic_cmpxchg_relaxed()         |
   READ_ONCE(ep->ovflist);             |
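
For reference, the tail of the slowpath under discussion looks roughly like
the sketch below (not the literal upstream code; helper names follow
kernel/locking/qrwlock.c). The point is that the cmpxchg that finally takes
the lock needs acquire semantics, otherwise loads such as the READ_ONCE()
above can be satisfied inside the A-B-A window:

	/* Wait for the lock word to contain only _QW_WAITING, then take it. */
	do {
		atomic_cond_read_relaxed(&lock->cnts, VAL == _QW_WAITING);
	} while (atomic_cmpxchg_acquire(&lock->cnts, _QW_WAITING,
					_QW_LOCKED) != _QW_WAITING);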



Please stick this in the commit message, preferably annotated a bit like
Peter's example to show the READ_ONCE() being speculated.



I can confirm that this patch fixes a problem observed in
ep_scan_ready_list(.) whereby ovflist appeared to change when the write
lock was held.

So please feel free to add:
Tested-by: Steve Capper 

Also, I have spent a decent chunk of time looking at the above issue and
went through qrwlock, so FWIW, please feel free to add:
Reviewed-by: Steve Capper 

Cheers,
--
Steve


Re: "arm64/for-next/core" causes boot panic

2019-08-13 Thread Steve Capper
On Tue, Aug 13, 2019 at 03:04:52PM +0100, Steve Capper wrote:
> Hi Will,
> 
> On Tue, Aug 13, 2019 at 01:06:44PM +0100, Will Deacon wrote:
> > [+Steve]
> > 
> > On Tue, Aug 13, 2019 at 11:58:52AM +0100, Will Deacon wrote:
> > > On Tue, Aug 13, 2019 at 10:02:01AM +0100, Will Deacon wrote:
> > > > On Mon, Aug 12, 2019 at 05:51:35PM -0400, Qian Cai wrote:
> > > > > Booting today's linux-next on an arm64 server triggers a panic with
> > > > > CONFIG_KASAN_SW_TAGS=y pointing to this line,
> > > > 
> > > > Is this the only change on top of defconfig? If not, please can you 
> > > > share
> > > > your full .config?
> > > > 
> > > > > kfree()->virt_to_head_page()->compound_head()
> > > > > 
> > > > > unsigned long head = READ_ONCE(page->compound_head);
> > > > > 
> > > > > The bisect so far indicates one of those could be bad,
> > > > 
> > > > I guess that means the issue is reproducible on the arm64 for-next/core
> > > > branch. Once I have your .config, I'll give it a go.
> > > 
> > > FWIW, I've managed to reproduce this using defconfig + SW_TAGS on
> > > for-next/core, so I'll keep investigating.
> 
> I've installed clang-8 and enabled CONFIG_KASAN_SW_TAGS and was able to
> reproduce the problem quite rapidly. Many apologies for missing this
> before in my testing.
> 
> > 
> > Right, hacky diff below seems to resolve this, so I'll split this up into
> > some proper patches as there is more than one bug here.
> > 
> > Thanks,
> > 
> > Will
> > 
> > --->8
> > 
> > diff --git a/arch/arm64/include/asm/memory.h 
> > b/arch/arm64/include/asm/memory.h
> FWIW, this fixed the crashes I experienced, I'll run some additional
> tests.
> 

This works for me with 52-bit VAs + CONFIG_KASAN_SW_TAGS +
CONFIG_DEBUG_VIRTUAL + CONFIG_DEBUG_VM

FWIW:
Tested-by: Steve Capper 

Cheers,
-- 
Steve


Re: "arm64/for-next/core" causes boot panic

2019-08-13 Thread Steve Capper
Hi Will,

On Tue, Aug 13, 2019 at 01:06:44PM +0100, Will Deacon wrote:
> [+Steve]
> 
> On Tue, Aug 13, 2019 at 11:58:52AM +0100, Will Deacon wrote:
> > On Tue, Aug 13, 2019 at 10:02:01AM +0100, Will Deacon wrote:
> > > On Mon, Aug 12, 2019 at 05:51:35PM -0400, Qian Cai wrote:
> > > > Booting today's linux-next on an arm64 server triggers a panic with
> > > > CONFIG_KASAN_SW_TAGS=y pointing to this line,
> > > 
> > > Is this the only change on top of defconfig? If not, please can you share
> > > your full .config?
> > > 
> > > > kfree()->virt_to_head_page()->compound_head()
> > > > 
> > > > unsigned long head = READ_ONCE(page->compound_head);
> > > > 
> > > > The bisect so far indicates one of those could be bad,
> > > 
> > > I guess that means the issue is reproducible on the arm64 for-next/core
> > > branch. Once I have your .config, I'll give it a go.
> > 
> > FWIW, I've managed to reproduce this using defconfig + SW_TAGS on
> > for-next/core, so I'll keep investigating.

I've installed clang-8 and enabled CONFIG_KASAN_SW_TAGS and was able to
reproduce the problem quite rapidly. Many apologies for missing this
before in my testing.

> 
> Right, hacky diff below seems to resolve this, so I'll split this up into
> some proper patches as there is more than one bug here.
> 
> Thanks,
> 
> Will
> 
> --->8
> 
> diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
FWIW, this fixed the crashes I experienced, I'll run some additional
tests.

Cheers,
-- 
Steve


Re: [PATCH V6 3/3] arm64/mm: Enable memory hot remove

2019-06-21 Thread Steve Capper
Hi Anshuman,

On Wed, Jun 19, 2019 at 09:47:40AM +0530, Anshuman Khandual wrote:
> The arch code for hot-remove must tear down portions of the linear map and
> vmemmap corresponding to memory being removed. In both cases the page
> tables mapping these regions must be freed, and when sparse vmemmap is in
> use the memory backing the vmemmap must also be freed.
> 
> This patch adds a new remove_pagetable() helper which can be used to tear
> down either region, and calls it from vmemmap_free() and
> ___remove_pgd_mapping(). The sparse_vmap argument determines whether the
> backing memory will be freed.
> 
> remove_pagetable() makes two distinct passes over the kernel page table.
> In the first pass it unmaps, invalidates applicable TLB cache and frees
> backing memory if required (vmemmap) for each mapped leaf entry. In the
> second pass it looks for empty page table sections whose page table page
> can be unmapped, TLB invalidated and freed.
> 
> While freeing intermediate level page table pages bail out if any of its
> entries are still valid. This can happen for partially filled kernel page
> table either from a previously attempted failed memory hot add or while
> removing an address range which does not span the entire page table page
> range.
> 
> The vmemmap region may share levels of table with the vmalloc region.
> There can be conflicts between hot remove freeing page table pages with
> a concurrent vmalloc() walking the kernel page table. This conflict can
> not just be solved by taking the init_mm ptl because of existing locking
> scheme in vmalloc(). Hence unlike linear mapping, skip freeing page table
> pages while tearing down vmemmap mapping.
> 
> While here update arch_add_memory() to handle __add_pages() failures by
> just unmapping recently added kernel linear mapping. Now enable memory hot
> remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.
> 
> This implementation is overall inspired from kernel page table tear down
> procedure on X86 architecture.
> 
> Acked-by: David Hildenbrand 
> Signed-off-by: Anshuman Khandual 
> ---

FWIW:
Acked-by: Steve Capper 

One minor comment below though.

>  arch/arm64/Kconfig  |   3 +
>  arch/arm64/mm/mmu.c | 290 
> ++--
>  2 files changed, 284 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 6426f48..9375f26 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -270,6 +270,9 @@ config HAVE_GENERIC_GUP
>  config ARCH_ENABLE_MEMORY_HOTPLUG
>   def_bool y
>  
> +config ARCH_ENABLE_MEMORY_HOTREMOVE
> + def_bool y
> +
>  config SMP
>   def_bool y
>  
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 93ed0df..9e80a94 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -733,6 +733,250 @@ int kern_addr_valid(unsigned long addr)
>  
>   return pfn_valid(pte_pfn(pte));
>  }
> +
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +static void free_hotplug_page_range(struct page *page, size_t size)
> +{
> + WARN_ON(!page || PageReserved(page));
> + free_pages((unsigned long)page_address(page), get_order(size));
> +}

We are dealing with a power-of-2 number of pages, so it makes a lot more
sense (to me) to replace the size parameter with an order.

Also, all the callers use known compile-time sizes, so we could just
translate the size parameter as follows and remove any usage of get_order()
(a rough sketch follows the list):
PAGE_SIZE -> 0
PMD_SIZE -> PMD_SHIFT - PAGE_SHIFT
PUD_SIZE -> PUD_SHIFT - PAGE_SHIFT
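
To make that concrete, a sketch of the order-based variant (the parameter name
and the call sites below are illustrative, not a proposed patch):

static void free_hotplug_page_range(struct page *page, unsigned int order)
{
	WARN_ON(!page || PageReserved(page));
	free_pages((unsigned long)page_address(page), order);
}

/* Callers can then pass compile-time orders and drop get_order() entirely: */
free_hotplug_page_range(pte_page(pte), 0);				/* PAGE_SIZE */
free_hotplug_page_range(pmd_page(pmd), PMD_SHIFT - PAGE_SHIFT);	/* PMD_SIZE  */
free_hotplug_page_range(pud_page(pud), PUD_SHIFT - PAGE_SHIFT);	/* PUD_SIZE  */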

Cheers,
-- 
Steve


Re: [PATCH] arm64: hugetlb: Register hugepages during arch init

2018-12-06 Thread Steve Capper
On Tue, Oct 23, 2018 at 06:36:57AM +0530, Allen Pais wrote:
> Add hstate for each supported hugepage size using arch initcall.
> 
> * no hugepage parameters
> 
>   Without hugepage parameters, only a default hugepage size is
>   available for dynamic allocation.  It's different, for example, from
>   x86_64 and sparc64 where all supported hugepage sizes are available.
> 
> * only default_hugepagesz= is specified and set not to HPAGE_SIZE
> 
>   In spite of the fact that default_hugepagesz= is set to a valid
>   hugepage size, it's treated as unsupported and reverted to
>   HPAGE_SIZE.  Such behaviour is also different from x86_64 and
>   sparc64.
> 
> Reviewed-by: Tom Saeger 
> Signed-off-by: Dmitry Klochkov 
> Signed-off-by: Allen Pais 
> ---
>  arch/arm64/mm/hugetlbpage.c | 33 ++---
>  1 file changed, 22 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index f58ea50..28cbc22 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -429,6 +429,27 @@ void huge_ptep_clear_flush(struct vm_area_struct *vma,
>   clear_flush(vma->vm_mm, addr, ptep, pgsize, ncontig);
>  }
>  
> +static void __init add_huge_page_size(unsigned long size)
> +{
> + if (size_to_hstate(size))
> + return;
> +
> + hugetlb_add_hstate(ilog2(size) - PAGE_SHIFT);
> +}
> +
> +static int __init hugetlbpage_init(void)
> +{
> +#ifdef CONFIG_ARM64_4K_PAGES
> + add_huge_page_size(PUD_SIZE);
> +#endif
> + add_huge_page_size(PMD_SIZE * CONT_PMDS);
> + add_huge_page_size(PMD_SIZE);
> + add_huge_page_size(PAGE_SIZE * CONT_PTES);
> +
> + return 0;
> +}
> +arch_initcall(hugetlbpage_init);
> +
>  static __init int setup_hugepagesz(char *opt)
>  {
>   unsigned long ps = memparse(opt, );
> @@ -440,7 +461,7 @@ static __init int setup_hugepagesz(char *opt)
>   case PMD_SIZE * CONT_PMDS:
>   case PMD_SIZE:
>   case PAGE_SIZE * CONT_PTES:
> - hugetlb_add_hstate(ilog2(ps) - PAGE_SHIFT);
> + add_huge_page_size(ps);
>   return 1;
>   }
>  
> @@ -449,13 +470,3 @@ static __init int setup_hugepagesz(char *opt)
>   return 0;
>  }
>  __setup("hugepagesz=", setup_hugepagesz);
> -
> -#ifdef CONFIG_ARM64_64K_PAGES
> -static __init int add_default_hugepagesz(void)
> -{
> - if (size_to_hstate(CONT_PTES * PAGE_SIZE) == NULL)
> - hugetlb_add_hstate(CONT_PTE_SHIFT);
> - return 0;
> -}
> -arch_initcall(add_default_hugepagesz);
> -#endif
> -- 
> 1.8.3.1
> 


Apologies for missing this; I like the idea of having all the hugetlb
sizes accessible right away.

FWIW:
Acked-by: Steve Capper 

Cheers,
-- 
Steve


Re: [PATCH V3 5/5] arm64/mm: Enable HugeTLB migration for contiguous bit HugeTLB pages

2018-11-08 Thread Steve Capper
On Tue, Oct 23, 2018 at 06:32:01PM +0530, Anshuman Khandual wrote:
> Let arm64 subscribe to the previously added framework in which architecture
> can inform whether a given huge page size is supported for migration. This
> just overrides the default function arch_hugetlb_migration_supported() and
> enables migration for all possible HugeTLB page sizes on arm64. With this,
> HugeTLB migration support on arm64 now covers all possible HugeTLB options.
> 
>           CONT PTE    PMD    CONT PMD    PUD
>           --------    ---    --------    ---
> 4K:          64K       2M       32M       1G
> 16K:          2M      32M        1G
> 64K:          2M     512M       16G
> 
> Reviewed-by: Naoya Horiguchi 
> Signed-off-by: Anshuman Khandual 

Reviewed-by: Steve Capper 

> ---
>  arch/arm64/include/asm/hugetlb.h |  5 +
>  arch/arm64/mm/hugetlbpage.c  | 20 
>  2 files changed, 25 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/hugetlb.h 
> b/arch/arm64/include/asm/hugetlb.h
> index e73f685..656f70e 100644
> --- a/arch/arm64/include/asm/hugetlb.h
> +++ b/arch/arm64/include/asm/hugetlb.h
> @@ -20,6 +20,11 @@
>  
>  #include 
>  
> +#ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
> +#define arch_hugetlb_migration_supported arch_hugetlb_migration_supported
> +extern bool arch_hugetlb_migration_supported(struct hstate *h);
> +#endif
> +
>  static inline pte_t huge_ptep_get(pte_t *ptep)
>  {
>   return READ_ONCE(*ptep);
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index 21512ca..f3afdcf 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -27,6 +27,26 @@
>  #include 
>  #include 
>  
> +#ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
> +bool arch_hugetlb_migration_supported(struct hstate *h)
> +{
> + size_t pagesize = huge_page_size(h);
> +
> + switch (pagesize) {
> +#ifdef CONFIG_ARM64_4K_PAGES
> + case PUD_SIZE:
> +#endif
> + case PMD_SIZE:
> + case CONT_PMD_SIZE:
> + case CONT_PTE_SIZE:
> + return true;
> + }
> + pr_warn("%s: unrecognized huge page size 0x%lx\n",
> + __func__, pagesize);
> + return false;
> +}
> +#endif
> +
>  int pmd_huge(pmd_t pmd)
>  {
>   return pmd_val(pmd) && !(pmd_val(pmd) & PMD_TABLE_BIT);
> -- 
> 2.7.4
> 


Re: [PATCH V3 3/5] mm/hugetlb: Enable arch specific huge page size support for migration

2018-11-08 Thread Steve Capper
On Tue, Oct 23, 2018 at 06:31:59PM +0530, Anshuman Khandual wrote:
> Architectures like arm64 have HugeTLB page sizes which are different than
> generic sizes at PMD, PUD, PGD level and implemented via contiguous bits.
> At present these special size HugeTLB pages cannot be identified through
> macros like (PMD|PUD|PGDIR)_SHIFT and hence are chosen not to be migrated.
> 
> Enabling migration support for these special HugeTLB page sizes along with
> the generic ones (PMD|PUD|PGD) would require identifying all of them on a
> given platform. A platform specific hook can precisely enumerate all huge
> page sizes supported for migration. Instead of comparing against standard
> huge page orders let hugetlb_migration_support() function call a platform
> hook arch_hugetlb_migration_support(). Default definition for the platform
> hook maintains existing semantics which checks standard huge page order.
> But an architecture can choose to override the default and provide support
> for a comprehensive set of huge page sizes.
> 
> Reviewed-by: Naoya Horiguchi 
> Signed-off-by: Anshuman Khandual 

Reviewed-by: Steve Capper 

> ---
>  include/linux/hugetlb.h | 15 +--
>  1 file changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 70bcd89..4cc3871 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -493,18 +493,29 @@ static inline pgoff_t basepage_index(struct page *page)
>  extern int dissolve_free_huge_page(struct page *page);
>  extern int dissolve_free_huge_pages(unsigned long start_pfn,
>   unsigned long end_pfn);
> -static inline bool hugepage_migration_supported(struct hstate *h)
> -{
> +
>  #ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
> +#ifndef arch_hugetlb_migration_supported
> +static inline bool arch_hugetlb_migration_supported(struct hstate *h)
> +{
>   if ((huge_page_shift(h) == PMD_SHIFT) ||
>   (huge_page_shift(h) == PUD_SHIFT) ||
>   (huge_page_shift(h) == PGDIR_SHIFT))
>   return true;
>   else
>   return false;
> +}
> +#endif
>  #else
> +static inline bool arch_hugetlb_migration_supported(struct hstate *h)
> +{
>   return false;
> +}
>  #endif
> +
> +static inline bool hugepage_migration_supported(struct hstate *h)
> +{
> + return arch_hugetlb_migration_supported(h);
>  }
>  
>  /*
> -- 
> 2.7.4
> 


Re: [PATCH V3 4/5] arm64/mm: Enable HugeTLB migration

2018-11-08 Thread Steve Capper
On Tue, Oct 23, 2018 at 06:32:00PM +0530, Anshuman Khandual wrote:
> Let arm64 subscribe to generic HugeTLB page migration framework. Right now
> this only works on the following PMD and PUD level HugeTLB page sizes with
> various kernel base page size combinations.
> 
>           CONT PTE    PMD    CONT PMD    PUD
>           --------    ---    --------    ---
> 4K:          NA        2M       NA        1G
> 16K:         NA       32M       NA
> 64K:         NA      512M       NA
> 
> Reviewed-by: Naoya Horiguchi 
> Signed-off-by: Anshuman Khandual 


Reviewed-by: Steve Capper 

> ---
>  arch/arm64/Kconfig | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index a8ae30f..4b3e269 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -1331,6 +1331,10 @@ config SYSVIPC_COMPAT
>   def_bool y
>   depends on COMPAT && SYSVIPC
>  
> +config ARCH_ENABLE_HUGEPAGE_MIGRATION
> + def_bool y
> + depends on HUGETLB_PAGE && MIGRATION
> +
>  menu "Power management options"
>  
>  source "kernel/power/Kconfig"
> -- 
> 2.7.4
> 


Re: [PATCH V3 2/5] mm/hugetlb: Enable PUD level huge page migration

2018-11-08 Thread Steve Capper
On Tue, Oct 23, 2018 at 06:31:58PM +0530, Anshuman Khandual wrote:
> Architectures like arm64 have PUD level HugeTLB pages for certain configs
> (1GB huge page is PUD based on ARM64_4K_PAGES base page size) that can be
> enabled for migration. It can be achieved through checking for PUD_SHIFT
> order based HugeTLB pages during migration.
> 
> Reviewed-by: Naoya Horiguchi 
> Signed-off-by: Anshuman Khandual 

Reviewed-by: Steve Capper 

> ---
>  include/linux/hugetlb.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 1b858d7..70bcd89 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -497,7 +497,8 @@ static inline bool hugepage_migration_supported(struct 
> hstate *h)
>  {
>  #ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
>   if ((huge_page_shift(h) == PMD_SHIFT) ||
> - (huge_page_shift(h) == PGDIR_SHIFT))
> + (huge_page_shift(h) == PUD_SHIFT) ||
> + (huge_page_shift(h) == PGDIR_SHIFT))
>   return true;
>   else
>   return false;
> -- 
> 2.7.4
> 


Re: [PATCH V3 1/5] mm/hugetlb: Distinguish between migratability and movability

2018-11-08 Thread Steve Capper
Hi Anshuman,

On Tue, Oct 23, 2018 at 06:31:57PM +0530, Anshuman Khandual wrote:
> During huge page allocation its migratability is checked to determine if
> it should be placed under movable zones with GFP_HIGHUSER_MOVABLE. But the
> movability aspect of the huge page could depend on factors other than just
> migratability. Movability in itself is a distinct property which should not
> be tied to migratability alone.
> 
> This differentiates these two and implements an enhanced movability check
> which also considers huge page size to determine if it is feasible to be
> placed under a movable zone. At present it just checks for gigantic pages
> but going forward it can incorporate other enhanced checks.
> 
> Reviewed-by: Naoya Horiguchi 
> Suggested-by: Michal Hocko 
> Signed-off-by: Anshuman Khandual 

FWIW:
Reviewed-by: Steve Capper 
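
For illustration, the split described above could look something like the
sketch below; the helper name hugepage_movable_supported() and the exact
structure are my assumption here, not quoted from the patch:

static inline bool hugepage_movable_supported(struct hstate *h)
{
	if (!hugepage_migration_supported(h))
		return false;
	/* Migratable, but gigantic pages are still kept out of ZONE_MOVABLE. */
	if (hstate_is_gigantic(h))
		return false;
	return true;
}

/* Allocation then keys GFP_HIGHUSER_MOVABLE off movability, not migratability. */
static inline gfp_t htlb_alloc_mask(struct hstate *h)
{
	return hugepage_movable_supported(h) ? GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
}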


Re: [PATCH V2] arm64: hwpoison: add VM_FAULT_HWPOISON[_LARGE] handling

2017-03-15 Thread Steve Capper
Hi,
Sorry for replying to this thread late.

On 15 March 2017 at 11:19, Catalin Marinas <catalin.mari...@arm.com> wrote:
> Hi Punit,
>
> Adding David Woods since he seems to have added the arm64-specific
> huge_pte_offset() code.
>
> On Thu, Mar 09, 2017 at 05:46:36PM +, Punit Agrawal wrote:
>> From d5ad3f428e629c80b0f93f2bbdf99b4cae28c9bc Mon Sep 17 00:00:00 2001
>> From: Punit Agrawal <punit.agra...@arm.com>
>> Date: Thu, 9 Mar 2017 16:16:29 +
>> Subject: [PATCH] arm64: hugetlb: Fix huge_pte_offset to return poisoned pmd
>>
>> When memory failure is enabled, a poisoned hugepage PMD is marked as a
>> swap entry. As pmd_present() only checks for VALID and PROT_NONE
>> bits (turned off for swap entries), it causes huge_pte_offset() to
>> return NULL for poisoned PMDs.
>>
>> This behaviour of huge_pte_offset() leads to the error such as below
>> when munmap is called on poisoned hugepages.
>>
>> [  344.165544] mm/pgtable-generic.c:33: bad pmd 00083af00074.
>>
>> Fix huge_pte_offset() to return the poisoned PMD which is then
>> appropriately handled by the generic layer code.
>>
>> Signed-off-by: Punit Agrawal <punit.agra...@arm.com>
>> Cc: Catalin Marinas <catalin.mari...@arm.com>
>> Cc: Steve Capper <steve.cap...@arm.com>
>> ---
>>  arch/arm64/mm/hugetlbpage.c | 11 ++-
>>  1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
>> index e25584d72396..9263f206353c 100644
>> --- a/arch/arm64/mm/hugetlbpage.c
>> +++ b/arch/arm64/mm/hugetlbpage.c
>> @@ -150,8 +150,17 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned 
>> long addr)
>> if (pud_huge(*pud))
>> return (pte_t *)pud;
>> pmd = pmd_offset(pud, addr);
>> +
>> +   /*
>> +* In case of HW Poisoning, a hugepage pmd can contain
>> +* poisoned entries. Poisoned entries are marked as swap
>> +* entries.
>> +*
>> +* For pmds that are not present, check to see if it could be
>> +* a swap entry (!present and !none) before giving up.
>> +*/
>> if (!pmd_present(*pmd))
>> -   return NULL;
>> +   return !pmd_none(*pmd) ? (pte_t *)pmd : NULL;
>
> I'm not sure we need to return NULL here when pmd_none(). If we use
> hugetlb at the pmd level we don't need to allocate a pmd page but just
> fall back to hugetlb_no_page() in hugetlb_fault(). The problem is we
> can't tell what kind of huge page we have when calling
> huge_pte_offset(), so we always rely on huge_pte_alloc(). But there are
> places where huge_pte_none() is checked explicitly and we would never
> return it from huge_pte_get().
>
> Can we improve the generic code to pass the huge page size to
> huge_pte_offset()? Otherwise we make all kind of assumptions/guesses in
> the arch code.

We'll certainly need the huge page size as we are unable to
differentiate between pmd and contiguous pmd for invalid entries too;
and we'll need to return a pointer to the "head" pte_t.
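
To sketch what passing the size down might look like (illustrative only, not a
concrete proposal): the walker would take the size so it can pick the right
level and point at the head entry of a contiguous range even when that entry
is invalid, e.g.

	pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr,
			       unsigned long sz);

	/* inside the arch walker, once sz is known: */
	if (sz == CONT_PMD_SIZE)
		addr &= CONT_PMD_MASK;	/* head entry of the contiguous pmd range */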

>
>>
>> if (pte_cont(pmd_pte(*pmd))) {
>> pmd = pmd_offset(
>
> Given that we can have huge pages at the pud level, we should address
> that as well. The generic huge_pte_offset() doesn't need to since it
> assumes huge pages at the pmd level only. If a pud is not present, you
> can't dereference it to find the pmd, hence returning NULL.
>
> Apart from hw poisoning, I think another use-case for non-present
> pmd/pud entries is is_hugetlb_entry_migration() (see hugetlb_fault()),
> so we need to fix this either way.
>
> We have a discrepancy between the pud_present and pmd_present. The
> latter was modified to fall back on pte_present because of THP which
> does not support puds (last time I checked). So if a pud is poisoned,
> huge_pte_offset thinks it is present and will try to get the pmd it
> points to.
>
> I think we can leave the pud_present() unchanged but fix the
> huge_pte_offset() to check for pud_table() before dereferencing,
> otherwise returning the actual value. And we need to figure out which
> huge page size we have when the pud/pmd is 0.

I don't understand the suggestions for puds, as they won't be contiguous?

Cheers,
--
Steve


Re: [PATCH v3 0/2] iov_iter: allow iov_iter_get_pages_alloc to allocate more pages per call

2017-02-13 Thread Steve Capper
On Fri, Feb 03, 2017 at 11:28:48AM -0800, Linus Torvalds wrote:
> On Fri, Feb 3, 2017 at 11:08 AM, Al Viro  wrote:
> >
> > On x86 it does.  I don't see anything equivalent in mm/gup.c one, and the
> > only kinda-sorta similar thing (access_ok() in __get_user_pages_fast()
> > there) is vulnerable to e.g. access via kernel_write().
> 
> Yeah, access_ok() is bogus. It needs to just check against TASK_SIZE
> or whatever.
> 
> > doesn't look promising - access_ok() is never sufficient.  Something like
> > _PAGE_USER tests in x86 one solves that problem, but if anything similar
> > works for HAVE_GENERIC_RCU_GUP I don't see it.  Thus the question re
> > what am I missing here...
> 
> Ok, I definitely agree that it looks like __get_user_pages_fast() just
> needs to get rid of the access_ok() and replace it with a proper check
> for the user address space range.
> 
> Looks like arm[64] and powerpc.are the current users. Adding in some
> people involved with the original submission a few years ago.

Hi,

[ Apologies for my late reply, I was on vacation then catchup... ]

> 
> I do note that the x86 __get_user_pages_fast() thing looks dodgy too.
> 
> In particular, we do it right in the *real* get_user_pages_fast(), see
> commit 7f8189068726 ("x86: don't use 'access_ok()' as a range check in
> get_user_pages_fast()"). But then the same bug was re-introduced when
> the "irq safe" version was merged. As well as in the GENERIC_RCU_GUP
> version.
> 
> Gaah. Apparently PeterZ copied the old buggy version before the fix
> when he added __get_user_pages_fast() in commit 465a454f254e ("x86,
> mm: Add __get_user_pages_fast()").
> 
> I guess it could be considered a merge error (both happened during the
> 2.6.31 merge window).
> 

Okay so looking at what we have for access_ok(.) on arm64, my
understanding is that we perform a 65-bit add/compare (in assembler) to
see whether or not the range is below the current_thread_info->addr_limit.
So I think this is a roundabout way of checking for no-wrap around and <= 
TASK_SIZE.

Looking at powerpc, I see it's a little different...

So if it sounds reasonable to folk I was going to send a patch to
replace the call to access_ok(.) with a wraparound + TASK_SIZE check
written explicitly in C? (and remove some of the comments talking about
access_ok(.)).
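
Something along the lines of the sketch below (plain C for illustration;
PAGE_SHIFT and TASK_SIZE stand in for the arch definitions, and the function
name is made up):

static bool gup_fast_range_ok(unsigned long start, int nr_pages)
{
	unsigned long len = (unsigned long)nr_pages << PAGE_SHIFT;
	unsigned long end = start + len;

	/* Reject wrap-around... */
	if (end < start)
		return false;

	/* ...and anything that strays above the user address limit. */
	return end <= TASK_SIZE;
}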

Cheers,
-- 
Steve


[PATCH v2] rmap: Fix compound check logic in page_remove_file_rmap

2016-08-10 Thread Steve Capper
In page_remove_file_rmap(.) we have the following check:
  VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);

This is meant to check for either HugeTLB pages or THP when a compound
page is passed in.

Unfortunately, if one disables CONFIG_TRANSPARENT_HUGEPAGE, then
PageTransHuge(.) will always return false, provoking BUGs when one runs
the libhugetlbfs test suite.

This patch replaces PageTransHuge(), with PageHead() which will work for
both HugeTLB and THP.

Fixes: dd78fedde4b9 ("rmap: support file thp")
Cc: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
Cc: Andrew Morton <a...@linux-foundation.org>
Signed-off-by: Steve Capper <steve.cap...@arm.com>

---

v2 - switch to PageHead as suggested by Kirill.
---
 mm/rmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 709bc83..1180340 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1303,7 +1303,7 @@ static void page_remove_file_rmap(struct page *page, bool 
compound)
 {
int i, nr = 1;
 
-   VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
+   VM_BUG_ON_PAGE(compound && !PageHead(page), page);
lock_page_memcg(page);
 
/* Hugepages are not counted in NR_FILE_MAPPED for now. */
-- 
1.8.3.1



[PATCH] rmap: Fix compound check logic in page_remove_file_rmap

2016-08-09 Thread Steve Capper
In page_remove_file_rmap(.) we have the following check:
  VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);

This is meant to check for either HugeTLB pages or THP when a compound
page is passed in.

Unfortunately, if one disables CONFIG_TRANSPARENT_HUGEPAGE, then
PageTransHuge(.) will always return false provoking BUGs when one runs
the libhugetlbfs test suite.

Changing the definition of PageTransHuge to be defined for
!CONFIG_TRANSPARENT_HUGEPAGE turned out to provoke build bugs; so this
patch instead replaces the errant check with:
  PageTransHuge(page) || PageHuge(page)

Fixes: dd78fedde4b9 ("rmap: support file thp")
Cc: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
Cc: Andrew Morton <a...@linux-foundation.org>
Signed-off-by: Steve Capper <steve.cap...@arm.com>
---
 mm/rmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 709bc83..ad8fc51 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1303,7 +1303,7 @@ static void page_remove_file_rmap(struct page *page, bool 
compound)
 {
int i, nr = 1;
 
-   VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
+   VM_BUG_ON_PAGE(compound && !(PageTransHuge(page) || PageHuge(page)), 
page);
lock_page_memcg(page);
 
/* Hugepages are not counted in NR_FILE_MAPPED for now. */
-- 
1.8.3.1



Re: [PATCH] arm64: Add config to limit user space to 47bits

2016-07-13 Thread Steve Capper
Hi Alex,

Thanks for posting this.

On Wed, Jul 13, 2016 at 06:14:11PM +0200, Alexander Graf wrote:
> On 07/13/2016 05:59 PM, Ard Biesheuvel wrote:
> >On 13 July 2016 at 17:42, Alexander Graf  wrote:
> >>Some user space applications are known to break with 48 bits virtual
> >known by whom? At least I wasn't aware of it, so could you please
> >share some examples?
> 
> Sure! Known to me so far are:
> 
>   * mozjs17
>   * mozjs24
>   * mozjs38
>   * js-1.8.5
>   * java-1.7 (older JITs, fixed in newer ones)
> 
> I'm not sure if there are more, but the fact that I've run into this
> problem more than once doesn't make me incredibly happy :).
> 

I came across this too: on bootup via polkitd (which pulled in mozJS) :-(.

> >
> >>address space. As interim step until the world is healed and everyone
> >>embraces correct code, this patch allows to only expose 47 bits of
> >>virtual address space to user space.
> >>
> >Is this a code generation/toolchain issue?
> 
> mozjs uses a single 64bit value to combine doubles, ints and
> pointers into a single variable. It is very smart and uses the upper
> 17 bits for metadata such as "which type of variable is this".
> Coincidentally those bits happen to overlap the "double is an
> infinite number" bits, so that you can also express a NaN with it.
> When using such a value, the upper 17 bits get masked out.
> 
> That one was fixed upstream by force allocating the javascript heap
> starting at a fixed location which is below 47 bits.
> 
> js-1.8.5 has the same as above, but also uses pointers to .rodata as
> javascript pointers, so it doesn't only use the heap, it also uses
> pointers to the library itself, which gets mapped high up the
> address space. I don't have a solution for that one yet.

Is this Spidermonkey 1.8.5? I wasn't aware of this issue.

> 
> IcedTea for java-1.7 had a bug where it incorrectly caused an
> overflow when trying to calculating a relative adrp offset from
>  to , so that the resulting
> pointer had the upper bits set as 1s. That one is long fixed
> upstream, we only ran into it because we used an ancient IcedTea
> snapshot.

I would recommend updating the sources used for OpenJDK anyway as there
have been a few other stability and performance fixes put in over the
last year to my knowledge.

> 
> My main concern however is with code that I do not know is broken today.
> 

I think if we set the 47-bit VA we are just ignoring the fundamental
problem and even allowing the problem to get worse (as future code may
adopt unsafe pointer tagging); thus I agree with Mark Rutland's NAK.

Personally, I would only ever tag bits in the VA space that I control
(i.e. at the bottom of the pointer if I enforce alignment).
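
As a standalone illustration of that kind of low-bit tagging (not taken from
any of the projects above): with 8-byte-aligned allocations the bottom three
bits of a pointer are guaranteed to be zero, so they can carry a tag no matter
how many VA bits the kernel exposes.

#include <stdint.h>

#define PTR_TAG_MASK 0x7UL	/* low bits freed up by 8-byte alignment */

static inline uintptr_t ptr_with_tag(void *p, unsigned int tag)
{
	return (uintptr_t)p | (tag & PTR_TAG_MASK);
}

static inline void *ptr_strip_tag(uintptr_t v)
{
	return (void *)(v & ~PTR_TAG_MASK);
}

static inline unsigned int ptr_get_tag(uintptr_t v)
{
	return (unsigned int)(v & PTR_TAG_MASK);
}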

Cheers,
-- 
Steve


Re: [PATCH v16 1/6] efi: ARM/arm64: ignore DT memory nodes instead of removing them

2016-04-14 Thread Steve Capper
On Thu, Apr 14, 2016 at 01:10:35PM +0200, Ard Biesheuvel wrote:
> On 14 April 2016 at 13:02, Steve Capper <steve.cap...@arm.com> wrote:
> > On Fri, Apr 08, 2016 at 03:50:23PM -0700, David Daney wrote:
> >> From: Ard Biesheuvel <ard.biesheu...@linaro.org>
> >>
> >> There are two problems with the UEFI stub DT memory node removal
> >> routine:
> >> - it deletes nodes as it traverses the tree, which happens to work
> >>   but is not supported, as deletion invalidates the node iterator;
> >> - deleting memory nodes entirely may discard annotations in the form
> >>   of additional properties on the nodes.
> >>
> >> Since the discovery of DT memory nodes occurs strictly before the
> >> UEFI init sequence, we can simply clear the memblock memory table
> >> before parsing the UEFI memory map. This way, it is no longer
> >> necessary to remove the nodes, so we can remove that logic from the
> >> stub as well.
> >>
> >> Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
> >> Signed-off-by: David Daney <david.da...@cavium.com>
> >> ---
> >>  drivers/firmware/efi/arm-init.c|  8 
> >>  drivers/firmware/efi/libstub/fdt.c | 24 +---
> >>  2 files changed, 9 insertions(+), 23 deletions(-)
> >>
> >> diff --git a/drivers/firmware/efi/arm-init.c 
> >> b/drivers/firmware/efi/arm-init.c
> >> index aa1f743..5d6945b 100644
> >> --- a/drivers/firmware/efi/arm-init.c
> >> +++ b/drivers/firmware/efi/arm-init.c
> >> @@ -143,6 +143,14 @@ static __init void reserve_regions(void)
> >>   if (efi_enabled(EFI_DBG))
> >>   pr_info("Processing EFI memory map:\n");
> >>
> >> + /*
> >> +  * Discard memblocks discovered so far: if there are any at this
> >> +  * point, they originate from memory nodes in the DT, and UEFI
> >> +  * uses its own memory map instead.
> >> +  */
> >> + memblock_dump_all();
> >> + memblock_remove(0, ULLONG_MAX);
> >> +
> >
> > Does this change need to be applied to any other architectures given
> > that deletion code has been removed from libstub below?
> >
> 
> The 'generic' libstub code below is only used by ARM, so we're safe
> here in that regard.

Thanks Ard,
In that case, FWIW:
Acked-by: Steve Capper <steve.cap...@arm.com>

Cheers,
-- 
Steve

> 
> 
> >>   for_each_efi_memory_desc(, md) {
> >>   paddr = md->phys_addr;
> >>   npages = md->num_pages;
> >> diff --git a/drivers/firmware/efi/libstub/fdt.c 
> >> b/drivers/firmware/efi/libstub/fdt.c
> >> index 6dba78a..e58abfa 100644
> >> --- a/drivers/firmware/efi/libstub/fdt.c
> >> +++ b/drivers/firmware/efi/libstub/fdt.c
> >> @@ -24,7 +24,7 @@ efi_status_t update_fdt(efi_system_table_t *sys_table, 
> >> void *orig_fdt,
> >>   unsigned long map_size, unsigned long desc_size,
> >>   u32 desc_ver)
> >>  {
> >> - int node, prev, num_rsv;
> >> + int node, num_rsv;
> >>   int status;
> >>   u32 fdt_val32;
> >>   u64 fdt_val64;
> >> @@ -54,28 +54,6 @@ efi_status_t update_fdt(efi_system_table_t *sys_table, 
> >> void *orig_fdt,
> >>   goto fdt_set_fail;
> >>
> >>   /*
> >> -  * Delete any memory nodes present. We must delete nodes which
> >> -  * early_init_dt_scan_memory may try to use.
> >> -  */
> >> - prev = 0;
> >> - for (;;) {
> >> - const char *type;
> >> - int len;
> >> -
> >> - node = fdt_next_node(fdt, prev, NULL);
> >> - if (node < 0)
> >> - break;
> >> -
> >> - type = fdt_getprop(fdt, node, "device_type", );
> >> - if (type && strncmp(type, "memory", len) == 0) {
> >> - fdt_del_node(fdt, node);
> >> - continue;
> >> - }
> >> -
> >> - prev = node;
> >> - }
> >> -
> >> - /*
> >>* Delete all memory reserve map entries. When booting via UEFI,
> >>* kernel will use the UEFI memory map to find reserved regions.
> >>*/
> >> --
> >> 1.8.3.1
> >>
> >>
> >> ___
> >> linux-arm-kernel mailing list
> >> linux-arm-ker...@lists.infradead.org
> >> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> >>
> 


Re: [PATCH v16 1/6] efi: ARM/arm64: ignore DT memory nodes instead of removing them

2016-04-14 Thread Steve Capper
On Fri, Apr 08, 2016 at 03:50:23PM -0700, David Daney wrote:
> From: Ard Biesheuvel 
> 
> There are two problems with the UEFI stub DT memory node removal
> routine:
> - it deletes nodes as it traverses the tree, which happens to work
>   but is not supported, as deletion invalidates the node iterator;
> - deleting memory nodes entirely may discard annotations in the form
>   of additional properties on the nodes.
> 
> Since the discovery of DT memory nodes occurs strictly before the
> UEFI init sequence, we can simply clear the memblock memory table
> before parsing the UEFI memory map. This way, it is no longer
> necessary to remove the nodes, so we can remove that logic from the
> stub as well.
> 
> Signed-off-by: Ard Biesheuvel 
> Signed-off-by: David Daney 
> ---
>  drivers/firmware/efi/arm-init.c|  8 
>  drivers/firmware/efi/libstub/fdt.c | 24 +---
>  2 files changed, 9 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/firmware/efi/arm-init.c b/drivers/firmware/efi/arm-init.c
> index aa1f743..5d6945b 100644
> --- a/drivers/firmware/efi/arm-init.c
> +++ b/drivers/firmware/efi/arm-init.c
> @@ -143,6 +143,14 @@ static __init void reserve_regions(void)
>   if (efi_enabled(EFI_DBG))
>   pr_info("Processing EFI memory map:\n");
>  
> + /*
> +  * Discard memblocks discovered so far: if there are any at this
> +  * point, they originate from memory nodes in the DT, and UEFI
> +  * uses its own memory map instead.
> +  */
> + memblock_dump_all();
> + memblock_remove(0, ULLONG_MAX);
> +

Does this change need to be applied to any other architectures given
that deletion code has been removed from libstub below?

Cheers,
-- 
Steve

>   for_each_efi_memory_desc(, md) {
>   paddr = md->phys_addr;
>   npages = md->num_pages;
> diff --git a/drivers/firmware/efi/libstub/fdt.c 
> b/drivers/firmware/efi/libstub/fdt.c
> index 6dba78a..e58abfa 100644
> --- a/drivers/firmware/efi/libstub/fdt.c
> +++ b/drivers/firmware/efi/libstub/fdt.c
> @@ -24,7 +24,7 @@ efi_status_t update_fdt(efi_system_table_t *sys_table, void 
> *orig_fdt,
>   unsigned long map_size, unsigned long desc_size,
>   u32 desc_ver)
>  {
> - int node, prev, num_rsv;
> + int node, num_rsv;
>   int status;
>   u32 fdt_val32;
>   u64 fdt_val64;
> @@ -54,28 +54,6 @@ efi_status_t update_fdt(efi_system_table_t *sys_table, 
> void *orig_fdt,
>   goto fdt_set_fail;
>  
>   /*
> -  * Delete any memory nodes present. We must delete nodes which
> -  * early_init_dt_scan_memory may try to use.
> -  */
> - prev = 0;
> - for (;;) {
> - const char *type;
> - int len;
> -
> - node = fdt_next_node(fdt, prev, NULL);
> - if (node < 0)
> - break;
> -
> - type = fdt_getprop(fdt, node, "device_type", );
> - if (type && strncmp(type, "memory", len) == 0) {
> - fdt_del_node(fdt, node);
> - continue;
> - }
> -
> - prev = node;
> - }
> -
> - /*
>* Delete all memory reserve map entries. When booting via UEFI,
>* kernel will use the UEFI memory map to find reserved regions.
>*/
> -- 
> 1.8.3.1
> 
> 
> ___
> linux-arm-kernel mailing list
> linux-arm-ker...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 


Re: [PATCH v16 6/6] arm64, mm, numa: Add NUMA balancing support for arm64.

2016-04-13 Thread Steve Capper
On Fri, Apr 08, 2016 at 03:50:28PM -0700, David Daney wrote:
> From: Ganapatrao Kulkarni <gkulka...@caviumnetworks.com>
> 
> Enable NUMA balancing for arm64 platforms.
> Add pte, pmd protnone helpers for use by automatic NUMA balancing.
> 
> Reviewed-by: Robert Richter <rrich...@cavium.com>
> Signed-off-by: Ganapatrao Kulkarni <gkulka...@caviumnetworks.com>
> Signed-off-by: David Daney <david.da...@cavium.com>
> ---
>  arch/arm64/Kconfig   |  1 +
>  arch/arm64/include/asm/pgtable.h | 15 +++
>  2 files changed, 16 insertions(+)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 99f9b55..a578080 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -11,6 +11,7 @@ config ARM64
>   select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
>   select ARCH_USE_CMPXCHG_LOCKREF
>   select ARCH_SUPPORTS_ATOMIC_RMW
> + select ARCH_SUPPORTS_NUMA_BALANCING
>   select ARCH_WANT_OPTIONAL_GPIOLIB
>   select ARCH_WANT_COMPAT_IPC_PARSE_VERSION
>   select ARCH_WANT_FRAME_POINTERS
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index 989fef1..89b8f20 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -272,6 +272,21 @@ static inline pgprot_t mk_sect_prot(pgprot_t prot)
>   return __pgprot(pgprot_val(prot) & ~PTE_TABLE_BIT);
>  }
>  
> +#ifdef CONFIG_NUMA_BALANCING
> +/*
> + * See the comment in include/asm-generic/pgtable.h
> + */
> +static inline int pte_protnone(pte_t pte)
> +{
> + return (pte_val(pte) & (PTE_VALID | PTE_PROT_NONE)) == PTE_PROT_NONE;
> +}
> +
> +static inline int pmd_protnone(pmd_t pmd)
> +{
> + return pte_protnone(pmd_pte(pmd));
> +}
> +#endif
> +

Okay, this looks good to me. If we have a PROT_NONE VMA then the fault is
caught before going into do_numa_page or do_huge_pmd_numa_page (and there
is a BUG_ON inside these functions to catch stragglers).

I've given this a quick test with a PROT_NONE THP and everything worked
as expected (i.e. NUMA didn't trip up).
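
Something along these lines exercises that path from userspace -- a sketch
only (not the exact test I ran), assuming a 2MB pmd size, i.e. a 4K
PAGE_SIZE configuration:

#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SZ (2UL << 20)  /* one pmd-sized region */

int main(void)
{
    /* over-allocate so we can pick a pmd-aligned 2MB window */
    char *raw = mmap(NULL, 2 * SZ, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    char *p;

    if (raw == MAP_FAILED)
        return 1;
    p = (char *)(((uintptr_t)raw + SZ - 1) & ~(SZ - 1));

    madvise(p, SZ, MADV_HUGEPAGE);  /* ask for THP backing */
    memset(p, 0x5a, SZ);            /* fault the region in */
    mprotect(p, SZ, PROT_NONE);     /* now a genuine PROT_NONE mapping */
    sleep(60);  /* leave it around for the NUMA balancing scanner */
    return 0;
}

The interesting checking is all kernel side (CONFIG_NUMA_BALANCING=y so the
hinting path is actually exercised).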

Reviewed-by: Steve Capper <steve.cap...@arm.com>

>  /*
>   * THP definitions.
>   */
> -- 
> 1.8.3.1
> 
> 
> ___
> linux-arm-kernel mailing list
> linux-arm-ker...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 


Re: [PATCH v15 5/6] arm64, numa: Add NUMA support for arm64 platforms.

2016-04-13 Thread Steve Capper
On Wed, Apr 13, 2016 at 03:09:08PM +0100, Steve Capper wrote:
> On Tue, Mar 08, 2016 at 11:59:46PM +, David Daney wrote:
> > From: Ganapatrao Kulkarni <gkulka...@caviumnetworks.com>
> > 
> > Attempt to get the memory and CPU NUMA node via of_numa.  If that
> > fails, default the dummy NUMA node and map all memory and CPUs to node
> > 0.
> > 
> > Tested-by: Shannon Zhao <shannon.z...@linaro.org>
> > Reviewed-by: Robert Richter <rrich...@cavium.com>
> > Signed-off-by: Ganapatrao Kulkarni <gkulka...@caviumnetworks.com>
> > Signed-off-by: David Daney <david.da...@cavium.com>
> 
> Hi David,
> 
> I have one minor comment below, but please feel free to add:
> Acked-by: Steve Capper <steve.cap...@arm.com>
> 

Whilst I learn how to use my email client, please also apply this
ack to the (nearly identical) patch in v16 of your series...

> Cheers,
> -- 
> Steve
> 
> > ---
> >  arch/arm64/Kconfig|  26 +++
> >  arch/arm64/include/asm/mmzone.h   |  12 ++
> >  arch/arm64/include/asm/numa.h |  45 +
> >  arch/arm64/include/asm/topology.h |  10 +
> >  arch/arm64/kernel/pci.c   |  10 +
> >  arch/arm64/kernel/setup.c |   4 +
> >  arch/arm64/kernel/smp.c   |   4 +
> >  arch/arm64/mm/Makefile|   1 +
> >  arch/arm64/mm/init.c  |  34 +++-
> >  arch/arm64/mm/mmu.c   |   1 +
> >  arch/arm64/mm/numa.c  | 396 
> > ++
> >  11 files changed, 538 insertions(+), 5 deletions(-)
> >  create mode 100644 arch/arm64/include/asm/mmzone.h
> >  create mode 100644 arch/arm64/include/asm/numa.h
> >  create mode 100644 arch/arm64/mm/numa.c
> > 
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index 39f2203..7013087 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -74,6 +74,7 @@ config ARM64
> > select HAVE_HW_BREAKPOINT if PERF_EVENTS
> > select HAVE_IRQ_TIME_ACCOUNTING
> > select HAVE_MEMBLOCK
> > +   select HAVE_MEMBLOCK_NODE_MAP if NUMA
> > select HAVE_PATA_PLATFORM
> > select HAVE_PERF_EVENTS
> > select HAVE_PERF_REGS
> > @@ -96,6 +97,7 @@ config ARM64
> > select SYSCTL_EXCEPTION_TRACE
> > select HAVE_CONTEXT_TRACKING
> > select HAVE_ARM_SMCCC
> > +   select OF_NUMA if NUMA && OF
> > help
> >   ARM 64-bit (AArch64) Linux support.
> >  
> > @@ -545,6 +547,30 @@ config HOTPLUG_CPU
> >   Say Y here to experiment with turning CPUs off and on.  CPUs
> >   can be controlled through /sys/devices/system/cpu.
> >  
> > +# Common NUMA Features
> > +config NUMA
> > +   bool "Numa Memory Allocation and Scheduler Support"
> > +   depends on SMP
> > +   help
> > + Enable NUMA (Non Uniform Memory Access) support.
> > +
> > + The kernel will try to allocate memory used by a CPU on the
> > + local memory of the CPU and add some more
> > + NUMA awareness to the kernel.
> > +
> > +config NODES_SHIFT
> > +   int "Maximum NUMA Nodes (as a power of 2)"
> > +   range 1 10
> > +   default "2"
> > +   depends on NEED_MULTIPLE_NODES
> > +   help
> > + Specify the maximum number of NUMA Nodes available on the target
> > + system.  Increases memory reserved to accommodate various tables.
> > +
> > +config USE_PERCPU_NUMA_NODE_ID
> > +   def_bool y
> > +   depends on NUMA
> > +
> >  source kernel/Kconfig.preempt
> >  source kernel/Kconfig.hz
> >  
> > diff --git a/arch/arm64/include/asm/mmzone.h 
> > b/arch/arm64/include/asm/mmzone.h
> > new file mode 100644
> > index 000..a0de9e6
> > --- /dev/null
> > +++ b/arch/arm64/include/asm/mmzone.h
> > @@ -0,0 +1,12 @@
> > +#ifndef __ASM_MMZONE_H
> > +#define __ASM_MMZONE_H
> > +
> > +#ifdef CONFIG_NUMA
> > +
> > +#include 
> > +
> > +extern struct pglist_data *node_data[];
> > +#define NODE_DATA(nid) (node_data[(nid)])
> > +
> > +#endif /* CONFIG_NUMA */
> > +#endif /* __ASM_MMZONE_H */
> > diff --git a/arch/arm64/include/asm/numa.h b/arch/arm64/include/asm/numa.h
> > new file mode 100644
> > index 000..e9b4f29
> > --- /dev/null
> > +++ b/arch/arm64/include/asm/numa.h
> > @@ -0,0 +1,45 @@
> > +#ifndef __ASM_NUMA_H
> > +#define __ASM_NUMA_H
> > +
> > +#include 
> > +
> > +#ifdef CONFIG_NUMA
> > +
> > +/*

Re: [PATCH v15 5/6] arm64, numa: Add NUMA support for arm64 platforms.

2016-04-13 Thread Steve Capper
On Tue, Mar 08, 2016 at 11:59:46PM +, David Daney wrote:
> From: Ganapatrao Kulkarni <gkulka...@caviumnetworks.com>
> 
> Attempt to get the memory and CPU NUMA node via of_numa.  If that
> fails, default the dummy NUMA node and map all memory and CPUs to node
> 0.
> 
> Tested-by: Shannon Zhao <shannon.z...@linaro.org>
> Reviewed-by: Robert Richter <rrich...@cavium.com>
> Signed-off-by: Ganapatrao Kulkarni <gkulka...@caviumnetworks.com>
> Signed-off-by: David Daney <david.da...@cavium.com>

Hi David,

I have one minor comment below, but please feel free to add:
Acked-by: Steve Capper <steve.cap...@arm.com>

Cheers,
-- 
Steve

> ---
>  arch/arm64/Kconfig|  26 +++
>  arch/arm64/include/asm/mmzone.h   |  12 ++
>  arch/arm64/include/asm/numa.h |  45 +
>  arch/arm64/include/asm/topology.h |  10 +
>  arch/arm64/kernel/pci.c   |  10 +
>  arch/arm64/kernel/setup.c |   4 +
>  arch/arm64/kernel/smp.c   |   4 +
>  arch/arm64/mm/Makefile|   1 +
>  arch/arm64/mm/init.c  |  34 +++-
>  arch/arm64/mm/mmu.c   |   1 +
>  arch/arm64/mm/numa.c  | 396 
> ++
>  11 files changed, 538 insertions(+), 5 deletions(-)
>  create mode 100644 arch/arm64/include/asm/mmzone.h
>  create mode 100644 arch/arm64/include/asm/numa.h
>  create mode 100644 arch/arm64/mm/numa.c
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 39f2203..7013087 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -74,6 +74,7 @@ config ARM64
>   select HAVE_HW_BREAKPOINT if PERF_EVENTS
>   select HAVE_IRQ_TIME_ACCOUNTING
>   select HAVE_MEMBLOCK
> + select HAVE_MEMBLOCK_NODE_MAP if NUMA
>   select HAVE_PATA_PLATFORM
>   select HAVE_PERF_EVENTS
>   select HAVE_PERF_REGS
> @@ -96,6 +97,7 @@ config ARM64
>   select SYSCTL_EXCEPTION_TRACE
>   select HAVE_CONTEXT_TRACKING
>   select HAVE_ARM_SMCCC
> + select OF_NUMA if NUMA && OF
>   help
> ARM 64-bit (AArch64) Linux support.
>  
> @@ -545,6 +547,30 @@ config HOTPLUG_CPU
> Say Y here to experiment with turning CPUs off and on.  CPUs
> can be controlled through /sys/devices/system/cpu.
>  
> +# Common NUMA Features
> +config NUMA
> + bool "Numa Memory Allocation and Scheduler Support"
> + depends on SMP
> + help
> +   Enable NUMA (Non Uniform Memory Access) support.
> +
> +   The kernel will try to allocate memory used by a CPU on the
> +   local memory of the CPU and add some more
> +   NUMA awareness to the kernel.
> +
> +config NODES_SHIFT
> + int "Maximum NUMA Nodes (as a power of 2)"
> + range 1 10
> + default "2"
> + depends on NEED_MULTIPLE_NODES
> + help
> +   Specify the maximum number of NUMA Nodes available on the target
> +   system.  Increases memory reserved to accommodate various tables.
> +
> +config USE_PERCPU_NUMA_NODE_ID
> + def_bool y
> + depends on NUMA
> +
>  source kernel/Kconfig.preempt
>  source kernel/Kconfig.hz
>  
> diff --git a/arch/arm64/include/asm/mmzone.h b/arch/arm64/include/asm/mmzone.h
> new file mode 100644
> index 000..a0de9e6
> --- /dev/null
> +++ b/arch/arm64/include/asm/mmzone.h
> @@ -0,0 +1,12 @@
> +#ifndef __ASM_MMZONE_H
> +#define __ASM_MMZONE_H
> +
> +#ifdef CONFIG_NUMA
> +
> +#include 
> +
> +extern struct pglist_data *node_data[];
> +#define NODE_DATA(nid)   (node_data[(nid)])
> +
> +#endif /* CONFIG_NUMA */
> +#endif /* __ASM_MMZONE_H */
> diff --git a/arch/arm64/include/asm/numa.h b/arch/arm64/include/asm/numa.h
> new file mode 100644
> index 000..e9b4f29
> --- /dev/null
> +++ b/arch/arm64/include/asm/numa.h
> @@ -0,0 +1,45 @@
> +#ifndef __ASM_NUMA_H
> +#define __ASM_NUMA_H
> +
> +#include 
> +
> +#ifdef CONFIG_NUMA
> +
> +/* currently, arm64 implements flat NUMA topology */
> +#define parent_node(node)(node)
> +
> +int __node_distance(int from, int to);
> +#define node_distance(a, b) __node_distance(a, b)
> +
> +extern nodemask_t numa_nodes_parsed __initdata;
> +
> +/* Mappings between node number and cpus on that node. */
> +extern cpumask_var_t node_to_cpumask_map[MAX_NUMNODES];
> +void numa_clear_node(unsigned int cpu);
> +
> +#ifdef CONFIG_DEBUG_PER_CPU_MAPS
> +const struct cpumask *cpumask_of_node(int node);
> +#else
> +/* Returns a pointer to the cpumask of CPUs on Node 'node'. */
> +static inline const struct cpumask *cpumask_of_node(int node)
> +{
> + return node_to_cpumask_map[node];
> 

Re: [PATCH] mm: Exclude HugeTLB pages from THP page_mapped logic

2016-04-01 Thread Steve Capper
Hi Andrew,

On Thu, Mar 31, 2016 at 04:06:50PM -0700, Andrew Morton wrote:
> On Tue, 29 Mar 2016 17:39:41 +0100 Steve Capper <steve.cap...@arm.com> wrote:
> 
> > HugeTLB pages cannot be split, thus use the compound_mapcount to
> > track rmaps.
> > 
> > Currently the page_mapped function will check the compound_mapcount, but
> 
> s/the page_mapped function/page_mapped()/.  It's so much simpler!

Thanks, agreed :-).

> 
> > will also go through the constituent pages of a THP compound page and
> > query the individual _mapcount's too.
> > 
> > Unfortunately, the page_mapped function does not distinguish between
> > HugeTLB and THP compound pages and assumes that a compound page always
> > needs to have HPAGE_PMD_NR pages querying.
> > 
> > For most cases when dealing with HugeTLB this is just inefficient, but
> > for scenarios where the HugeTLB page size is less than the pmd block
> > size (e.g. when using contiguous bit on ARM) this can lead to crashes.
> > 
> > This patch adjusts the page_mapped function such that we skip the
> > unnecessary THP reference checks for HugeTLB pages.
> > 
> > Fixes: e1534ae95004 ("mm: differentiate page_mapped() from page_mapcount() 
> > for compound pages")
> > Cc: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
> > Signed-off-by: Steve Capper <steve.cap...@arm.com>
> > ---
> > 
> > Hi,
> > 
> > This patch is my approach to fixing a problem that unearthed with
> > HugeTLB pages on arm64. We ran with PAGE_SIZE=64KB and placed down 32
> > contiguous ptes to create 2MB HugeTLB pages. (We can provide hints to
> > the MMU that page table entries are contiguous thus larger TLB entries
> > can be used to represent them).
> 
> So which kernel version(s) need this patch?  I think both 4.4 and 4.5
> will crash in this manner?  Should we backport the fix into 4.4.x and
> 4.5.x?

We de-activated the contiguous hint support just before 4.5 (as we ran
into the problem too late). So no kernels are currently crashing due to
this. If this goes in, we can then re-enable contiguous hint on ARM.

> 
> >
> > ...
> >
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1031,6 +1031,8 @@ static inline bool page_mapped(struct page *page)
> > page = compound_head(page);
> > if (atomic_read(compound_mapcount_ptr(page)) >= 0)
> > return true;
> > +   if (PageHuge(page))
> > +   return false;
> > for (i = 0; i < hpage_nr_pages(page); i++) {
> > if (atomic_read([i]._mapcount) >= 0)
> > return true;
> 
> page_mapped() is moronically huge.  Uninlining it saves 206 bytes per
> callsite. It has 40+ callsites.
> 
> 
> 
> 
> btw, is anyone else seeing this `make M=' breakage?
> 
> akpm3:/usr/src/25> make M=mm
> Makefile:679: Cannot use CONFIG_KCOV: -fsanitize-coverage=trace-pc is not 
> supported by compiler
> 
>   WARNING: Symbol version dump ./Module.symvers
>is missing; modules will have no dependencies and modversions.
> 
> make[1]: *** No rule to make target `mm/filemap.o', needed by 
> `mm/built-in.o'.  Stop.
> make: *** [_module_mm] Error 2
> 
> It's a post-4.5 thing.

Sorry I have not yet tried out KCOV.

> 
> 
> 
> From: Andrew Morton <a...@linux-foundation.org>
> Subject: mm: uninline page_mapped()
> 
> It's huge.  Uninlining it saves 206 bytes per callsite.  Shaves 4924 bytes
> from the x86_64 allmodconfig vmlinux.
> 
> Cc: Steve Capper <steve.cap...@arm.com>
> Cc: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
> Signed-off-by: Andrew Morton <a...@linux-foundation.org>
> ---

The below looks reasonable to me; I don't have any benchmarks handy to
test for a performance regression on this, though.

> 
>  include/linux/mm.h |   21 +
>  mm/util.c  |   22 ++
>  2 files changed, 23 insertions(+), 20 deletions(-)
> 
> diff -puN include/linux/mm.h~mm-uninline-page_mapped include/linux/mm.h
> --- a/include/linux/mm.h~mm-uninline-page_mapped
> +++ a/include/linux/mm.h
> @@ -1019,26 +1019,7 @@ static inline pgoff_t page_file_index(st
>   return page->index;
>  }
>  
> -/*
> - * Return true if this page is mapped into pagetables.
> - * For compound page it returns true if any subpage of compound page is 
> mapped.
> - */
> -static inline bool page_mapped(struct page *page)
> -{
> - int i;
> - if (likely(!PageCompound(page)))
> - return atomic_read(>_mapcount) >= 0;
> - page = compound_head(page);
> -

Re: [PATCH] mm: Exclude HugeTLB pages from THP page_mapped logic

2016-03-30 Thread Steve Capper
On Tue, Mar 29, 2016 at 07:51:49PM +0300, Kirill A. Shutemov wrote:
> On Tue, Mar 29, 2016 at 05:39:41PM +0100, Steve Capper wrote:
> > HugeTLB pages cannot be split, thus use the compound_mapcount to
> > track rmaps.
> > 
> > Currently the page_mapped function will check the compound_mapcount, but
> > will also go through the constituent pages of a THP compound page and
> > query the individual _mapcount's too.
> > 
> > Unfortunately, the page_mapped function does not distinguish between
> > HugeTLB and THP compound pages and assumes that a compound page always
> > needs to have HPAGE_PMD_NR pages querying.
> > 
> > For most cases when dealing with HugeTLB this is just inefficient, but
> > for scenarios where the HugeTLB page size is less than the pmd block
> > size (e.g. when using contiguous bit on ARM) this can lead to crashes.
> > 
> > This patch adjusts the page_mapped function such that we skip the
> > unnecessary THP reference checks for HugeTLB pages.
> > 
> > Fixes: e1534ae95004 ("mm: differentiate page_mapped() from page_mapcount() 
> > for compound pages")
> > Cc: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
> > Signed-off-by: Steve Capper <steve.cap...@arm.com>
> 
> Acked-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>

Thanks!

> 
> > ---
> > 
> > Hi,
> > 
> > This patch is my approach to fixing a problem that unearthed with
> > HugeTLB pages on arm64. We ran with PAGE_SIZE=64KB and placed down 32
> > contiguous ptes to create 2MB HugeTLB pages. (We can provide hints to
> > the MMU that page table entries are contiguous thus larger TLB entries
> > can be used to represent them).
> > 
> > The PMD_SIZE was 512MB thus the old version of page_mapped would read
> > through too many struct pages and lead to BUGs.
> > 
> > Original problem reported here:
> > http://lists.infradead.org/pipermail/linux-arm-kernel/2016-March/414657.html
> > 
> > Having examined the HugeTLB code, I understand that only the
> > compound_mapcount_ptr is used to track rmap presence so going through
> > the individual _mapcounts for HugeTLB pages is superfluous? Or should I
> > instead post a patch that changes hpage_nr_pages to use the compound
> > order?
> 
> I would not touch hpage_nr_page().
> 
> We probably need to introduce compound_nr_pages() or something to replace
> (1 << compound_order(page)) to be used independetely from thp/hugetlb
> pages.

Okay, I will stick with the approach in this patch. With HugeTLB we also
have hstate information to use.
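
For reference, the kind of helper Kirill describes above would presumably
look something like this (a sketch only, not something I am proposing as
part of this patch):

/*
 * Hypothetical replacement for open-coded (1 << compound_order(page)),
 * usable for both THP and hugetlb compound pages.
 */
static inline unsigned long compound_nr_pages(struct page *page)
{
    if (!PageCompound(page))
        return 1;
    return 1UL << compound_order(page);
}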

> 
> > Also, for the sake of readability, would it be worth changing the
> > definition of PageTransHuge to refer to only THPs (not both HugeTLB
> > and THP)?
> 
> I don't think so.
> 
> That would have overhead, since we would need to do a function call inside
> PageTransHuge(). HugeTLB() is not inlinable.

Ahh, I hadn't considered that...

> 
> hugetlb diverges from the rest of mm pretty early, so thp vs. hugetlb
> confusion is not that often. We just don't share enough codepath.

Thanks Kirill, agreed.

Cheers,
-- 
Steve


[PATCH] mm: Exclude HugeTLB pages from THP page_mapped logic

2016-03-29 Thread Steve Capper
HugeTLB pages cannot be split, thus use the compound_mapcount to
track rmaps.

Currently the page_mapped function will check the compound_mapcount, but
will also go through the constituent pages of a THP compound page and
query the individual _mapcount's too.

Unfortunately, the page_mapped function does not distinguish between
HugeTLB and THP compound pages and assumes that a compound page always
has HPAGE_PMD_NR constituent pages that need querying.

For most cases when dealing with HugeTLB this is just inefficient, but
for scenarios where the HugeTLB page size is less than the pmd block
size (e.g. when using contiguous bit on ARM) this can lead to crashes.

This patch adjusts the page_mapped function such that we skip the
unnecessary THP reference checks for HugeTLB pages.

Fixes: e1534ae95004 ("mm: differentiate page_mapped() from page_mapcount() for 
compound pages")
Cc: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
Signed-off-by: Steve Capper <steve.cap...@arm.com>
---

Hi,

This patch is my approach to fixing a problem that surfaced with
HugeTLB pages on arm64. We ran with PAGE_SIZE=64KB and placed down 32
contiguous ptes to create 2MB HugeTLB pages. (We can provide hints to
the MMU that page table entries are contiguous thus larger TLB entries
can be used to represent them).

The PMD_SIZE was 512MB thus the old version of page_mapped would read
through too many struct pages and lead to BUGs.
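
To put numbers on that configuration:

    PAGE_SIZE    = 64KB
    PMD_SIZE     = 512MB, so HPAGE_PMD_NR = 512MB / 64KB = 8192
    HugeTLB page = 2MB,   so it has only     2MB / 64KB = 32 constituent pages

i.e. the loop in page_mapped() walked 8192 struct pages for a huge page
backed by only 32, running a long way past the end of the compound page.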

Original problem reported here:
http://lists.infradead.org/pipermail/linux-arm-kernel/2016-March/414657.html

Having examined the HugeTLB code, I understand that only the
compound_mapcount_ptr is used to track rmap presence so going through
the individual _mapcounts for HugeTLB pages is superfluous? Or should I
instead post a patch that changes hpage_nr_pages to use the compound
order?

Also, for the sake of readability, would it be worth changing the
definition of PageTransHuge to refer to only THPs (not both HugeTLB
and THP)?

(I misinterpreted PageTransHuge in hpage_nr_pages initially, which is one
reason this issue took me longer than normal to pin down.)

Cheers,
-- 
Steve

---
 include/linux/mm.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ed6407d..4b223dc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1031,6 +1031,8 @@ static inline bool page_mapped(struct page *page)
page = compound_head(page);
if (atomic_read(compound_mapcount_ptr(page)) >= 0)
return true;
+   if (PageHuge(page))
+   return false;
for (i = 0; i < hpage_nr_pages(page); i++) {
if (atomic_read(&page[i]._mapcount) >= 0)
return true;
-- 
2.1.0



Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-25 Thread Steve Capper
On 25 February 2016 at 16:01, Kirill A. Shutemov <kir...@shutemov.name> wrote:
> On Thu, Feb 25, 2016 at 03:49:33PM +0000, Steve Capper wrote:
>> On 23 February 2016 at 18:47, Will Deacon <will.dea...@arm.com> wrote:
>> > [adding Steve, since he worked on THP for 32-bit ARM]
>>
>> Apologies for my late reply...
>>
>> >
>> > On Tue, Feb 23, 2016 at 07:19:07PM +0100, Gerald Schaefer wrote:
>> >> On Tue, 23 Feb 2016 13:32:21 +0300
>> >> "Kirill A. Shutemov" <kir...@shutemov.name> wrote:
>> >> > The theory is that the splitting bit effetely masked bogus 
>> >> > pmd_present():
>> >> > we had pmd_trans_splitting() in all code path and that prevented mm from
>> >> > touching the pmd. Once pmd_trans_splitting() has gone, mm proceed with 
>> >> > the
>> >> > pmd where it shouldn't and here's a boom.
>> >>
>> >> Well, I don't think pmd_present() == true is bogus for a trans_huge pmd 
>> >> under
>> >> splitting, after all there is a page behind the the pmd. Also, if it was
>> >> bogus, and it would need to be false, why should it be marked 
>> >> !pmd_present()
>> >> only at the pmdp_invalidate() step before the pmd_populate()? It clearly
>> >> is pmd_present() before that, on all architectures, and if there was any
>> >> problem/race with that, setting it to !pmd_present() at this stage would
>> >> only (marginally) reduce the race window.
>> >>
>> >> BTW, PowerPC and Sparc seem to do the same thing in pmdp_invalidate(),
>> >> i.e. they do not set pmd_present() == false, only mark it so that it would
>> >> not generate a new TLB entry, just like on s390. After all, the function
>> >> is called pmdp_invalidate(), and I think the comment in mm/huge_memory.c
>> >> before that call is just a little ambiguous in its wording. When it says
>> >> "mark the pmd notpresent" it probably means "mark it so that it will not
>> >> generate a new TLB entry", which is also what the comment is really about:
>> >> prevent huge and small entries in the TLB for the same page at the same
>> >> time.
>> >>
>> >> FWIW, and since the ARM arch-list is already on cc, I think there is
>> >> an issue with pmdp_invalidate() on ARM, since it also seems to clear
>> >> the trans_huge (and formerly trans_splitting) bit, which actually makes
>> >> the pmd !pmd_present(), but it violates the other requirement from the
>> >> comment:
>> >> "the pmd_trans_huge and pmd_trans_splitting must remain set at all times
>> >> on the pmd until the split is complete for this pmd"
>> >
>> > I've only been testing this for arm64 (where I'm yet to see a problem),
>> > but we use the generic pmdp_invalidate implementation from
>> > mm/pgtable-generic.c there. On arm64, pmd_trans_huge will return true
>> > after pmd_mknotpresent. On arm, it does look to be buggy, since it nukes
>> > the entire entry... Steve?
>>
>> pmd_mknotpresent on arm looks inconsistent with the other
>> architectures and can be changed.
>>
>> Having had a look at the usage, I can't see it causing an immediate
>> problem (that needs to be addressed by an emergency patch).
>> We don't have a notion of splitting pmds (so there is no splitting
>> information to lose), and the only usage I could see of
>> pmd_mknotpresent was:
>>
>> pmdp_invalidate(vma, haddr, pmd);
>> pmd_populate(mm, pmd, pgtable);
>>
>> In mm/huge_memory.c, around line 3588.
>>
>> So we invalidate the entry (which puts down a faulting entry from
>> pmd_mknotpresent and invalidates tlb), then immediately put down a
>> table entry with pmd_populate.
>>
>> I have run a 32-bit ARM test kernel and exacerbated THP splits (that's
>> what took me time), and I didn't notice any problems with 4.5-rc5.
>
> If I read code correctly, your pmd_mknotpresent() makes the pmd
> pmd_none(), right? If yes, it's a problem.
>
> It introduces race I've described here:
>
> https://marc.info/?l=linux-mm&m=144723658100512&w=4
>
> Basically, if zap_pmd_range() would see pmd_none() between
> pmdp_mknotpresent() and pmd_populate(), we're screwed.
>
> The race window is small, but it's there.

Ah, okay, thank you Kirill.
I agree, I'll get a patch out.
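
For anyone following the thread later: the crux is that the 32-bit ARM
pmd_mknotpresent() zeroes the whole entry, so the pmd is pmd_none() in the
window Kirill describes above. A minimal sketch of the kind of change that
closes it, clearing only the hardware valid bit so generic mm code still
sees a populated entry. This is a hypothetical illustration, not the patch
that was actually posted, and PMD_HW_VALID_BIT is a placeholder rather than
a real arm/arm64 macro name:

/* Hypothetical sketch: invalid to the MMU, but still !pmd_none(). */
static inline pmd_t pmd_mknotpresent(pmd_t pmd)
{
        return __pmd(pmd_val(pmd) & ~PMD_HW_VALID_BIT);
}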

Cheers,
--
Steve


Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-25 Thread Steve Capper
On 23 February 2016 at 18:47, Will Deacon  wrote:
> [adding Steve, since he worked on THP for 32-bit ARM]

Apologies for my late reply...

>
> On Tue, Feb 23, 2016 at 07:19:07PM +0100, Gerald Schaefer wrote:
>> On Tue, 23 Feb 2016 13:32:21 +0300
>> "Kirill A. Shutemov"  wrote:
>> > The theory is that the splitting bit effectively masked bogus pmd_present():
>> > we had pmd_trans_splitting() in all code path and that prevented mm from
>> > touching the pmd. Once pmd_trans_splitting() has gone, mm proceed with the
>> > pmd where it shouldn't and here's a boom.
>>
>> Well, I don't think pmd_present() == true is bogus for a trans_huge pmd under
>> splitting, after all there is a page behind the pmd. Also, if it was
>> bogus, and it would need to be false, why should it be marked !pmd_present()
>> only at the pmdp_invalidate() step before the pmd_populate()? It clearly
>> is pmd_present() before that, on all architectures, and if there was any
>> problem/race with that, setting it to !pmd_present() at this stage would
>> only (marginally) reduce the race window.
>>
>> BTW, PowerPC and Sparc seem to do the same thing in pmdp_invalidate(),
>> i.e. they do not set pmd_present() == false, only mark it so that it would
>> not generate a new TLB entry, just like on s390. After all, the function
>> is called pmdp_invalidate(), and I think the comment in mm/huge_memory.c
>> before that call is just a little ambiguous in its wording. When it says
>> "mark the pmd notpresent" it probably means "mark it so that it will not
>> generate a new TLB entry", which is also what the comment is really about:
>> prevent huge and small entries in the TLB for the same page at the same
>> time.
>>
>> FWIW, and since the ARM arch-list is already on cc, I think there is
>> an issue with pmdp_invalidate() on ARM, since it also seems to clear
>> the trans_huge (and formerly trans_splitting) bit, which actually makes
>> the pmd !pmd_present(), but it violates the other requirement from the
>> comment:
>> "the pmd_trans_huge and pmd_trans_splitting must remain set at all times
>> on the pmd until the split is complete for this pmd"
>
> I've only been testing this for arm64 (where I'm yet to see a problem),
> but we use the generic pmdp_invalidate implementation from
> mm/pgtable-generic.c there. On arm64, pmd_trans_huge will return true
> after pmd_mknotpresent. On arm, it does look to be buggy, since it nukes
> the entire entry... Steve?

pmd_mknotpresent on arm looks inconsistent with the other
architectures and can be changed.

Having had a look at the usage, I can't see it causing an immediate
problem (that needs to be addressed by an emergency patch).
We don't have a notion of splitting pmds (so there is no splitting
information to lose), and the only usage I could see of
pmd_mknotpresent was:

pmdp_invalidate(vma, haddr, pmd);
pmd_populate(mm, pmd, pgtable);

In mm/huge_memory.c, around line 3588.

So we invalidate the entry (which puts down a faulting entry from
pmd_mknotpresent and invalidates tlb), then immediately put down a
table entry with pmd_populate.

I have run a 32-bit ARM test kernel and exacerbated THP splits (that's
what took me time), and I didn't notice any problems with 4.5-rc5.

Cheers,
-- 
Steve

>
> Will
>


Re: [PATCH v5] arm64: Add support for PTE contiguous bit.

2015-12-20 Thread Steve Capper
On 17 December 2015 at 19:31, David Woods  wrote:
> The arm64 MMU supports a Contiguous bit which is a hint that the TTE
> is one of a set of contiguous entries which can be cached in a single
> TLB entry.  Supporting this bit adds new intermediate huge page sizes.
>
> The set of huge page sizes available depends on the base page size.
> Without using contiguous pages the huge page sizes are as follows.
>
>  4KB:   2MB  1GB
> 64KB: 512MB
>
> With a 4KB granule, the contiguous bit groups together sets of 16 pages
> and with a 64KB granule it groups sets of 32 pages.  This enables two new
> huge page sizes in each case, so that the full set of available sizes
> is as follows.
>
>  4KB:  64KB   2MB  32MB  1GB
> 64KB:   2MB 512MB  16GB
>
> If a 16KB granule is used then the contiguous bit groups 128 pages
> at the PTE level and 32 pages at the PMD level.
>
> If the base page size is set to 64KB then 2MB pages are enabled by
> default.  It is possible in the future to make 2MB the default huge
> page size for both 4KB and 64KB granules.
>
> Signed-off-by: David Woods 
> Reviewed-by: Chris Metcalf 

Thanks for this David, this looks great to me. Please add:
Reviewed-by: Steve Capper 

...and have a great Christmas break.

> ---
>
> Version 5 cleans up issues building with STRICT_MM_TYPECHECKS defined
> pointed out by Steve Capper.
>
>  arch/arm64/Kconfig |   3 -
>  arch/arm64/include/asm/hugetlb.h   |  44 ++
>  arch/arm64/include/asm/pgtable-hwdef.h |  18 ++-
>  arch/arm64/include/asm/pgtable.h   |  10 +-
>  arch/arm64/mm/hugetlbpage.c| 274 
> -
>  include/linux/hugetlb.h|   2 -
>  6 files changed, 313 insertions(+), 38 deletions(-)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 4876459..ffa3c54 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -530,9 +530,6 @@ config HW_PERF_EVENTS
>  config SYS_SUPPORTS_HUGETLBFS
> def_bool y
>
> -config ARCH_WANT_GENERAL_HUGETLB
> -   def_bool y
> -
>  config ARCH_WANT_HUGE_PMD_SHARE
> def_bool y if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
>
> diff --git a/arch/arm64/include/asm/hugetlb.h 
> b/arch/arm64/include/asm/hugetlb.h
> index bb4052e..bbc1e35 100644
> --- a/arch/arm64/include/asm/hugetlb.h
> +++ b/arch/arm64/include/asm/hugetlb.h
> @@ -26,36 +26,7 @@ static inline pte_t huge_ptep_get(pte_t *ptep)
> return *ptep;
>  }
>
> -static inline void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
> -  pte_t *ptep, pte_t pte)
> -{
> -   set_pte_at(mm, addr, ptep, pte);
> -}
> -
> -static inline void huge_ptep_clear_flush(struct vm_area_struct *vma,
> -unsigned long addr, pte_t *ptep)
> -{
> -   ptep_clear_flush(vma, addr, ptep);
> -}
> -
> -static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
> -  unsigned long addr, pte_t *ptep)
> -{
> -   ptep_set_wrprotect(mm, addr, ptep);
> -}
>
> -static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
> -   unsigned long addr, pte_t *ptep)
> -{
> -   return ptep_get_and_clear(mm, addr, ptep);
> -}
> -
> -static inline int huge_ptep_set_access_flags(struct vm_area_struct *vma,
> -unsigned long addr, pte_t *ptep,
> -pte_t pte, int dirty)
> -{
> -   return ptep_set_access_flags(vma, addr, ptep, pte, dirty);
> -}
>
>  static inline void hugetlb_free_pgd_range(struct mmu_gather *tlb,
>   unsigned long addr, unsigned long 
> end,
> @@ -97,4 +68,19 @@ static inline void arch_clear_hugepage_flags(struct page 
> *page)
> clear_bit(PG_dcache_clean, &page->flags);
>  }
>
> +extern pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
> +   struct page *page, int writable);
> +#define arch_make_huge_pte arch_make_huge_pte
> +extern void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
> +   pte_t *ptep, pte_t pte);
> +extern int huge_ptep_set_access_flags(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + pte_t pte, int dirty);
> +extern pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
> +unsigned long addr, pte_t *ptep);
> +extern void huge_ptep_set_wrprotect(struct mm_struct *mm,
> + 

Re: [PATCH v4] arm64: Add support for PTE contiguous bit.

2015-12-16 Thread Steve Capper
On 11 December 2015 at 21:02, David Woods  wrote:
> The arm64 MMU supports a Contiguous bit which is a hint that the TTE
> is one of a set of contiguous entries which can be cached in a single
> TLB entry.  Supporting this bit adds new intermediate huge page sizes.
>
> The set of huge page sizes available depends on the base page size.
> Without using contiguous pages the huge page sizes are as follows.
>
>  4KB:   2MB  1GB
> 64KB: 512MB
>
> With a 4KB granule, the contiguous bit groups together sets of 16 pages
> and with a 64KB granule it groups sets of 32 pages.  This enables two new
> huge page sizes in each case, so that the full set of available sizes
> is as follows.
>
>  4KB:  64KB   2MB  32MB  1GB
> 64KB:   2MB 512MB  16GB
>
> If a 16KB granule is used then the contiguous bit groups 128 pages
> at the PTE level and 32 pages at the PMD level.
>
> If the base page size is set to 64KB then 2MB pages are enabled by
> default.  It is possible in the future to make 2MB the default huge
> page size for both 4KB and 64KB granules.
>
> Signed-off-by: David Woods 
> Reviewed-by: Chris Metcalf 
> ---
>
> This version of the patch addresses all the comments I've received
> to date and is passing the libhugetlbfs tests.  Catalin, assuming
> there are no further comments, can this be considered for the arm64
> next tree?

Hi David,
Thanks for this revised series.

I have a few comments below. Most arose when I enabled STRICT_MM_TYPECHECKS.

I have tested this on my arm64 system with PAGE_SIZE==64KB, and it ran well.
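
(As an aside, a rough illustration of why STRICT_MM_TYPECHECKS shakes
these issues out: with it enabled the page table types become distinct
structs, so mixing a pte_t up with a raw value or a pmd_t stops compiling.
This is a simplified sketch of the usual kernel pattern, not the exact
arm64 definitions:

#ifdef STRICT_MM_TYPECHECKS
typedef struct { unsigned long pte; } pte_t;    /* distinct type */
#define pte_val(x)      ((x).pte)
#define __pte(x)        ((pte_t) { (x) })
#else
typedef unsigned long pte_t;                    /* bare scalar */
#define pte_val(x)      (x)
#define __pte(x)        (x)
#endif
)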

Cheers,
--
Steve

>
>  arch/arm64/Kconfig |   3 -
>  arch/arm64/include/asm/hugetlb.h   |  44 ++
>  arch/arm64/include/asm/pgtable-hwdef.h |  18 ++-
>  arch/arm64/include/asm/pgtable.h   |  10 +-
>  arch/arm64/mm/hugetlbpage.c| 267 
> -
>  include/linux/hugetlb.h|   2 -
>  6 files changed, 306 insertions(+), 38 deletions(-)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 4876459..ffa3c54 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -530,9 +530,6 @@ config HW_PERF_EVENTS
>  config SYS_SUPPORTS_HUGETLBFS
> def_bool y
>
> -config ARCH_WANT_GENERAL_HUGETLB
> -   def_bool y
> -
>  config ARCH_WANT_HUGE_PMD_SHARE
> def_bool y if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
>
> diff --git a/arch/arm64/include/asm/hugetlb.h 
> b/arch/arm64/include/asm/hugetlb.h
> index bb4052e..bbc1e35 100644
> --- a/arch/arm64/include/asm/hugetlb.h
> +++ b/arch/arm64/include/asm/hugetlb.h
> @@ -26,36 +26,7 @@ static inline pte_t huge_ptep_get(pte_t *ptep)
> return *ptep;
>  }
>
> -static inline void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
> -  pte_t *ptep, pte_t pte)
> -{
> -   set_pte_at(mm, addr, ptep, pte);
> -}
> -
> -static inline void huge_ptep_clear_flush(struct vm_area_struct *vma,
> -unsigned long addr, pte_t *ptep)
> -{
> -   ptep_clear_flush(vma, addr, ptep);
> -}
> -
> -static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
> -  unsigned long addr, pte_t *ptep)
> -{
> -   ptep_set_wrprotect(mm, addr, ptep);
> -}
>
> -static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
> -   unsigned long addr, pte_t *ptep)
> -{
> -   return ptep_get_and_clear(mm, addr, ptep);
> -}
> -
> -static inline int huge_ptep_set_access_flags(struct vm_area_struct *vma,
> -unsigned long addr, pte_t *ptep,
> -pte_t pte, int dirty)
> -{
> -   return ptep_set_access_flags(vma, addr, ptep, pte, dirty);
> -}
>
>  static inline void hugetlb_free_pgd_range(struct mmu_gather *tlb,
>   unsigned long addr, unsigned long 
> end,
> @@ -97,4 +68,19 @@ static inline void arch_clear_hugepage_flags(struct page 
> *page)
> clear_bit(PG_dcache_clean, &page->flags);
>  }
>
> +extern pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
> +   struct page *page, int writable);
> +#define arch_make_huge_pte arch_make_huge_pte
> +extern void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
> +   pte_t *ptep, pte_t pte);
> +extern int huge_ptep_set_access_flags(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + pte_t pte, int dirty);
> +extern pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
> +unsigned long addr, pte_t *ptep);
> +extern void huge_ptep_set_wrprotect(struct mm_struct *mm,
> +   unsigned long addr, pte_t *ptep);
> +extern void huge_ptep_clear_flush(struct vm_area_struct *vma,
> + 

Re: [RFC] kprobe'ing conditionally executed instructions

2015-12-11 Thread Steve Capper
On 11 December 2015 at 13:05, David Long  wrote:
> There is a moderate amount of code already in kprobes on ARM and the current
> ARMv8 patch to deal with conditional execution of instructions. One aspect
> of how this is handled is that instructions that fail their predicate and
> are not (technically) executed are also not treated as a hit kprobe. Steve
> Capper has suggested that the probe handling should still take place because
> we stepped through the instruction even if it was effectively a nop.  This
> would be a significant change in how it currently works on 32-bit ARM, and a
> change in the patch for ARMv8 (although it's not likely to be much of a
> change in the kernel code).
>
> I need input on this.  Do people have opinions?

Hi David,
Thanks for posting this.

Just to clarify, the reasoning behind my suggestion for kprobes always
being hit was to achieve parity with x86.

I highlighted an example of discrepancy in behaviour between arm64 and
x86 in the following email:
http://lists.infradead.org/pipermail/linux-arm-kernel/2015-August/364201.html

Cheers,
--
Steve


Re: [PATCH v2] arm64: Add support for PTE contiguous bit.

2015-10-20 Thread Steve Capper
On Mon, Oct 19, 2015 at 04:09:09PM -0400, David Woods wrote:
> The arm64 MMU supports a Contiguous bit which is a hint that the TTE
> is one of a set of contiguous entries which can be cached in a single
> TLB entry.  Supporting this bit adds new intermediate huge page sizes.
> 
> The set of huge page sizes available depends on the base page size.
> Without using contiguous pages the huge page sizes are as follows.
> 
>  4KB:   2MB  1GB
> 64KB: 512MB
> 
> With a 4KB granule, the contiguous bit groups together sets of 16 pages
> and with a 64KB granule it groups sets of 32 pages.  This enables two new
> huge page sizes in each case, so that the full set of available sizes
> is as follows.
> 
>  4KB:  64KB   2MB  32MB  1GB
> 64KB:   2MB 512MB  16GB
> 
> If a 16KB granule is used then the contiguous bit groups 128 pages
> at the PTE level and 32 pages at the PMD level.
> 
> If the base page size is set to 64KB then 2MB pages are enabled by
> default.  It is possible in the future to make 2MB the default huge
> page size for both 4KB and 64KB granules.

Thank you for the V2 David,
I have some comments below.

I would recommend running the next version of this series through
the libhugetlbfs test suite, as that may pick up a few things too.
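
For readers skimming the archive, the sizes in the commit message follow
directly from the CONT_PTE_SHIFT/CONT_PMD_SHIFT values added in the hunk
quoted below. A stand-alone illustration in plain userspace C (not the
kernel's macro layout), assuming 8-byte table entries so a PMD maps
PAGE_SIZE << (PAGE_SHIFT - 3):

#include <stdio.h>

int main(void)
{
        /* Granule, PAGE_SHIFT and the CONT shifts from the quoted patch. */
        struct { const char *granule; int page_shift, cont_pte, cont_pmd; } g[] = {
                { "4K",  12, 4, 4 },    /* 16 contiguous ptes/pmds */
                { "64K", 16, 5, 5 },    /* 32 contiguous ptes/pmds */
        };
        int i;

        for (i = 0; i < 2; i++) {
                unsigned long long pte_size = 1ULL << g[i].page_shift;
                unsigned long long pmd_size = pte_size << (g[i].page_shift - 3);

                printf("%s granule: cont-pte block %llu KB, cont-pmd block %llu MB\n",
                       g[i].granule,
                       (pte_size << g[i].cont_pte) >> 10,
                       (pmd_size << g[i].cont_pmd) >> 20);
        }
        return 0;       /* prints 64 KB / 32 MB and 2048 KB / 16384 MB */
}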

Cheers,
-- 
Steve

> 
> Signed-off-by: David Woods 
> Reviewed-by: Chris Metcalf 
> ---
>  arch/arm64/Kconfig |   3 -
>  arch/arm64/include/asm/hugetlb.h   |  30 ++---
>  arch/arm64/include/asm/pgtable-hwdef.h |  20 
>  arch/arm64/include/asm/pgtable.h   |  33 +-
>  arch/arm64/mm/hugetlbpage.c| 211 
> -
>  5 files changed, 272 insertions(+), 25 deletions(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 07d1811..3aa151d 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -464,9 +464,6 @@ config HW_PERF_EVENTS
>  config SYS_SUPPORTS_HUGETLBFS
>   def_bool y
>  
> -config ARCH_WANT_GENERAL_HUGETLB
> - def_bool y
> -
>  config ARCH_WANT_HUGE_PMD_SHARE
>   def_bool y if !ARM64_64K_PAGES
>  
> diff --git a/arch/arm64/include/asm/hugetlb.h 
> b/arch/arm64/include/asm/hugetlb.h
> index bb4052e..2b153a9 100644
> --- a/arch/arm64/include/asm/hugetlb.h
> +++ b/arch/arm64/include/asm/hugetlb.h
> @@ -26,12 +26,6 @@ static inline pte_t huge_ptep_get(pte_t *ptep)
>   return *ptep;
>  }
>  
> -static inline void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
> -pte_t *ptep, pte_t pte)
> -{
> - set_pte_at(mm, addr, ptep, pte);
> -}
> -
>  static inline void huge_ptep_clear_flush(struct vm_area_struct *vma,
>unsigned long addr, pte_t *ptep)
>  {
> @@ -44,19 +38,6 @@ static inline void huge_ptep_set_wrprotect(struct 
> mm_struct *mm,
>   ptep_set_wrprotect(mm, addr, ptep);
>  }
>  
> -static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
> - unsigned long addr, pte_t *ptep)
> -{
> - return ptep_get_and_clear(mm, addr, ptep);
> -}
> -
> -static inline int huge_ptep_set_access_flags(struct vm_area_struct *vma,
> -  unsigned long addr, pte_t *ptep,
> -  pte_t pte, int dirty)
> -{
> - return ptep_set_access_flags(vma, addr, ptep, pte, dirty);
> -}
> -
>  static inline void hugetlb_free_pgd_range(struct mmu_gather *tlb,
> unsigned long addr, unsigned long end,
> unsigned long floor,
> @@ -97,4 +78,15 @@ static inline void arch_clear_hugepage_flags(struct page 
> *page)
>   clear_bit(PG_dcache_clean, &page->flags);
>  }
>  
> +extern pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
> + struct page *page, int writable);
> +#define arch_make_huge_pte arch_make_huge_pte
> +extern void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte);
> +extern int huge_ptep_set_access_flags(struct vm_area_struct *vma,
> +   unsigned long addr, pte_t *ptep,
> +   pte_t pte, int dirty);
> +extern pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
> +  unsigned long addr, pte_t *ptep);
> +
>  #endif /* __ASM_HUGETLB_H */
> diff --git a/arch/arm64/include/asm/pgtable-hwdef.h 
> b/arch/arm64/include/asm/pgtable-hwdef.h
> index 24154b0..1b921a5 100644
> --- a/arch/arm64/include/asm/pgtable-hwdef.h
> +++ b/arch/arm64/include/asm/pgtable-hwdef.h
> @@ -55,6 +55,24 @@
>  #define SECTION_MASK (~(SECTION_SIZE-1))
>  
>  /*
> + * Contiguous page definitions.
> + */
> +#ifdef CONFIG_ARM64_64K_PAGES
> +#define CONT_PTE_SHIFT   5
> +#define CONT_PMD_SHIFT   5
> +#else
> +#define CONT_PTE_SHIFT   4
> +#define CONT_PMD_SHIFT   4
> +#endif
> +
> 

Re: [PATCHv3 10/11] arm64: Add 16K page size support

2015-10-15 Thread Steve Capper
On 15 October 2015 at 15:48, Suzuki K. Poulose  wrote:
> On 15/10/15 15:06, Mark Rutland wrote:
>>
>> Hi,
>>
>
> I have fixed all the nits locally. Thanks for pointing them out.
>
>>>   config FORCE_MAX_ZONEORDER
>>> int
>>> default "14" if (ARM64_64K_PAGES && TRANSPARENT_HUGEPAGE)
>>> +   default "12" if (ARM64_16K_PAGES && TRANSPARENT_HUGEPAGE)
>>> default "11"
>>
>>
>> I'm a little lost here. How are these numbers derived?
>>
>
> I struggled to find the right value for 16K. Thanks to Steve Capper
> for the following explanation. I will add it as a comment.
>
> All allocations from the buddy allocator have to have compound order
> strictly less than MAX_ORDER. i.e, the maximum allocation size is
> (MAX_ORDER - 1) PAGES. To align with the transparent huge page size,
> we get :
>
>  (MAX_ORDER - 1) + PAGE_SHIFT = PMD_SHIFT
>
> Which gives us:
>
> MAX_ORDER = PAGE_SHIFT - 3 + PAGE_SHIFT - PAGE_SHIFT + 1
>   = PAGE_SHIFT - 2
>
> That raises an interesting question about the selection of the value
> for 4K. Shouldn't that be 10 instead of 11 ?
>
> Steve ?

Hi,
My understanding is that 11 is a "good minimum" value for the page
allocator with 4KB pages.
(There are references to it being 10 in 2.4 kernels but raised to 11
on 2.6 kernels?)

We need to raise the minimum when we have a 16KB or 64KB PAGE_SIZE to
be able to allocate 32MB or 512MB Transparent HugePages.
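
To make the arithmetic above concrete, a quick stand-alone check in plain
userspace C. It assumes 8-byte page table entries, i.e. PMD_SHIFT =
2 * PAGE_SHIFT - 3, which holds for the 4K, 16K and 64K granules:

#include <stdio.h>

int main(void)
{
        int page_shifts[] = { 12, 14, 16 };     /* 4K, 16K, 64K granules */
        int i;

        for (i = 0; i < 3; i++) {
                int page_shift = page_shifts[i];
                int pmd_shift = 2 * page_shift - 3;
                /* (MAX_ORDER - 1) + PAGE_SHIFT = PMD_SHIFT */
                int max_order = pmd_shift - page_shift + 1;

                printf("PAGE_SHIFT=%d: PMD_SHIFT=%d, MAX_ORDER=%d\n",
                       page_shift, pmd_shift, max_order);
        }
        return 0;       /* prints 10, 12 and 14 respectively */
}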

Cheers,
--
Steve

>
>>> -#ifdef CONFIG_ARM64_64K_PAGES
>>> +#if defined(CONFIG_ARM64_64K_PAGES)
>>>   #define NR_FIX_BTMAPS 4
>>> +#elif  defined (CONFIG_ARM64_16K_PAGES)
>>> +#define NR_FIX_BTMAPS  16
>>>   #else
>>>   #define NR_FIX_BTMAPS 64
>>>   #endif
>>
>>
>> We could include <linux/sizes.h> and simplify this to:
>>
>> #define NR_FIX_BTMAPS (SZ_256K / PAGE_SIZE)
>>
>> Which works for me locally.
>
>
> Nice cleanup. I will pick that as a separate patch in the series.
>
>>
>>> diff --git a/arch/arm64/include/asm/thread_info.h
>>> b/arch/arm64/include/asm/thread_info.h
>>> index 5eac6a2..90c7ff2 100644
>>> --- a/arch/arm64/include/asm/thread_info.h
>>> +++ b/arch/arm64/include/asm/thread_info.h
>>> @@ -25,6 +25,8 @@
>>>
>>>   #ifdef CONFIG_ARM64_4K_PAGES
>>>   #define THREAD_SIZE_ORDER 2
>>> +#elif defined(CONFIG_ARM64_16K_PAGES)
>>> +#define THREAD_SIZE_ORDER  0
>>>   #endif
>>>   #define THREAD_SIZE   16384
>>
>>
>> The above looks correct.
>>
>> As an open/general question, why do both THREAD_SIZE_ORDER and
>> THREAD_SIZE exist? One really should be defined in terms of the other.
>
>
> I think its mainly for choosing the mechanism for stack allocation. If it
> is a multiple of a page, you allocate a page. If not, uses a kmem_cache.
>
>
>>>   #define id_aa64mmfr0_tgran_shift  ID_AA64MMFR0_TGRAN4_SHIFT
>>>   #define id_aa64mmfr0_tgran_on ID_AA64MMFR0_TGRAN4_ON
>>
>>
>> I assume you'll s/ON/SUPPORTED/ per comments in another thread.
>>
>
> Yes
>
> Thanks
> Suzuki
>


Re: [PATCH v2 21/22] arm64: cpuinfo: Expose MIDR_EL1 and REVIDR_EL1 to sysfs

2015-10-06 Thread Steve Capper
On 6 October 2015 at 11:25, Mark Rutland  wrote:
> On Tue, Oct 06, 2015 at 11:18:42AM +0100, Steve Capper wrote:
>> On 6 October 2015 at 10:09, Russell King - ARM Linux
>>  wrote:
>> > On Mon, Oct 05, 2015 at 06:02:10PM +0100, Suzuki K. Poulose wrote:
>> >> +static int __init cpuinfo_regs_init(void)
>> >> +{
>> >> + int cpu, ret;
>> >> +
>> >> + for_each_present_cpu(cpu) {
>> >> + struct device *dev = get_cpu_device(cpu);
>> >> +
>> >> + if (!dev)
>> >> + return -1;
>> >
>> > NAK.  Go figure out why, I'm too lazy to tell you.
>>
>> I will correct the return code to be -ENODEV.
>> Was that the reasoning behind the NAK?
>
> I suspect the half-initialised sysfs groups also have something to do
> with it...

Okay, cheers Mark, I see what you mean.
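
To spell out the "half-initialised" point for the archive: an error return
part-way through the loop leaves sysfs groups registered for some CPUs but
not others. A rough sketch of the shape such a fix could take; the
attribute group name below is a placeholder, not the one from Suzuki's
patch:

static int __init cpuinfo_regs_init(void)
{
        int cpu, failed_cpu = -1, ret = 0;

        for_each_present_cpu(cpu) {
                struct device *dev = get_cpu_device(cpu);

                if (!dev)
                        ret = -ENODEV;
                else
                        ret = sysfs_create_group(&dev->kobj, &cpuinfo_attr_group);

                if (ret) {
                        failed_cpu = cpu;
                        break;
                }
        }

        if (!ret)
                return 0;

        /* Unwind the groups created before the failure. */
        for_each_present_cpu(cpu) {
                if (cpu == failed_cpu)
                        break;
                sysfs_remove_group(&get_cpu_device(cpu)->kobj, &cpuinfo_attr_group);
        }

        return ret;
}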

>
> Mark.
>
>>
>> >
>> >> +
>> >> + ret = sysfs_create_group(>kobj, _attr_group);
>> >> + if (ret)
>> >> + return ret;
>> >> + }
>> >> +
>> >> + return 0;
>> >> +}
>> >> +
>> >> +device_initcall(cpuinfo_regs_init);
>> >> --
>> >> 1.7.9.5
>>


Re: [PATCH v2 21/22] arm64: cpuinfo: Expose MIDR_EL1 and REVIDR_EL1 to sysfs

2015-10-06 Thread Steve Capper
On 6 October 2015 at 10:09, Russell King - ARM Linux
 wrote:
> On Mon, Oct 05, 2015 at 06:02:10PM +0100, Suzuki K. Poulose wrote:
>> +static int __init cpuinfo_regs_init(void)
>> +{
>> + int cpu, ret;
>> +
>> + for_each_present_cpu(cpu) {
>> + struct device *dev = get_cpu_device(cpu);
>> +
>> + if (!dev)
>> + return -1;
>
> NAK.  Go figure out why, I'm too lazy to tell you.

I will correct the return code to be -ENODEV.
Was that the reasoning behind the NAK?

>
>> +
>> + ret = sysfs_create_group(>kobj, _attr_group);
>> + if (ret)
>> + return ret;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +device_initcall(cpuinfo_regs_init);
>> --
>> 1.7.9.5


Re: [PATCH] arm64: Add support for PTE contiguous bit.

2015-09-25 Thread Steve Capper
On 21 September 2015 at 09:44, David Woods  wrote:
>
> Steve,

Hi Dave,

>
> Thanks for your review and comments.  I take your points about the 16k
> granule - it's helpful to know that support is in the works. However, I'm
> not sure I agree with your reading of section 4.4.2. It's clear that for 16k
> granules, the number of contiguous pages is different for the PTE and PMD
> levels.  But I don't see anywhere it says that for 4K and 64K that the
> contig bit is not supported at the PMD level - just that the number of
> contiguous pages is the same at each level.

Many apologies, I appear to have led you down the garden path there.
Having double checked at ARM, the valid contiguous page sizes are indeed:
4K granule:
16 x ptes = 64K
16 x pmds = 32M
16 x puds = 16G

16K granule:
128 x ptes = 2M
32 x pmds = 1G

64K granule:
32 x ptes = 2M
32 x pmds = 16G
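
As a quick sanity check of those figures, using the block size each level
maps at the given granule:

4K granule:  16 x 64KB-block ptes? no -- 16 x 4KB ptes = 64KB,  16 x 2MB pmds = 32MB,  16 x 1GB puds = 16GB
16K granule: 128 x 16KB ptes = 2MB,  32 x 32MB pmds = 1GB
64K granule: 32 x 64KB ptes = 2MB,   32 x 512MB pmds = 16GB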

>
> I tried using the tarmac trace module of the ARM simulator to support this
> idea by turning on MMU tracing.  Using 4k granule, I created 64k and 32m
> pages and touched each location in the page.  In both cases, the trace
> recorded just one TLB fill (rather than the 16 you'd expect if the
> contiguous bit were being ignored) and it indicated the expected page size.
>
> 1817498494 clk cpu2 TLB FILL cpu2.S1TLB 64K 0x20_NS vmid=0, nG
> asid=303:0x08fa36_NS Normal InnerShareable Inner=WriteBackWriteAllocate
> Outer=WriteBackWriteAllocate xn=0 pxn=1 ContiguousHint=1
>
> 1263366314 clk cpu2 TLB FILL cpu2.UTLB 32M 0x20_NS vmid=0, nG
> asid=300:0x08f600_NS Normal InnerShareable Inner=WriteBackWriteAllocate
> Outer=WriteBackWriteAllocate xn=0 pxn=1 ContiguousHint=1
>
> I'll try this with a 64k granule next.  I'm not sure what will happen with
> 16G pages since we are using an A53 model which I don't think supports such
> large pages.

The Cortex-A53 supported TLB sizes can be found in the TRM:
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0500f/Chddiifa.html

My understanding is that the core is allowed to ignore the contiguous
bit if it doesn't support the particular TLB entry size, or to substitute
a slightly smaller TLB entry than the hint would allow. Anyway, do give
it a go :-).

Cheers,
--
Steve


Re: [PATCH 7/7] arm64: Mark kernel page ranges contiguous

2015-09-17 Thread Steve Capper
Hi Jeremy,
One quick comment for now below.
I ran into a problem testing this on my Seattle board, and needed the fix below.

Cheers,
--
Steve

On 16 September 2015 at 20:03, Jeremy Linton  wrote:
> With 64k pages, the next larger segment size is 512M. The linux
> kernel also uses different protection flags to cover its code and data.
> Because of this requirement, the vast majority of the kernel code and
> data structures end up being mapped with 64k pages instead of the larger
> pages common with a 4k page kernel.
>
> Recent ARM processors support a contiguous bit in the
> page tables which allows a TLB to cover a range larger than a
> single PTE if that range is mapped into physically contiguous
> ram.
>
> So, for the kernel its a good idea to set this flag. Some basic
> micro benchmarks show it can significantly reduce the number of
> L1 dTLB refills.
>
> Signed-off-by: Jeremy Linton 
> ---
>  arch/arm64/mm/mmu.c | 70 
> +++--
>  1 file changed, 62 insertions(+), 8 deletions(-)
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 9211b85..c7abbcc 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -80,19 +80,55 @@ static void split_pmd(pmd_t *pmd, pte_t *pte)
> do {
> /*
>  * Need to have the least restrictive permissions available
> -* permissions will be fixed up later
> +* permissions will be fixed up later. Default the new page
> +* range as contiguous ptes.
>  */
> -   set_pte(pte, pfn_pte(pfn, PAGE_KERNEL_EXEC));
> +   set_pte(pte, pfn_pte(pfn, PAGE_KERNEL_EXEC_CONT));
> pfn++;
> } while (pte++, i++, i < PTRS_PER_PTE);
>  }
>
> +/*
> + * Given a PTE with the CONT bit set, determine where the CONT range
> + * starts, and clear the entire range of PTE CONT bits.
> + */
> +static void clear_cont_pte_range(pte_t *pte, unsigned long addr)
> +{
> +   int i;
> +
> +   pte -= CONT_RANGE_OFFSET(addr);
> +   for (i = 0; i < CONT_RANGE; i++) {
> +   set_pte(pte, pte_mknoncont(*pte));
> +   pte++;
> +   }
> +   flush_tlb_all();
> +}
> +
> +/*
> + * Given a range of PTEs set the pfn and provided page protection flags
> + */
> +static void __populate_init_pte(pte_t *pte, unsigned long addr,
> +   unsigned long end, phys_addr_t phys,
> +   pgprot_t prot)
> +{
> +   unsigned long pfn = __phys_to_pfn(phys);
> +
> +   do {
> +   /* clear all the bits except the pfn, then apply the prot */
> +   set_pte(pte, pfn_pte(pfn, prot));
> +   pte++;
> +   pfn++;
> +   addr += PAGE_SIZE;
> +   } while (addr != end);
> +}
> +
>  static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
> - unsigned long end, unsigned long pfn,
> + unsigned long end, phys_addr_t phys,
>   pgprot_t prot,
>   void *(*alloc)(unsigned long size))
>  {
> pte_t *pte;
> +   unsigned long next;
>
> if (pmd_none(*pmd) || pmd_sect(*pmd)) {
> pte = alloc(PTRS_PER_PTE * sizeof(pte_t));
> @@ -105,9 +141,28 @@ static void alloc_init_pte(pmd_t *pmd, unsigned long 
> addr,
>
> pte = pte_offset_kernel(pmd, addr);
> do {
> -   set_pte(pte, pfn_pte(pfn, prot));
> -   pfn++;
> -   } while (pte++, addr += PAGE_SIZE, addr != end);
> +   next = min(end, (addr + CONT_SIZE) & CONT_MASK);
> +   if (((addr | next | phys) & CONT_RANGE_MASK) == 0) {
> +   /* a block of CONT_RANGE_SIZE PTEs */
> +   __populate_init_pte(pte, addr, next, phys,
> +   prot | __pgprot(PTE_CONT));
> +   pte += CONT_RANGE;
> +   } else {
> +   /*
> +* If the range being split is already inside of a
> +* contiguous range but this PTE isn't going to be
> +* contiguous, then we want to unmark the adjacent
> +* ranges, then update the portion of the range we
> +* are interested in.
> +*/
> +clear_cont_pte_range(pte, addr);
> +__populate_init_pte(pte, addr, next, phys, prot);
> +pte += CONT_RANGE_OFFSET(next - addr);

I think this should instead be:
pte += (next - addr) >> PAGE_SHIFT;

Without the above change, I get panics on boot with my Seattle board
when efi_rtc is initialised.
(I think the EFI runtime stuff exacerbates the non-contiguous code
path hence I notice it on my system).
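To make the failure concrete, here is a minimal user-space sketch of the
two expressions; it assumes CONT_RANGE_OFFSET(x) expands to
((x) >> PAGE_SHIFT) & (CONT_RANGE - 1), as its use in
clear_cont_pte_range() above suggests:

#include <stdio.h>

#define PAGE_SHIFT		12	/* 4K pages */
#define CONT_RANGE		16	/* so CONT_SIZE == 64K */
#define CONT_RANGE_OFFSET(x)	(((x) >> PAGE_SHIFT) & (CONT_RANGE - 1))

int main(void)
{
	/* An else-branch chunk where addr/next are CONT-aligned but phys is not. */
	unsigned long addr = 0x10000, next = 0x20000;	/* next - addr == 64K */

	printf("ptes to advance:         %lu\n", (next - addr) >> PAGE_SHIFT);	  /* 16 */
	printf("CONT_RANGE_OFFSET gives: %lu\n", CONT_RANGE_OFFSET(next - addr)); /* 0  */
	return 0;
}

When the two differ, the pte pointer stops tracking addr, so later
iterations rewrite earlier slots and the tail of the range is never
populated -- consistent with the EFI runtime mappings (whose phys is
rarely aligned to the virtual CONT boundary) being the first thing to
trip over it.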

> +   }
> +
> +   


Re: [PATCH] arm64: Add support for PTE contiguous bit.

2015-09-16 Thread Steve Capper
Hi David,
Some initial comments below.

Cheers,
-- 
Steve

On Tue, Sep 15, 2015 at 02:01:57PM -0400, David Woods wrote:
> The arm64 MMU supports a Contiguous bit which is a hint that the TTE
> is one of a set of contiguous entries which can be cached in a single
> TLB entry.  Supporting this bit adds new intermediate huge page sizes.
> 
> The set of huge page sizes available depends on the base page size.
> Without using contiguous pages the huge page sizes are as follows.
> 
>  4KB:   2MB  1GB
> 64KB: 512MB  4TB
> 
> With 4KB pages, the contiguous bit groups together sets of 16 pages
> and with 64KB pages it groups sets of 32 pages.  This enables two new
> huge page sizes in each case, so that the full set of available sizes
> is as follows.
> 
>  4KB:  64KB   2MB  32MB  1GB
> 64KB:   2MB 512MB  16GB  4TB
> 
> If the base page size is set to 64KB then 2MB pages are enabled by
> default.  It is possible in the future to make 2MB the default huge
> page size for both 4KB and 64KB pages.
> 
> Signed-off-by: David Woods 
> Reviewed-by: Chris Metcalf 
> ---
>  arch/arm64/Kconfig |   3 -
>  arch/arm64/include/asm/hugetlb.h   |   4 +
>  arch/arm64/include/asm/pgtable-hwdef.h |  15 +++
>  arch/arm64/include/asm/pgtable.h   |  30 +-
>  arch/arm64/mm/hugetlbpage.c| 165 
> -
>  5 files changed, 210 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 7d95663..8310e38 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -447,9 +447,6 @@ config HW_PERF_EVENTS
>  config SYS_SUPPORTS_HUGETLBFS
>   def_bool y
>  
> -config ARCH_WANT_GENERAL_HUGETLB
> - def_bool y
> -
>  config ARCH_WANT_HUGE_PMD_SHARE
>   def_bool y if !ARM64_64K_PAGES
>  
> diff --git a/arch/arm64/include/asm/hugetlb.h 
> b/arch/arm64/include/asm/hugetlb.h
> index bb4052e..e5af553 100644
> --- a/arch/arm64/include/asm/hugetlb.h
> +++ b/arch/arm64/include/asm/hugetlb.h
> @@ -97,4 +97,8 @@ static inline void arch_clear_hugepage_flags(struct page 
> *page)
>   clear_bit(PG_dcache_clean, &page->flags);
>  }
>  
> +extern pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
> + struct page *page, int writable);
> +#define arch_make_huge_pte arch_make_huge_pte
> +
>  #endif /* __ASM_HUGETLB_H */
> diff --git a/arch/arm64/include/asm/pgtable-hwdef.h 
> b/arch/arm64/include/asm/pgtable-hwdef.h
> index 24154b0..da73243 100644
> --- a/arch/arm64/include/asm/pgtable-hwdef.h
> +++ b/arch/arm64/include/asm/pgtable-hwdef.h
> @@ -55,6 +55,19 @@
>  #define SECTION_MASK (~(SECTION_SIZE-1))
>  
>  /*
> + * Contiguous large page definitions.
> + */
> +#ifdef CONFIG_ARM64_64K_PAGES
> +#define  CONTIG_SHIFT5
> +#define CONTIG_PAGES 32
> +#else
> +#define  CONTIG_SHIFT4
> +#define CONTIG_PAGES 16
> +#endif
> +#define  CONTIG_PTE_SIZE (CONTIG_PAGES * PAGE_SIZE)
> +#define  CONTIG_PTE_MASK (~(CONTIG_PTE_SIZE - 1))

Careful here, CONTIG_PAGES should really be CONTIG_PTES.

If support is added for a 16KB granule case we are allowed:
128 x 16KB pages (ptes) to make a 2MB huge page, or
32 x 32MB blocks (pmds) to make a 1GB huge page.

i.e. CONTIG_PTES != CONTIG_PMDS.

For 4KB or 64KB pages we are only allowed contiguous ptes, so
CONTIG_PMDS == 0 in these cases.

> +
> +/*
>   * Hardware page table definitions.
>   *
>   * Level 1 descriptor (PUD).
> @@ -83,6 +96,7 @@
>  #define PMD_SECT_S   (_AT(pmdval_t, 3) << 8)
>  #define PMD_SECT_AF  (_AT(pmdval_t, 1) << 10)
>  #define PMD_SECT_NG  (_AT(pmdval_t, 1) << 11)
> +#define PMD_SECT_CONTIG  (_AT(pmdval_t, 1) << 52)
>  #define PMD_SECT_PXN (_AT(pmdval_t, 1) << 53)
>  #define PMD_SECT_UXN (_AT(pmdval_t, 1) << 54)
>  
> @@ -105,6 +119,7 @@
>  #define PTE_AF   (_AT(pteval_t, 1) << 10)/* 
> Access Flag */
>  #define PTE_NG   (_AT(pteval_t, 1) << 11)/* nG */
>  #define PTE_DBM  (_AT(pteval_t, 1) << 51)/* 
> Dirty Bit Management */
> +#define PTE_CONTIG   (_AT(pteval_t, 1) << 52)/* Contiguous */
>  #define PTE_PXN  (_AT(pteval_t, 1) << 53)/* 
> Privileged XN */
>  #define PTE_UXN  (_AT(pteval_t, 1) << 54)/* User 
> XN */
>  
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index 6900b2d9..df5ec64 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -144,6 +144,7 @@ extern struct page *empty_zero_page;
>  #define pte_special(pte) (!!(pte_val(pte) & PTE_SPECIAL))
>  #define pte_write(pte)   (!!(pte_val(pte) & PTE_WRITE))
>  #define pte_exec(pte)(!(pte_val(pte) & PTE_UXN))
> +#define pte_contig(pte)  (!!(pte_val(pte) & PTE_CONTIG))
>  
>  #ifdef 

Re: [PATCH] arm64: Add support for PTE contiguous bit.

2015-09-16 Thread Steve Capper
On 15 September 2015 at 19:01, David Woods  wrote:
> The arm64 MMU supports a Contiguous bit which is a hint that the TTE
> is one of a set of contiguous entries which can be cached in a single
> TLB entry.  Supporting this bit adds new intermediate huge page sizes.
>
> The set of huge page sizes available depends on the base page size.
> Without using contiguous pages the huge page sizes are as follows.
>
>  4KB:   2MB  1GB
> 64KB: 512MB  4TB

We just have 512MB for a 64KB granule.
As per [1] D4.2.6 - "The VMSAv8-64 translation table format" page D4-1668.

>
> With 4KB pages, the contiguous bit groups together sets of 16 pages
> and with 64KB pages it groups sets of 32 pages.  This enables two new
> huge page sizes in each case, so that the full set of available sizes
> is as follows.
>
>  4KB:  64KB   2MB  32MB  1GB
> 64KB:   2MB 512MB  16GB  4TB
>
> If the base page size is set to 64KB then 2MB pages are enabled by
> default.  It is possible in the future to make 2MB the default huge
> page size for both 4KB and 64KB pages.
>

Hi David,
Thanks for posting this, and apologies in advance for talking about
the ARM ARM[1]...

D4.4.2 "Other fields in the VMSAv8-64 translation table format
descriptors" (page D4-1715)
Only gives examples of the contiguous bit being used for level 3
descriptors (i.e. PTEs) when running with a 4KB and 64KB granule.

With a 16KB granule we *can* have a contiguous bit being used by level
2 descriptors (i.e. PMDs), so the pmd_contig logic could perhaps be
used in combination with Suzuki's 16KB PAGE_SIZE series at:
http://lists.infradead.org/pipermail/linux-arm-kernel/2015-September/370117.html

I will read through the rest of the patch and post more feedback

Cheers,
--
Steve

[1] - http://infocenter.arm.com/help/topic/com.arm.doc.ddi0487a.g/index.html




> Signed-off-by: David Woods 
> Reviewed-by: Chris Metcalf 
> ---
>  arch/arm64/Kconfig |   3 -
>  arch/arm64/include/asm/hugetlb.h   |   4 +
>  arch/arm64/include/asm/pgtable-hwdef.h |  15 +++
>  arch/arm64/include/asm/pgtable.h   |  30 +-
>  arch/arm64/mm/hugetlbpage.c| 165 
> -
>  5 files changed, 210 insertions(+), 7 deletions(-)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 7d95663..8310e38 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -447,9 +447,6 @@ config HW_PERF_EVENTS
>  config SYS_SUPPORTS_HUGETLBFS
> def_bool y
>
> -config ARCH_WANT_GENERAL_HUGETLB
> -   def_bool y
> -
>  config ARCH_WANT_HUGE_PMD_SHARE
> def_bool y if !ARM64_64K_PAGES
>
> diff --git a/arch/arm64/include/asm/hugetlb.h 
> b/arch/arm64/include/asm/hugetlb.h
> index bb4052e..e5af553 100644
> --- a/arch/arm64/include/asm/hugetlb.h
> +++ b/arch/arm64/include/asm/hugetlb.h
> @@ -97,4 +97,8 @@ static inline void arch_clear_hugepage_flags(struct page 
> *page)
> clear_bit(PG_dcache_clean, &page->flags);
>  }
>
> +extern pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
> +   struct page *page, int writable);
> +#define arch_make_huge_pte arch_make_huge_pte
> +
>  #endif /* __ASM_HUGETLB_H */
> diff --git a/arch/arm64/include/asm/pgtable-hwdef.h 
> b/arch/arm64/include/asm/pgtable-hwdef.h
> index 24154b0..da73243 100644
> --- a/arch/arm64/include/asm/pgtable-hwdef.h
> +++ b/arch/arm64/include/asm/pgtable-hwdef.h
> @@ -55,6 +55,19 @@
>  #define SECTION_MASK   (~(SECTION_SIZE-1))
>
>  /*
> + * Contiguous large page definitions.
> + */
> +#ifdef CONFIG_ARM64_64K_PAGES
> +#defineCONTIG_SHIFT5
> +#define CONTIG_PAGES   32
> +#else
> +#defineCONTIG_SHIFT4
> +#define CONTIG_PAGES   16
> +#endif
> +#defineCONTIG_PTE_SIZE (CONTIG_PAGES * PAGE_SIZE)
> +#defineCONTIG_PTE_MASK (~(CONTIG_PTE_SIZE - 1))
> +
> +/*
>   * Hardware page table definitions.
>   *
>   * Level 1 descriptor (PUD).
> @@ -83,6 +96,7 @@
>  #define PMD_SECT_S (_AT(pmdval_t, 3) << 8)
>  #define PMD_SECT_AF(_AT(pmdval_t, 1) << 10)
>  #define PMD_SECT_NG(_AT(pmdval_t, 1) << 11)
> +#define PMD_SECT_CONTIG(_AT(pmdval_t, 1) << 52)
>  #define PMD_SECT_PXN   (_AT(pmdval_t, 1) << 53)
>  #define PMD_SECT_UXN   (_AT(pmdval_t, 1) << 54)
>
> @@ -105,6 +119,7 @@
>  #define PTE_AF (_AT(pteval_t, 1) << 10)/* Access 
> Flag */
>  #define PTE_NG (_AT(pteval_t, 1) << 11)/* nG */
>  #define PTE_DBM(_AT(pteval_t, 1) << 51)/* 
> Dirty Bit Management */
> +#define PTE_CONTIG (_AT(pteval_t, 1) << 52)/* Contiguous 
> */
>  #define PTE_PXN(_AT(pteval_t, 1) << 53)/* 
> Privileged XN */
>  #define PTE_UXN(_AT(pteval_t, 1) << 54)/* 
> User XN */
>
> diff --git a/arch/arm64/include/asm/pgtable.h 
> 

Re: [PATCH 12/14] arm64: Check for selected granule support

2015-08-13 Thread Steve Capper
On 13 August 2015 at 12:34, Suzuki K. Poulose  wrote:
> From: "Suzuki K. Poulose" 
>
> Ensure that the selected page size is supported by the
> CPU(s).
>
> Cc: Mark Rutland 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Signed-off-by: Suzuki K. Poulose 
> ---
>  arch/arm64/include/asm/sysreg.h |6 ++
>  arch/arm64/kernel/head.S|   24 +++-
>  2 files changed, 29 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
> index a7f3d4b..e01d323 100644
> --- a/arch/arm64/include/asm/sysreg.h
> +++ b/arch/arm64/include/asm/sysreg.h
> @@ -87,4 +87,10 @@ static inline void config_sctlr_el1(u32 clear, u32 set)
>  }
>  #endif
>
> +#define ID_AA64MMFR0_TGran4_SHIFT  28
> +#define ID_AA64MMFR0_TGran64_SHIFT 24
> +
> +#define ID_AA64MMFR0_TGran4_ENABLED0x0
> +#define ID_AA64MMFR0_TGran64_ENABLED   0x0
> +
>  #endif /* __ASM_SYSREG_H */
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index 01b8e58..0cb04db 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -31,10 +31,11 @@
>  #include <asm/cputype.h>
>  #include <asm/kernel-pgtable.h>
>  #include <asm/memory.h>
> -#include <asm/thread_info.h>
>  #include <asm/pgtable-hwdef.h>
>  #include <asm/pgtable.h>
>  #include <asm/page.h>
> +#include <asm/sysreg.h>
> +#include <asm/thread_info.h>
>  #include <asm/virt.h>
>
>  #define __PHYS_OFFSET  (KERNEL_START - TEXT_OFFSET)
> @@ -606,9 +607,25 @@ ENDPROC(__secondary_switched)
>   *  x27 = *virtual* address to jump to upon completion
>   *
>   * other registers depend on the function called upon completion
> + * Checks if the selected granule size is supported by the CPU.
>   */
> +#if defined(CONFIG_ARM64_64K_PAGES)
> +
> +#define ID_AA64MMFR0_TGran_SHIFT   ID_AA64MMFR0_TGran64_SHIFT
> +#define ID_AA64MMFR0_TGran_ENABLED ID_AA64MMFR0_TGran64_ENABLED
> +
> +#else
> +
> +#define ID_AA64MMFR0_TGran_SHIFT   ID_AA64MMFR0_TGran4_SHIFT
> +#define ID_AA64MMFR0_TGran_ENABLED ID_AA64MMFR0_TGran4_ENABLED
> +
> +#endif
>  .section ".idmap.text", "ax"
>  __enable_mmu:
> +   mrs x1, ID_AA64MMFR0_EL1
> +   ubfx x2, x1, #ID_AA64MMFR0_TGran_SHIFT, 4
> +   cmp x2, #ID_AA64MMFR0_TGran_ENABLED
> +   b.ne __no_granule_support
> ldr x5, =vectors
> msr vbar_el1, x5
> msr ttbr0_el1, x25  // load TTBR0
> @@ -626,3 +643,8 @@ __enable_mmu:
> isb
> br  x27
>  ENDPROC(__enable_mmu)
> +
> +__no_granule_support:
> +   wfe
> +   b __no_granule_support
> +ENDPROC(__no_granule_support)
> --
> 1.7.9.5
>

Hi Suzuki,
Is it possible to tell the user that the kernel has failed to boot due
to the kernel granule being unsupported?

Cheers,
--
Steve


Re: [PATCH v8 3/7] arm64: Kprobes with single stepping support

2015-08-13 Thread Steve Capper
Hi David,

On 11 August 2015 at 01:52, David Long  wrote:
> From: Sandeepa Prabhu 
>
> Add support for basic kernel probes(kprobes) and jump probes
> (jprobes) for ARM64.
>
> Kprobes utilizes software breakpoint and single step debug
> exceptions supported on ARM v8.
>
> A software breakpoint is placed at the probe address to trap the
> kernel execution into the kprobe handler.
>
> ARM v8 supports enabling single stepping before the break exception
> return (ERET), with next PC in exception return address (ELR_EL1). The
> kprobe handler prepares an executable memory slot for out-of-line
> execution with a copy of the original instruction being probed, and
> enables single stepping. The PC is set to the out-of-line slot address
> before the ERET. With this scheme, the instruction is executed with the
> exact same register context except for the PC (and DAIF) registers.
>
> Debug mask (PSTATE.D) is enabled only when single stepping a recursive
> kprobe, e.g.: during kprobes reenter so that probed instruction can be
> single stepped within the kprobe handler -exception- context.
> The recursion depth of kprobe is always 2, i.e. upon probe re-entry,
> any further re-entry is prevented by not calling handlers and the case
> counted as a missed kprobe).
>
> Single stepping from the x-o-l slot has a drawback for PC-relative accesses
> like branching and symbolic literals access as the offset from the new PC
> (slot address) may not be ensured to fit in the immediate value of
> the opcode. Such instructions need simulation, so reject
> probing them.
>
> Instructions generating exceptions or cpu mode change are rejected
> for probing.
>
> Instructions using Exclusive Monitor are rejected too.
>
> System instructions are mostly enabled for stepping, except MSR/MRS
> accesses to "DAIF" flags in PSTATE, which are not safe for
> probing.
>
> Thanks to Steve Capper and Pratyush Anand for several suggested
> Changes.
>
> Signed-off-by: Sandeepa Prabhu 
> Signed-off-by: Steve Capper 

Please remove my SoB, we can replace it with a Reviewed-by hopefully soon :-).


> Signed-off-by: David A. Long 
> ---
>  arch/arm64/Kconfig  |   1 +
>  arch/arm64/include/asm/debug-monitors.h |   5 +
>  arch/arm64/include/asm/kprobes.h|  62 
>  arch/arm64/include/asm/probes.h |  50 +++
>  arch/arm64/include/asm/ptrace.h |   3 +-
>  arch/arm64/kernel/Makefile  |   1 +
>  arch/arm64/kernel/debug-monitors.c  |  35 ++-
>  arch/arm64/kernel/kprobes-arm64.c   |  68 
>  arch/arm64/kernel/kprobes-arm64.h   |  28 ++
>  arch/arm64/kernel/kprobes.c | 537 
> 
>  arch/arm64/kernel/kprobes.h |  24 ++
>  arch/arm64/kernel/vmlinux.lds.S |   1 +
>  arch/arm64/mm/fault.c   |  25 ++
>  13 files changed, 829 insertions(+), 11 deletions(-)
>  create mode 100644 arch/arm64/include/asm/kprobes.h
>  create mode 100644 arch/arm64/include/asm/probes.h
>  create mode 100644 arch/arm64/kernel/kprobes-arm64.c
>  create mode 100644 arch/arm64/kernel/kprobes-arm64.h
>  create mode 100644 arch/arm64/kernel/kprobes.c
>  create mode 100644 arch/arm64/kernel/kprobes.h
>

[...]

> +
> +void __kprobes kprobe_handler(struct pt_regs *regs)
> +{
> +   struct kprobe *p, *cur;
> +   struct kprobe_ctlblk *kcb;
> +   unsigned long addr = instruction_pointer(regs);
> +
> +   kcb = get_kprobe_ctlblk();
> +   cur = kprobe_running();
> +
> +   p = get_kprobe((kprobe_opcode_t *) addr);
> +
> +   if (p) {
> +   if (cur) {
> +   if (reenter_kprobe(p, regs, kcb))
> +   return;
> +   } else if (!p->ainsn.check_condn ||
> +  p->ainsn.check_condn(p, regs)) {
> +   /* Probe hit and conditional execution check ok. */
> +   set_current_kprobe(p);
> +   kcb->kprobe_status = KPROBE_HIT_ACTIVE;
> +
> +   /*
> +* If we have no pre-handler or it returned 0, we
> +* continue with normal processing.  If we have a
> +* pre-handler and it returned non-zero, it prepped
> +* for calling the break_handler below on re-entry,
> +* so get out doing nothing more here.
> +*
> +* pre_handler can hit a breakpoint and can step thru
> +* before return, keep PSTATE D-flag enabled until
> +* pre_handler return back.
> + 


[PATCH] kprobes: Update examples to target _do_fork

2015-08-12 Thread Steve Capper
In commit 3033f14ab78c ("clone: support passing tls argument via C
rather than pt_regs magic"), the kernel calls _do_fork in places where
it previously called do_fork.

Unfortunately, the kprobe examples target do_fork; thus no events
appear to fire when one runs the example modules.

This commit updates the kprobe example code s.t. _do_fork is targeted
instead, and the examples work as expected.
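
For reference, the prototype the jprobe proxy now has to mirror
(reconstructed from the updated j_do_fork below rather than quoted from
kernel/fork.c, so treat it as a sketch):

long _do_fork(unsigned long clone_flags, unsigned long stack_start,
	      unsigned long stack_size, int __user *parent_tidptr,
	      int __user *child_tidptr, unsigned long tls);

A jprobe entry point must take exactly the same arguments as the function
it shadows, which is why j_do_fork gains the trailing tls parameter as
well as the new name.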

Signed-off-by: Steve Capper 
---
 samples/kprobes/jprobe_example.c| 8 
 samples/kprobes/kprobe_example.c| 2 +-
 samples/kprobes/kretprobe_example.c | 2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/samples/kprobes/jprobe_example.c b/samples/kprobes/jprobe_example.c
index 9119ac6..11dd54b 100644
--- a/samples/kprobes/jprobe_example.c
+++ b/samples/kprobes/jprobe_example.c
@@ -23,9 +23,9 @@
  */
 
 /* Proxy routine having the same arguments as actual do_fork() routine */
-static long jdo_fork(unsigned long clone_flags, unsigned long stack_start,
+static long j_do_fork(unsigned long clone_flags, unsigned long stack_start,
  unsigned long stack_size, int __user *parent_tidptr,
- int __user *child_tidptr)
+ int __user *child_tidptr, unsigned long tls)
 {
pr_info("jprobe: clone_flags = 0x%lx, stack_start = 0x%lx "
"stack_size = 0x%lx\n", clone_flags, stack_start, stack_size);
@@ -36,9 +36,9 @@ static long jdo_fork(unsigned long clone_flags, unsigned long 
stack_start,
 }
 
 static struct jprobe my_jprobe = {
-   .entry  = jdo_fork,
+   .entry  = j_do_fork,
.kp = {
-   .symbol_name= "do_fork",
+   .symbol_name= "_do_fork",
},
 };
 
diff --git a/samples/kprobes/kprobe_example.c b/samples/kprobes/kprobe_example.c
index 51d459c..597e101 100644
--- a/samples/kprobes/kprobe_example.c
+++ b/samples/kprobes/kprobe_example.c
@@ -16,7 +16,7 @@
 
 /* For each probe you need to allocate a kprobe structure */
 static struct kprobe kp = {
-   .symbol_name= "do_fork",
+   .symbol_name= "_do_fork",
 };
 
 /* kprobe pre_handler: called just before the probed instruction is executed */
diff --git a/samples/kprobes/kretprobe_example.c 
b/samples/kprobes/kretprobe_example.c
index 1041b67..a270535 100644
--- a/samples/kprobes/kretprobe_example.c
+++ b/samples/kprobes/kretprobe_example.c
@@ -25,7 +25,7 @@
 #include <linux/limits.h>
 #include <linux/sched.h>
 
-static char func_name[NAME_MAX] = "do_fork";
+static char func_name[NAME_MAX] = "_do_fork";
 module_param_string(func, func_name, NAME_MAX, S_IRUGO);
 MODULE_PARM_DESC(func, "Function to kretprobe; this module will report the"
" function's execution time");
-- 
2.1.0



Re: [PATCH v8 7/7] kprobes: Add arm64 case in kprobe example module

2015-08-12 Thread Steve Capper
On 11 August 2015 at 01:52, David Long  wrote:
> From: Sandeepa Prabhu 
>
> Add info prints in sample kprobe handlers for ARM64
>
> Signed-off-by: Sandeepa Prabhu 
> ---
>  samples/kprobes/kprobe_example.c | 8 
>  1 file changed, 8 insertions(+)

I'm not going through this series backwards, but I did run the kprobe
sample modules first, and nothing happened... (i.e. nothing fired).

The kernel usage of do_fork (which is used as an example by the sample
code) has been changed by:
3033f14a clone: support passing tls argument via C rather than pt_regs magic

Now everything appears to go through _do_fork rather than do_fork.

I'll send a fixup shortly, but if anyone else is running these modules
and worrying about a lack of events... worry less :-).

Cheers,
--
Steve


[RFC PATCH] arm64: cpuinfo: Expose MIDR_EL1 and REVIDR_EL1 to sysfs

2015-07-24 Thread Steve Capper
It can be useful for JIT software to be aware of MIDR_EL1 and
REVIDR_EL1 to ascertain the presence of any core errata that could
affect codegen.

This patch exposes these registers through sysfs:

/sys/devices/system/cpu/cpu$ID/identification/midr
/sys/devices/system/cpu/cpu$ID/identification/revidr

where $ID is the cpu number. For big.LITTLE systems, one can have a
mixture of cores (e.g. Cortex A53 and Cortex A57), thus all CPUs need
to be enumerated.

If the kernel does not have valid information to populate these entries
with, an empty string is returned to userspace.

Signed-off-by: Steve Capper 
---

Hello,

This RFC is meant to sit on top of Suzuki's set at:
http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/358990.html

On systems with different core types (for instance big.LITTLE systems),
we need to be *very* careful that the REVIDR and MIDR are both read
from the same core. Thus these registers are exposed in /sys rather
than via MRS emulation.
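
As an illustration of the intended consumer (not part of this patch), a
user-space JIT could read the new file and pull out the architectural
MIDR_EL1 fields; the sysfs path below matches the layout described above,
the rest is example code only:

#include <stdio.h>

int main(void)
{
	unsigned long long midr;
	FILE *f = fopen("/sys/devices/system/cpu/cpu0/identification/midr", "r");

	if (!f)
		return 1;
	if (fscanf(f, "%llx", &midr) != 1) {	/* empty file: no valid data */
		fclose(f);
		return 1;
	}
	fclose(f);

	/* MIDR_EL1: [31:24] implementer, [23:20] variant, [15:4] part, [3:0] rev */
	printf("implementer=0x%02llx variant=0x%llx part=0x%03llx rev=0x%llx\n",
	       (midr >> 24) & 0xff, (midr >> 20) & 0xf,
	       (midr >> 4) & 0xfff, midr & 0xf);
	return 0;
}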

Cheers,
-- 
Steve

---
 arch/arm64/include/asm/cpu.h |  1 +
 arch/arm64/kernel/cpuinfo.c  | 48 
 2 files changed, 49 insertions(+)

diff --git a/arch/arm64/include/asm/cpu.h b/arch/arm64/include/asm/cpu.h
index 17a871d..048b7bf 100644
--- a/arch/arm64/include/asm/cpu.h
+++ b/arch/arm64/include/asm/cpu.h
@@ -187,6 +187,7 @@ struct cpuinfo_arm64 {
u32 reg_cntfrq;
u32 reg_dczid;
u32 reg_midr;
+   u32 reg_revidr;
 
u64 reg_id_aa64dfr0;
u64 reg_id_aa64dfr1;
diff --git a/arch/arm64/kernel/cpuinfo.c b/arch/arm64/kernel/cpuinfo.c
index 678e7f6..d50eda1 100644
--- a/arch/arm64/kernel/cpuinfo.c
+++ b/arch/arm64/kernel/cpuinfo.c
@@ -533,6 +533,7 @@ static void __cpuinfo_store_cpu(struct cpuinfo_arm64 *info)
info->reg_ctr = read_cpuid_cachetype();
info->reg_dczid = read_cpuid(DCZID_EL0);
info->reg_midr = read_cpuid_id();
+   info->reg_revidr = read_cpuid(REVIDR_EL1);
 
info->reg_id_aa64dfr0 = read_cpuid(ID_AA64DFR0_EL1);
info->reg_id_aa64dfr1 = read_cpuid(ID_AA64DFR1_EL1);
@@ -886,3 +887,50 @@ int __init arm64_cpuinfo_init(void)
 }
 
 late_initcall(arm64_cpuinfo_init);
+
+#define CPUINFO_ATTR_RO(_name) 
\
+   static ssize_t show_##_name (struct device *dev,
\
+   struct device_attribute *attr, char *buf)   
\
+   {   
\
+   struct cpuinfo_arm64 *info = &per_cpu(cpu_data, dev->id);   
\
+   
\
+   if (info->reg_midr) 
\
+   return sprintf(buf, "0x%016x\n", info->reg_##_name);
\
+   else
\
+   return 0;   
\
+   }   
\
+   static DEVICE_ATTR(_name, 0444, show_##_name, NULL)
+
+CPUINFO_ATTR_RO(midr);
+CPUINFO_ATTR_RO(revidr);
+
+static struct attribute *cpuregs_attrs[] = {
+   &dev_attr_midr.attr,
+   &dev_attr_revidr.attr,
+   NULL
+};
+
+static struct attribute_group cpuregs_attr_group = {
+   .attrs = cpuregs_attrs,
+   .name = "identification"
+};
+
+static int __init cpuinfo_regs_init(void)
+{
+   int cpu, ret;
+
+   for_each_present_cpu(cpu) {
+   struct device *dev = get_cpu_device(cpu);
+
+   if (!dev)
+   return -1;
+
+   ret = sysfs_create_group(&dev->kobj, &cpuregs_attr_group);
+   if (ret)
+   return ret;
+   }
+
+   return 0;
+}
+
+device_initcall(cpuinfo_regs_init);
-- 
2.1.0



Re: [PATCH v7 5/7] arm64: Add trampoline code for kretprobes

2015-06-30 Thread Steve Capper
On 29 June 2015 at 19:16, William Cohen  wrote:
> On 06/29/2015 01:25 PM, Steve Capper wrote:
>> On 15 June 2015 at 20:07, David Long  wrote:
>>> From: William Cohen 
>>>
>>> The trampoline code is used by kretprobes to capture a return from a probed
>>> function.  This is done by saving the registers, calling the handler, and
>>> restoring the registers.  The code then returns to the original saved caller
>>> return address.  It is necessary to do this directly instead of using a
>>> software breakpoint because the code used in processing that breakpoint
>>> could itself be kprobe'd and cause a problematic reentry into the debug
>>> exception handler.
>>>
>>> Signed-off-by: William Cohen 
>>> Signed-off-by: David A. Long 
>>> ---
>>>  arch/arm64/include/asm/kprobes.h  |  1 +
>>>  arch/arm64/kernel/kprobes-arm64.h | 41 
>>> +++
>>>  arch/arm64/kernel/kprobes.c   | 26 +
>>>  3 files changed, 68 insertions(+)
>>>
>>> diff --git a/arch/arm64/include/asm/kprobes.h 
>>> b/arch/arm64/include/asm/kprobes.h
>>> index af31c4d..d081f49 100644
>>> --- a/arch/arm64/include/asm/kprobes.h
>>> +++ b/arch/arm64/include/asm/kprobes.h
>>> @@ -58,5 +58,6 @@ int kprobe_exceptions_notify(struct notifier_block *self,
>>>  unsigned long val, void *data);
>>>  int kprobe_breakpoint_handler(struct pt_regs *regs, unsigned int esr);
>>>  int kprobe_single_step_handler(struct pt_regs *regs, unsigned int esr);
>>> +void kretprobe_trampoline(void);
>>>
>>>  #endif /* _ARM_KPROBES_H */
>>> diff --git a/arch/arm64/kernel/kprobes-arm64.h 
>>> b/arch/arm64/kernel/kprobes-arm64.h
>>> index ff8a55f..bdcfa62 100644
>>> --- a/arch/arm64/kernel/kprobes-arm64.h
>>> +++ b/arch/arm64/kernel/kprobes-arm64.h
>>> @@ -27,4 +27,45 @@ extern kprobes_pstate_check_t * const 
>>> kprobe_condition_checks[16];
>>>  enum kprobe_insn __kprobes
>>>  arm_kprobe_decode_insn(kprobe_opcode_t insn, struct arch_specific_insn 
>>> *asi);
>>>
>>> +#define SAVE_REGS_STRING\
>>> +   "   stp x0, x1, [sp, #16 * 0]\n"\
>>> +   "   stp x2, x3, [sp, #16 * 1]\n"\
>>> +   "   stp x4, x5, [sp, #16 * 2]\n"\
>>> +   "   stp x6, x7, [sp, #16 * 3]\n"\
>>> +   "   stp x8, x9, [sp, #16 * 4]\n"\
>>> +   "   stp x10, x11, [sp, #16 * 5]\n"  \
>>> +   "   stp x12, x13, [sp, #16 * 6]\n"  \
>>> +   "   stp x14, x15, [sp, #16 * 7]\n"  \
>>> +   "   stp x16, x17, [sp, #16 * 8]\n"  \
>>> +   "   stp x18, x19, [sp, #16 * 9]\n"  \
>>> +   "   stp x20, x21, [sp, #16 * 10]\n" \
>>> +   "   stp x22, x23, [sp, #16 * 11]\n" \
>>> +   "   stp x24, x25, [sp, #16 * 12]\n" \
>>> +   "   stp x26, x27, [sp, #16 * 13]\n" \
>>> +   "   stp x28, x29, [sp, #16 * 14]\n" \
>>> +   "   str x30,   [sp, #16 * 15]\n"\
>>> +   "   mrs x0, nzcv\n" \
>>> +   "   str x0, [sp, #8 * 33]\n"
>>> +
>>> +
>>> +#define RESTORE_REGS_STRING\
>>> +   "   ldr x0, [sp, #8 * 33]\n"\
>>> +   "   msr nzcv, x0\n" \
>>> +   "   ldp x0, x1, [sp, #16 * 0]\n"\
>>> +   "   ldp x2, x3, [sp, #16 * 1]\n"\
>>> +   "   ldp x4, x5, [sp, #16 * 2]\n"\
>>> +   "   ldp x6, x7, [sp, #16 * 3]\n"\
>>> +   "   ldp x8, x9, [sp, #16 * 4]\n"\
>>> +   "   ldp x10, x11, [sp, #16 * 5]\n"  \
>>> +   "   ldp x12, x13, [sp, #16 * 6]\n"  \
>>> +   "   ldp x14, x15, [sp, #16 * 7]\n"  \
>>> +   "   ldp x16, x17, [sp, #16 * 8]\n"  \
>>> +   "   ldp x18, x19, [sp, #16 * 9]\n"  \
>>> +   "   ldp x20, x21, [sp, #16 * 10]\n" \
>>> +   "   ldp x22, x23, [sp, #16 * 11]\n" \
>>> +   "   ldp x24, x25, [sp, #16 * 12]\n" \
>>> +   "   ldp x26, x27, [sp, #16 * 13]\n" \
>>> +   

Re: [PATCH v7 1/7] arm64: Add HAVE_REGS_AND_STACK_ACCESS_API feature

2015-06-30 Thread Steve Capper
On 29 June 2015 at 19:36, David Long  wrote:
> On 06/29/15 13:23, Steve Capper wrote:
>>
>> On 15 June 2015 at 20:07, David Long  wrote:
>>>
>>> From: "David A. Long" 
>>>
>>> Add HAVE_REGS_AND_STACK_ACCESS_API feature for arm64.
>>>
>>> Signed-off-by: David A. Long 
>>> ---
>>>   arch/arm64/Kconfig  |  1 +
>>>   arch/arm64/include/asm/ptrace.h | 25 +
>>>   arch/arm64/kernel/ptrace.c  | 77
>>> +
>>>   3 files changed, 103 insertions(+)
>>>
>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>> index 7796af4..966091f 100644
>>> --- a/arch/arm64/Kconfig
>>> +++ b/arch/arm64/Kconfig
>>> @@ -68,6 +68,7 @@ config ARM64
>>>  select HAVE_PERF_EVENTS
>>>  select HAVE_PERF_REGS
>>>  select HAVE_PERF_USER_STACK_DUMP
>>> +   select HAVE_REGS_AND_STACK_ACCESS_API
>>>  select HAVE_RCU_TABLE_FREE
>>>  select HAVE_SYSCALL_TRACEPOINTS
>>>  select IRQ_DOMAIN
>>> diff --git a/arch/arm64/include/asm/ptrace.h
>>> b/arch/arm64/include/asm/ptrace.h
>>> index d6dd9fd..8f440e9 100644
>>> --- a/arch/arm64/include/asm/ptrace.h
>>> +++ b/arch/arm64/include/asm/ptrace.h
>>> @@ -118,6 +118,8 @@ struct pt_regs {
>>>  u64 syscallno;
>>>   };
>>>
>>> +#define MAX_REG_OFFSET (sizeof(struct user_pt_regs) - sizeof(u64))
>>> +
>>>   #define arch_has_single_step() (1)
>>>
>>>   #ifdef CONFIG_COMPAT
>>> @@ -146,6 +148,29 @@ struct pt_regs {
>>>   #define user_stack_pointer(regs) \
>>>  (!compat_user_mode(regs) ? (regs)->sp : (regs)->compat_sp)
>>>
>>> +/**
>>> + * regs_get_register() - get register value from its offset
>>> + * @regs: pt_regs from which register value is gotten
>>> + * @offset:offset number of the register.
>>> + *
>>> + * regs_get_register returns the value of a register whose offset from
>>> @regs.
>>> + * The @offset is the offset of the register in struct pt_regs.
>>> + * If @offset is bigger than MAX_REG_OFFSET, this returns 0.
>>> + */
>>> +static inline u64 regs_get_register(struct pt_regs *regs,
>>> + unsigned int offset)
>>> +{
>>> +   if (unlikely(offset > MAX_REG_OFFSET))
>>> +   return 0;
>>> +   return *(u64 *)((u64)regs + offset);
>>
>>
>> Why not:
>> return regs->regs[offset];
>>
>
> This would not be correct.  The offset is a byte offset and your code would
> index eight times that amount into the structure.  The offset needs to
> remain a byte offset so architecture-independent code does not need to know
> the architecture-specific layout of the structure.

Ahh, apologies. Thank you, I substituted offset as index in my head.
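
To make the byte-offset contract concrete: callers are expected to pass
offsetof() values, not array indices. A minimal sketch, assuming the
struct pt_regs and regs_get_register() definitions quoted above (kernel
context, so u64 comes from <linux/types.h>; the helper name is made up):

	/* offsetof() already yields a byte offset, so no scaling by
	 * sizeof(u64) is needed inside regs_get_register(). */
	static u64 regs_get_x2(struct pt_regs *regs)
	{
		return regs_get_register(regs,
					 offsetof(struct pt_regs, regs[2]));
	}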

>
>
>>> +}
>>> +
>>> +/* Valid only for Kernel mode traps. */
>>> +static inline unsigned long kernel_stack_pointer(struct pt_regs *regs)
>>> +{
>>> +   return regs->sp;
>>> +}
>>> +
>>>   static inline unsigned long regs_return_value(struct pt_regs *regs)
>>>   {
>>>  return regs->regs[0];
>>> diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
>>> index d882b83..f6199a5 100644
>>> --- a/arch/arm64/kernel/ptrace.c
>>> +++ b/arch/arm64/kernel/ptrace.c
>>> @@ -48,6 +48,83 @@
>>>   #define CREATE_TRACE_POINTS
>>>   #include 
>>>
>>> +#define ARM_pstate pstate
>>> +#define ARM_pc pc
>>> +#define ARM_sp sp
>>> +#define ARM_x30regs[30]
>>> +#define ARM_x29regs[29]
>>> +#define ARM_x28regs[28]
>>> +#define ARM_x27regs[27]
>>> +#define ARM_x26regs[26]
>>> +#define ARM_x25regs[25]
>>> +#define ARM_x24regs[24]
>>> +#define ARM_x23regs[23]
>>> +#define ARM_x22regs[22]
>>> +#define ARM_x21regs[21]
>>> +#define ARM_x20regs[20]
>>> +#define ARM_x19regs[19]
>>> +#define ARM_x18regs[18]
>>> +#define ARM_x17regs[17]
>>> +#define ARM_x16regs[1

Re: [PATCH v7 1/7] arm64: Add HAVE_REGS_AND_STACK_ACCESS_API feature

2015-06-30 Thread Steve Capper
On 29 June 2015 at 19:36, David Long dave.l...@linaro.org wrote:
 On 06/29/15 13:23, Steve Capper wrote:

 On 15 June 2015 at 20:07, David Long dave.l...@linaro.org wrote:

 From: David A. Long dave.l...@linaro.org

 Add HAVE_REGS_AND_STACK_ACCESS_API feature for arm64.

 Signed-off-by: David A. Long dave.l...@linaro.org
 ---
   arch/arm64/Kconfig  |  1 +
   arch/arm64/include/asm/ptrace.h | 25 +
   arch/arm64/kernel/ptrace.c  | 77
 +
   3 files changed, 103 insertions(+)

 diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
 index 7796af4..966091f 100644
 --- a/arch/arm64/Kconfig
 +++ b/arch/arm64/Kconfig
 @@ -68,6 +68,7 @@ config ARM64
  select HAVE_PERF_EVENTS
  select HAVE_PERF_REGS
  select HAVE_PERF_USER_STACK_DUMP
 +   select HAVE_REGS_AND_STACK_ACCESS_API
  select HAVE_RCU_TABLE_FREE
  select HAVE_SYSCALL_TRACEPOINTS
  select IRQ_DOMAIN
 diff --git a/arch/arm64/include/asm/ptrace.h
 b/arch/arm64/include/asm/ptrace.h
 index d6dd9fd..8f440e9 100644
 --- a/arch/arm64/include/asm/ptrace.h
 +++ b/arch/arm64/include/asm/ptrace.h
 @@ -118,6 +118,8 @@ struct pt_regs {
  u64 syscallno;
   };

 +#define MAX_REG_OFFSET (sizeof(struct user_pt_regs) - sizeof(u64))
 +
   #define arch_has_single_step() (1)

   #ifdef CONFIG_COMPAT
 @@ -146,6 +148,29 @@ struct pt_regs {
   #define user_stack_pointer(regs) \
  (!compat_user_mode(regs) ? (regs)-sp : (regs)-compat_sp)

 +/**
 + * regs_get_register() - get register value from its offset
 + * @regs: pt_regs from which register value is gotten
 + * @offset:offset number of the register.
 + *
 + * regs_get_register returns the value of a register whose offset from
 @regs.
 + * The @offset is the offset of the register in struct pt_regs.
 + * If @offset is bigger than MAX_REG_OFFSET, this returns 0.
 + */
 +static inline u64 regs_get_register(struct pt_regs *regs,
 + unsigned int offset)
 +{
 +   if (unlikely(offset  MAX_REG_OFFSET))
 +   return 0;
 +   return *(u64 *)((u64)regs + offset);


 Why not:
 return regs-regs[offset];


 This would not be correct.  The offset is a byte offset and your code would
 index eight times that amount into the structure.  The offset needs to
 remain a byte offset so architecture-independent code does not need to know
 the architecture-specific layout of the structure.

Ahh, apologies. Thank you, I substituted offset as index in my head.



 +}
 +
 +/* Valid only for Kernel mode traps. */
 +static inline unsigned long kernel_stack_pointer(struct pt_regs *regs)
 +{
 +   return regs-sp;
 +}
 +
   static inline unsigned long regs_return_value(struct pt_regs *regs)
   {
  return regs-regs[0];
 diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
 index d882b83..f6199a5 100644
 --- a/arch/arm64/kernel/ptrace.c
 +++ b/arch/arm64/kernel/ptrace.c
 @@ -48,6 +48,83 @@
   #define CREATE_TRACE_POINTS
   #include trace/events/syscalls.h

 +#define ARM_pstate pstate
 +#define ARM_pc pc
 +#define ARM_sp sp
 +#define ARM_x30regs[30]
 +#define ARM_x29regs[29]
 +#define ARM_x28regs[28]
 +#define ARM_x27regs[27]
 +#define ARM_x26regs[26]
 +#define ARM_x25regs[25]
 +#define ARM_x24regs[24]
 +#define ARM_x23regs[23]
 +#define ARM_x22regs[22]
 +#define ARM_x21regs[21]
 +#define ARM_x20regs[20]
 +#define ARM_x19regs[19]
 +#define ARM_x18regs[18]
 +#define ARM_x17regs[17]
 +#define ARM_x16regs[16]
 +#define ARM_x15regs[15]
 +#define ARM_x14regs[14]
 +#define ARM_x13regs[13]
 +#define ARM_x12regs[12]
 +#define ARM_x11regs[11]
 +#define ARM_x10regs[10]
 +#define ARM_x9 regs[9]
 +#define ARM_x8 regs[8]
 +#define ARM_x7 regs[7]
 +#define ARM_x6 regs[6]
 +#define ARM_x5 regs[5]
 +#define ARM_x4 regs[4]
 +#define ARM_x3 regs[3]
 +#define ARM_x2 regs[2]
 +#define ARM_x1 regs[1]
 +#define ARM_x0 regs[0]
 +
 +#define REG_OFFSET_NAME(r) \
 +   {.name = #r, .offset = offsetof(struct pt_regs, ARM_##r)}
 +#define REG_OFFSET_END {.name = NULL, .offset = 0}
 +
 +const struct pt_regs_offset regs_offset_table[] = {
 +   REG_OFFSET_NAME(x0),
 +   REG_OFFSET_NAME(x1),
 +   REG_OFFSET_NAME(x2),
 +   REG_OFFSET_NAME(x3),
 +   REG_OFFSET_NAME(x4),
 +   REG_OFFSET_NAME(x5),
 +   REG_OFFSET_NAME(x6),
 +   REG_OFFSET_NAME(x7),
 +   REG_OFFSET_NAME(x8),
 +   REG_OFFSET_NAME(x9),
 +   REG_OFFSET_NAME(x10),
 +   REG_OFFSET_NAME(x11),
 +   REG_OFFSET_NAME(x12

Re: [PATCH v7 5/7] arm64: Add trampoline code for kretprobes

2015-06-30 Thread Steve Capper
On 29 June 2015 at 19:16, William Cohen wco...@redhat.com wrote:
 On 06/29/2015 01:25 PM, Steve Capper wrote:
 On 15 June 2015 at 20:07, David Long dave.l...@linaro.org wrote:
 From: William Cohen wco...@redhat.com

 The trampoline code is used by kretprobes to capture a return from a probed
 function.  This is done by saving the registers, calling the handler, and
 restoring the registers.  The code then returns to the original saved caller
 return address.  It is necessary to do this directly instead of using a
 software breakpoint because the code used in processing that breakpoint
 could itself be kprobe'd and cause a problematic reentry into the debug
 exception handler.

 Signed-off-by: William Cohen wco...@redhat.com
 Signed-off-by: David A. Long dave.l...@linaro.org
 ---
  arch/arm64/include/asm/kprobes.h  |  1 +
  arch/arm64/kernel/kprobes-arm64.h | 41 
 +++
  arch/arm64/kernel/kprobes.c   | 26 +
  3 files changed, 68 insertions(+)

 diff --git a/arch/arm64/include/asm/kprobes.h 
 b/arch/arm64/include/asm/kprobes.h
 index af31c4d..d081f49 100644
 --- a/arch/arm64/include/asm/kprobes.h
 +++ b/arch/arm64/include/asm/kprobes.h
 @@ -58,5 +58,6 @@ int kprobe_exceptions_notify(struct notifier_block *self,
  unsigned long val, void *data);
  int kprobe_breakpoint_handler(struct pt_regs *regs, unsigned int esr);
  int kprobe_single_step_handler(struct pt_regs *regs, unsigned int esr);
 +void kretprobe_trampoline(void);

  #endif /* _ARM_KPROBES_H */
 diff --git a/arch/arm64/kernel/kprobes-arm64.h 
 b/arch/arm64/kernel/kprobes-arm64.h
 index ff8a55f..bdcfa62 100644
 --- a/arch/arm64/kernel/kprobes-arm64.h
 +++ b/arch/arm64/kernel/kprobes-arm64.h
 @@ -27,4 +27,45 @@ extern kprobes_pstate_check_t * const 
 kprobe_condition_checks[16];
  enum kprobe_insn __kprobes
  arm_kprobe_decode_insn(kprobe_opcode_t insn, struct arch_specific_insn 
 *asi);

 +#define SAVE_REGS_STRING\
 +  stp x0, x1, [sp, #16 * 0]\n\
 +  stp x2, x3, [sp, #16 * 1]\n\
 +  stp x4, x5, [sp, #16 * 2]\n\
 +  stp x6, x7, [sp, #16 * 3]\n\
 +  stp x8, x9, [sp, #16 * 4]\n\
 +  stp x10, x11, [sp, #16 * 5]\n  \
 +  stp x12, x13, [sp, #16 * 6]\n  \
 +  stp x14, x15, [sp, #16 * 7]\n  \
 +  stp x16, x17, [sp, #16 * 8]\n  \
 +  stp x18, x19, [sp, #16 * 9]\n  \
 +  stp x20, x21, [sp, #16 * 10]\n \
 +  stp x22, x23, [sp, #16 * 11]\n \
 +  stp x24, x25, [sp, #16 * 12]\n \
 +  stp x26, x27, [sp, #16 * 13]\n \
 +  stp x28, x29, [sp, #16 * 14]\n \
 +  str x30,   [sp, #16 * 15]\n\
 +  mrs x0, nzcv\n \
 +  str x0, [sp, #8 * 33]\n
 +
 +
 +#define RESTORE_REGS_STRING\
 +  ldr x0, [sp, #8 * 33]\n\
 +  msr nzcv, x0\n \
 +  ldp x0, x1, [sp, #16 * 0]\n\
 +  ldp x2, x3, [sp, #16 * 1]\n\
 +  ldp x4, x5, [sp, #16 * 2]\n\
 +  ldp x6, x7, [sp, #16 * 3]\n\
 +  ldp x8, x9, [sp, #16 * 4]\n\
 +  ldp x10, x11, [sp, #16 * 5]\n  \
 +  ldp x12, x13, [sp, #16 * 6]\n  \
 +  ldp x14, x15, [sp, #16 * 7]\n  \
 +  ldp x16, x17, [sp, #16 * 8]\n  \
 +  ldp x18, x19, [sp, #16 * 9]\n  \
 +  ldp x20, x21, [sp, #16 * 10]\n \
 +  ldp x22, x23, [sp, #16 * 11]\n \
 +  ldp x24, x25, [sp, #16 * 12]\n \
 +  ldp x26, x27, [sp, #16 * 13]\n \
 +  ldp x28, x29, [sp, #16 * 14]\n \
 +  ldr x30,   [sp, #16 * 15]\n

 Do we need to restore x19..x28 as they are callee-saved?

 Hi Steve,

 The goal was to make the trampoline not affect the values in any of the 
 registers, so if the calling conventions ever change the code will still 
 work. Figured it was safer and clearer just to save everything rather than 
 assuming that the compiler's code generated for trampoline_probe_handler is 
 going to save certain registers.


 Okay this all matches up with the definitions of the pt_regs struct.
 So regs-regs[xn] are all set as is regs-pstate.

 The hard coded constant offsets make me nervous though, as does the
 uncertain state of the other elements of the pt_regs struct.

 The macros in this patch are modelled after the kernel_entry and kernel_exit 
 macros in arch/arm64/kernel/entry.S.  What other elements of the pt_regs 
 struct are of concern? The sp value will be unchanged and the pc value is 
 going to be overwritten in the handler.  Concerned about some portion of the 
 pstate (#8 * 33) not being saved/restored?

pstate looks good to me. I was just worried that sp, pc, orig_x0 and
syscallno may have uncertain values as their containing structure is
hosted on the stack. It's probably me just being overly paranoid.
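
If the magic numbers are the main worry, one option (just a sketch, not
something in this series, and the function name is made up) would be to
pin them to the structure layout at build time:

	/* Tie the hand-coded trampoline offsets to the pt_regs layout so
	 * a future layout change breaks the build instead of the probe.
	 * Offsets assume the quoted SAVE/RESTORE_REGS_STRING usage. */
	static void __init check_kretprobe_trampoline_offsets(void)
	{
		BUILD_BUG_ON(offsetof(struct pt_regs, regs[0])  != 16 * 0);
		BUILD_BUG_ON(offsetof(struct pt_regs, regs[30]) != 16 * 15);
		BUILD_BUG_ON(offsetof(struct pt_regs, pstate)   != 8 * 33);
	}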

Re: [PATCH v7 5/7] arm64: Add trampoline code for kretprobes

2015-06-29 Thread Steve Capper
On 15 June 2015 at 20:07, David Long  wrote:
> From: William Cohen 
>
> The trampoline code is used by kretprobes to capture a return from a probed
> function.  This is done by saving the registers, calling the handler, and
> restoring the registers.  The code then returns to the original saved caller
> return address.  It is necessary to do this directly instead of using a
> software breakpoint because the code used in processing that breakpoint
> could itself be kprobe'd and cause a problematic reentry into the debug
> exception handler.
>
> Signed-off-by: William Cohen 
> Signed-off-by: David A. Long 
> ---
>  arch/arm64/include/asm/kprobes.h  |  1 +
>  arch/arm64/kernel/kprobes-arm64.h | 41 
> +++
>  arch/arm64/kernel/kprobes.c   | 26 +
>  3 files changed, 68 insertions(+)
>
> diff --git a/arch/arm64/include/asm/kprobes.h 
> b/arch/arm64/include/asm/kprobes.h
> index af31c4d..d081f49 100644
> --- a/arch/arm64/include/asm/kprobes.h
> +++ b/arch/arm64/include/asm/kprobes.h
> @@ -58,5 +58,6 @@ int kprobe_exceptions_notify(struct notifier_block *self,
>  unsigned long val, void *data);
>  int kprobe_breakpoint_handler(struct pt_regs *regs, unsigned int esr);
>  int kprobe_single_step_handler(struct pt_regs *regs, unsigned int esr);
> +void kretprobe_trampoline(void);
>
>  #endif /* _ARM_KPROBES_H */
> diff --git a/arch/arm64/kernel/kprobes-arm64.h 
> b/arch/arm64/kernel/kprobes-arm64.h
> index ff8a55f..bdcfa62 100644
> --- a/arch/arm64/kernel/kprobes-arm64.h
> +++ b/arch/arm64/kernel/kprobes-arm64.h
> @@ -27,4 +27,45 @@ extern kprobes_pstate_check_t * const 
> kprobe_condition_checks[16];
>  enum kprobe_insn __kprobes
>  arm_kprobe_decode_insn(kprobe_opcode_t insn, struct arch_specific_insn *asi);
>
> +#define SAVE_REGS_STRING\
> +   "   stp x0, x1, [sp, #16 * 0]\n"\
> +   "   stp x2, x3, [sp, #16 * 1]\n"\
> +   "   stp x4, x5, [sp, #16 * 2]\n"\
> +   "   stp x6, x7, [sp, #16 * 3]\n"\
> +   "   stp x8, x9, [sp, #16 * 4]\n"\
> +   "   stp x10, x11, [sp, #16 * 5]\n"  \
> +   "   stp x12, x13, [sp, #16 * 6]\n"  \
> +   "   stp x14, x15, [sp, #16 * 7]\n"  \
> +   "   stp x16, x17, [sp, #16 * 8]\n"  \
> +   "   stp x18, x19, [sp, #16 * 9]\n"  \
> +   "   stp x20, x21, [sp, #16 * 10]\n" \
> +   "   stp x22, x23, [sp, #16 * 11]\n" \
> +   "   stp x24, x25, [sp, #16 * 12]\n" \
> +   "   stp x26, x27, [sp, #16 * 13]\n" \
> +   "   stp x28, x29, [sp, #16 * 14]\n" \
> +   "   str x30,   [sp, #16 * 15]\n"\
> +   "   mrs x0, nzcv\n" \
> +   "   str x0, [sp, #8 * 33]\n"
> +
> +
> +#define RESTORE_REGS_STRING\
> +   "   ldr x0, [sp, #8 * 33]\n"\
> +   "   msr nzcv, x0\n" \
> +   "   ldp x0, x1, [sp, #16 * 0]\n"\
> +   "   ldp x2, x3, [sp, #16 * 1]\n"\
> +   "   ldp x4, x5, [sp, #16 * 2]\n"\
> +   "   ldp x6, x7, [sp, #16 * 3]\n"\
> +   "   ldp x8, x9, [sp, #16 * 4]\n"\
> +   "   ldp x10, x11, [sp, #16 * 5]\n"  \
> +   "   ldp x12, x13, [sp, #16 * 6]\n"  \
> +   "   ldp x14, x15, [sp, #16 * 7]\n"  \
> +   "   ldp x16, x17, [sp, #16 * 8]\n"  \
> +   "   ldp x18, x19, [sp, #16 * 9]\n"  \
> +   "   ldp x20, x21, [sp, #16 * 10]\n" \
> +   "   ldp x22, x23, [sp, #16 * 11]\n" \
> +   "   ldp x24, x25, [sp, #16 * 12]\n" \
> +   "   ldp x26, x27, [sp, #16 * 13]\n" \
> +   "   ldp x28, x29, [sp, #16 * 14]\n" \
> +   "   ldr x30,   [sp, #16 * 15]\n"

Do we need to restore x19..x28 as they are callee-saved?

Okay this all matches up with the definitions of the pt_regs struct.
So regs->regs[xn] are all set as is regs->pstate.

The hard coded constant offsets make me nervous though, as does the
uncertain state of the other elements of the pt_regs struct.

> +
>  #endif /* _ARM_KERNEL_KPROBES_ARM64_H */
> diff --git a/arch/arm64/kernel/kprobes.c b/arch/arm64/kernel/kprobes.c
> index 6255814..570218c 100644
> --- a/arch/arm64/kernel/kprobes.c
> +++ b/arch/arm64/kernel/kprobes.c
> @@ -560,6 +560,32 @@ int __kprobes longjmp_break_handler(struct kprobe *p, 
> struct pt_regs *regs)
> return 0;
>  }
>
> +/*
> + * When a retprobed function returns, this code saves registers and
> + * calls trampoline_handler(), which calls the kretprobe's handler.
> + */
> +static void __used __kprobes kretprobe_trampoline_holder(void)
> +{
> +   asm volatile (".global kretprobe_trampoline\n"
> +   "kretprobe_trampoline:\n"
> +   "sub sp, sp, %0\n"
> +   SAVE_REGS_STRING
> +   "mov x0, sp\n"
> +   "bl trampoline_probe_handler\n"
> +   /* Replace trampoline address in lr 

Re: [PATCH v7 4/7] arm64: kprobes instruction simulation support

2015-06-29 Thread Steve Capper
On 15 June 2015 at 20:07, David Long  wrote:
> From: Sandeepa Prabhu 
>
> Kprobes needs simulation of instructions that cannot be stepped
> from different memory location, e.g.: those instructions
> that uses PC-relative addressing. In simulation, the behaviour
> of the instruction is implemented using a copy of pt_regs.
>
> Following instruction categories are simulated:
>  - All branching instructions(conditional, register, and immediate)
>  - Literal access instructions(load-literal, adr/adrp)
>
> Conditional execution is limited to branching instructions in
> ARM v8. If conditions at PSTATE do not match the condition fields
> of opcode, the instruction is effectively NOP. Kprobes considers
> this case as 'miss'.
>
> Thanks to Will Cohen for assorted suggested changes.
>
> Signed-off-by: Sandeepa Prabhu 
> Signed-off-by: William Cohen 
> Signed-off-by: David A. Long 
> ---
>  arch/arm64/kernel/Makefile   |   4 +-
>  arch/arm64/kernel/kprobes-arm64.c|  98 +
>  arch/arm64/kernel/kprobes-arm64.h|   2 +
>  arch/arm64/kernel/kprobes.c  |  35 ++-
>  arch/arm64/kernel/probes-condn-check.c   | 122 ++
>  arch/arm64/kernel/probes-simulate-insn.c | 174 
> +++
>  arch/arm64/kernel/probes-simulate-insn.h |  33 ++
>  7 files changed, 464 insertions(+), 4 deletions(-)
>  create mode 100644 arch/arm64/kernel/probes-condn-check.c
>  create mode 100644 arch/arm64/kernel/probes-simulate-insn.c
>  create mode 100644 arch/arm64/kernel/probes-simulate-insn.h
>
> diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
> index 1319872..5e9d54f 100644
> --- a/arch/arm64/kernel/Makefile
> +++ b/arch/arm64/kernel/Makefile
> @@ -32,7 +32,9 @@ arm64-obj-$(CONFIG_CPU_PM)+= sleep.o suspend.o
>  arm64-obj-$(CONFIG_CPU_IDLE)   += cpuidle.o
>  arm64-obj-$(CONFIG_JUMP_LABEL) += jump_label.o
>  arm64-obj-$(CONFIG_KGDB)   += kgdb.o
> -arm64-obj-$(CONFIG_KPROBES)+= kprobes.o kprobes-arm64.o
> +arm64-obj-$(CONFIG_KPROBES)+= kprobes.o kprobes-arm64.o  
>   \
> +  probes-simulate-insn.o 
>   \
> +  probes-condn-check.o
>  arm64-obj-$(CONFIG_EFI)+= efi.o efi-stub.o 
> efi-entry.o
>  arm64-obj-$(CONFIG_PCI)+= pci.o
>  arm64-obj-$(CONFIG_ARMV8_DEPRECATED)   += armv8_deprecated.o
> diff --git a/arch/arm64/kernel/kprobes-arm64.c 
> b/arch/arm64/kernel/kprobes-arm64.c
> index f958c52..8a7e6b0 100644
> --- a/arch/arm64/kernel/kprobes-arm64.c
> +++ b/arch/arm64/kernel/kprobes-arm64.c
> @@ -20,6 +20,76 @@
>  #include 
>
>  #include "kprobes-arm64.h"
> +#include "probes-simulate-insn.h"
> +
> +/*
> + * condition check functions for kprobes simulation
> + */
> +static unsigned long __kprobes
> +__check_pstate(struct kprobe *p, struct pt_regs *regs)
> +{
> +   struct arch_specific_insn *asi = >ainsn;
> +   unsigned long pstate = regs->pstate & 0x;
> +
> +   return asi->pstate_cc(pstate);
> +}
> +
> +static unsigned long __kprobes
> +__check_cbz(struct kprobe *p, struct pt_regs *regs)
> +{
> +   return check_cbz((u32)p->opcode, regs);
> +}
> +
> +static unsigned long __kprobes
> +__check_cbnz(struct kprobe *p, struct pt_regs *regs)
> +{
> +   return check_cbnz((u32)p->opcode, regs);
> +}
> +
> +static unsigned long __kprobes
> +__check_tbz(struct kprobe *p, struct pt_regs *regs)
> +{
> +   return check_tbz((u32)p->opcode, regs);
> +}
> +
> +static unsigned long __kprobes
> +__check_tbnz(struct kprobe *p, struct pt_regs *regs)
> +{
> +   return check_tbnz((u32)p->opcode, regs);
> +}
> +
> +/*
> + * prepare functions for instruction simulation
> + */
> +static void __kprobes
> +prepare_none(struct kprobe *p, struct arch_specific_insn *asi)
> +{
> +}
> +
> +static void __kprobes
> +prepare_bcond(struct kprobe *p, struct arch_specific_insn *asi)
> +{
> +   kprobe_opcode_t insn = p->opcode;
> +
> +   asi->check_condn = __check_pstate;
> +   asi->pstate_cc = kprobe_condition_checks[insn & 0xf];
> +}
> +
> +static void __kprobes
> +prepare_cbz_cbnz(struct kprobe *p, struct arch_specific_insn *asi)
> +{
> +   kprobe_opcode_t insn = p->opcode;
> +
> +   asi->check_condn = (insn & (1 << 24)) ? __check_cbnz : __check_cbz;
> +}
> +
> +static void __kprobes
> +prepare_tbz_tbnz(struct kprobe *p, struct arch_specific_insn *asi)
> +{
> +   kprobe_opcode_t insn = p->opcode;
> +
> +   asi->check_condn = (insn & (1 << 24)) ? __check_tbnz : __check_tbz;
> +}
>
>  static bool __kprobes aarch64_insn_is_steppable(u32 insn)
>  {
> @@ -63,6 +133,34 @@ arm_kprobe_decode_insn(kprobe_opcode_t insn, struct 
> arch_specific_insn *asi)
>  */
> if (aarch64_insn_is_steppable(insn))
> return INSN_GOOD;
> +
> +   asi->prepare = prepare_none;
> +
> +   if 

Re: [PATCH v7 2/7] arm64: Add more test functions to insn.c

2015-06-29 Thread Steve Capper
Hi David,
Some comments below.

On 15 June 2015 at 20:07, David Long  wrote:
> From: "David A. Long" 
>
> Certain instructions are hard to execute correctly out-of-line (as in
> kprobes).  Test functions are added to insn.[hc] to identify these.  The
> instructions include any that use PC-relative addressing, change the PC,
> or change interrupt masking. For efficiency and simplicity test
> functions are also added for small collections of related instructions.
>
> Signed-off-by: David A. Long 
> ---
>  arch/arm64/include/asm/insn.h | 18 ++
>  arch/arm64/kernel/insn.c  | 28 
>  2 files changed, 46 insertions(+)
>
> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
> index f81b328..1fdd237 100644
> --- a/arch/arm64/include/asm/insn.h
> +++ b/arch/arm64/include/asm/insn.h
> @@ -223,8 +223,13 @@ static __always_inline bool aarch64_insn_is_##abbr(u32 
> code) \
>  static __always_inline u32 aarch64_insn_get_##abbr##_value(void) \
>  { return (val); }
>
> +__AARCH64_INSN_FUNCS(adr_adrp, 0x1F00, 0x1000)
> +__AARCH64_INSN_FUNCS(prfm_lit, 0xFF00, 0xD800)
>  __AARCH64_INSN_FUNCS(str_reg,  0x3FE0EC00, 0x38206800)
>  __AARCH64_INSN_FUNCS(ldr_reg,  0x3FE0EC00, 0x38606800)
> +__AARCH64_INSN_FUNCS(ldr_lit,  0xBF00, 0x1800)
> +__AARCH64_INSN_FUNCS(ldrsw_lit,0xFF00, 0x9800)
> +__AARCH64_INSN_FUNCS(exclusive,0x3F00, 0x0800)

Going one step back, if we're worried about the exclusive monitors
then we'll be worried about instructions in-between the monitor pairs
too?
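
For reference, the classic failure mode with exclusive sequences looks
like the sketch below (illustrative only, modelled on the kernel's ll/sc
atomics): a BRK planted between the paired accesses takes an exception,
the exception entry/return clears the exclusive monitor, and the
store-exclusive then never succeeds, so the loop livelocks.

	static inline void add_sketch(unsigned long inc, unsigned long *ptr)
	{
		unsigned long tmp, res;

		asm volatile(
		"1:	ldxr	%0, %2\n"	/* opens the exclusive access   */
		"	add	%0, %0, %3\n"	/* <- probing here is the risk  */
		"	stxr	%w1, %0, %2\n"	/* fails if the monitor is lost */
		"	cbnz	%w1, 1b\n"
		: "=&r" (tmp), "=&r" (res), "+Q" (*ptr)
		: "r" (inc)
		: "memory");
	}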


>  __AARCH64_INSN_FUNCS(stp_post, 0x7FC0, 0x2880)
>  __AARCH64_INSN_FUNCS(ldp_post, 0x7FC0, 0x28C0)
>  __AARCH64_INSN_FUNCS(stp_pre,  0x7FC0, 0x2980)
> @@ -264,19 +269,29 @@ __AARCH64_INSN_FUNCS(ands,0x7F20, 
> 0x6A00)
>  __AARCH64_INSN_FUNCS(bics, 0x7F20, 0x6A20)
>  __AARCH64_INSN_FUNCS(b,0xFC00, 0x1400)
>  __AARCH64_INSN_FUNCS(bl,   0xFC00, 0x9400)
> +__AARCH64_INSN_FUNCS(b_bl, 0x7C00, 0x1400)
> +__AARCH64_INSN_FUNCS(cb,   0x7E00, 0x3400)
>  __AARCH64_INSN_FUNCS(cbz,  0x7F00, 0x3400)
>  __AARCH64_INSN_FUNCS(cbnz, 0x7F00, 0x3500)
> +__AARCH64_INSN_FUNCS(tb,   0x7E00, 0x3600)
>  __AARCH64_INSN_FUNCS(tbz,  0x7F00, 0x3600)
>  __AARCH64_INSN_FUNCS(tbnz, 0x7F00, 0x3700)
> +__AARCH64_INSN_FUNCS(b_bl_cb_tb, 0x5C00, 0x1400)
>  __AARCH64_INSN_FUNCS(bcond,0xFF10, 0x5400)
>  __AARCH64_INSN_FUNCS(svc,  0xFFE0001F, 0xD401)
>  __AARCH64_INSN_FUNCS(hvc,  0xFFE0001F, 0xD402)
>  __AARCH64_INSN_FUNCS(smc,  0xFFE0001F, 0xD403)
>  __AARCH64_INSN_FUNCS(brk,  0xFFE0001F, 0xD420)
> +__AARCH64_INSN_FUNCS(exception,0xFF00, 0xD400)
>  __AARCH64_INSN_FUNCS(hint, 0xF01F, 0xD503201F)
>  __AARCH64_INSN_FUNCS(br,   0xFC1F, 0xD61F)
>  __AARCH64_INSN_FUNCS(blr,  0xFC1F, 0xD63F)
> +__AARCH64_INSN_FUNCS(br_blr,   0xFFDFFC1F, 0xD61F)
>  __AARCH64_INSN_FUNCS(ret,  0xFC1F, 0xD65F)
> +__AARCH64_INSN_FUNCS(msr_imm,  0xFFF8F000, 0xD5004000)

Should this not be:
__AARCH64_INSN_FUNCS(msr_imm,  0xFFF8F01F, 0xD500401F)
As the lower 5 bits of an MSR (immediate) are all 1?
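
Since the generated helper is just a mask/value compare, the wider mask
also pins Rt to 0b11111, which MSR (immediate) always encodes. A sketch
of what the suggested encoding boils down to (worth double-checking
against the ARM ARM; the helper name is made up):

	static inline bool insn_is_msr_imm(u32 insn)
	{
		return (insn & 0xFFF8F01F) == 0xD500401F;
	}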

> +__AARCH64_INSN_FUNCS(msr_reg,  0xFFF0, 0xD510)
> +__AARCH64_INSN_FUNCS(set_clr_daif, 0xF0DF, 0xD50340DF)

Looks good, just an MSR immediate with either DAIFSet or DAIFClr.

> +__AARCH64_INSN_FUNCS(rd_wr_daif, 0xFFDFFFE0, 0xD51B4220)

Looks good, either MRS or MSR (register) where systemreg = DAIF.

>
>  #undef __AARCH64_INSN_FUNCS
>
> @@ -285,6 +300,9 @@ bool aarch64_insn_is_nop(u32 insn);
>  int aarch64_insn_read(void *addr, u32 *insnp);
>  int aarch64_insn_write(void *addr, u32 insn);
>  enum aarch64_insn_encoding_class aarch64_get_insn_class(u32 insn);
> +bool aarch64_insn_uses_literal(u32 insn);
> +bool aarch64_insn_is_branch(u32 insn);
> +bool aarch64_insn_is_daif_access(u32 insn);
>  u64 aarch64_insn_decode_immediate(enum aarch64_insn_imm_type type, u32 insn);
>  u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
>   u32 insn, u64 imm);
> diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c
> index 9249020..ecd8882 100644
> --- a/arch/arm64/kernel/insn.c
> +++ b/arch/arm64/kernel/insn.c
> @@ -155,6 +155,34 @@ static bool __kprobes __aarch64_insn_hotpatch_safe(u32 
> insn)
> aarch64_insn_is_nop(insn);
>  }
>
> +bool __kprobes aarch64_insn_uses_literal(u32 insn)
> +{
> +   /* ldr/ldrsw (literal), prfm */
> +
> +   return aarch64_insn_is_ldr_lit(insn) ||
> +   aarch64_insn_is_ldrsw_lit(insn) ||
> +   aarch64_insn_is_adr_adrp(insn) ||
> +   aarch64_insn_is_prfm_lit(insn);
> +}
> +
> +bool __kprobes aarch64_insn_is_branch(u32 insn)
> +{
> +   /* b, bl, cb*, tb*, b.cond, br, blr */
> +
> +  

Re: [PATCH v7 3/7] arm64: Kprobes with single stepping support

2015-06-29 Thread Steve Capper
On 15 June 2015 at 20:07, David Long  wrote:
> From: Sandeepa Prabhu 
>
> Add support for basic kernel probes(kprobes) and jump probes
> (jprobes) for ARM64.
>
> Kprobes utilizes software breakpoint and single step debug
> exceptions supported on ARM v8.
>
> A software breakpoint is placed at the probe address to trap the
> kernel execution into the kprobe handler.
>
> ARM v8 supports enabling single stepping before the break exception
> return (ERET), with next PC in exception return address (ELR_EL1). The
> kprobe handler prepares an executable memory slot for out-of-line
> execution with a copy of the original instruction being probed, and
> enables single stepping. The PC is set to the out-of-line slot address
> before the ERET. With this scheme, the instruction is executed with the
> exact same register context except for the PC (and DAIF) registers.
>
> Debug mask (PSTATE.D) is enabled only when single stepping a recursive
> kprobe, e.g.: during kprobes reenter so that probed instruction can be
> single stepped within the kprobe handler -exception- context.
> The recursion depth of kprobe is always 2, i.e. upon probe re-entry,
> any further re-entry is prevented by not calling handlers and the case
> counted as a missed kprobe).
>
> Single stepping from the x-o-l slot has a drawback for PC-relative accesses
> like branching and symbolic literals access as the offset from the new PC
> (slot address) may not be ensured to fit in the immediate value of
> the opcode. Such instructions need simulation, so reject
> probing them.
>
> Instructions generating exceptions or cpu mode change are rejected
> for probing.
>
> Instructions using Exclusive Monitor are rejected too.
>
> System instructions are mostly enabled for stepping, except MSR/MRS
> accesses to "DAIF" flags in PSTATE, which are not safe for
> probing.
>
> Thanks to Steve Capper and Pratyush Anand for several suggested
> Changes.
>
> Signed-off-by: Sandeepa Prabhu 
> Signed-off-by: Steve Capper 
> Signed-off-by: David A. Long 
> ---
>  arch/arm64/Kconfig  |   1 +
>  arch/arm64/include/asm/debug-monitors.h |   5 +
>  arch/arm64/include/asm/kprobes.h|  62 
>  arch/arm64/include/asm/probes.h |  50 +++
>  arch/arm64/include/asm/ptrace.h |   3 +-
>  arch/arm64/kernel/Makefile  |   1 +
>  arch/arm64/kernel/debug-monitors.c  |  35 ++-
>  arch/arm64/kernel/kprobes-arm64.c   |  68 
>  arch/arm64/kernel/kprobes-arm64.h   |  28 ++
>  arch/arm64/kernel/kprobes.c | 537 
> 
>  arch/arm64/kernel/kprobes.h |  24 ++
>  arch/arm64/kernel/vmlinux.lds.S |   1 +
>  arch/arm64/mm/fault.c   |  25 ++
>  13 files changed, 829 insertions(+), 11 deletions(-)
>  create mode 100644 arch/arm64/include/asm/kprobes.h
>  create mode 100644 arch/arm64/include/asm/probes.h
>  create mode 100644 arch/arm64/kernel/kprobes-arm64.c
>  create mode 100644 arch/arm64/kernel/kprobes-arm64.h
>  create mode 100644 arch/arm64/kernel/kprobes.c
>  create mode 100644 arch/arm64/kernel/kprobes.h
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 966091f..45a9bd81 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -71,6 +71,7 @@ config ARM64
> select HAVE_REGS_AND_STACK_ACCESS_API
> select HAVE_RCU_TABLE_FREE
> select HAVE_SYSCALL_TRACEPOINTS
> +   select HAVE_KPROBES
> select IRQ_DOMAIN
> select MODULES_USE_ELF_RELA
> select NO_BOOTMEM
> diff --git a/arch/arm64/include/asm/debug-monitors.h 
> b/arch/arm64/include/asm/debug-monitors.h
> index 40ec68a..92d7cea 100644
> --- a/arch/arm64/include/asm/debug-monitors.h
> +++ b/arch/arm64/include/asm/debug-monitors.h
> @@ -90,6 +90,11 @@
>
>  #define CACHE_FLUSH_IS_SAFE1
>
> +/* kprobes BRK opcodes with ESR encoding  */
> +#define BRK64_ESR_MASK 0x
> +#define BRK64_ESR_KPROBES  0x0004
> +#define BRK64_OPCODE_KPROBES   (AARCH64_BREAK_MON | (BRK64_ESR_KPROBES << 5))
> +
>  /* AArch32 */
>  #define DBG_ESR_EVT_BKPT   0x4
>  #define DBG_ESR_EVT_VECC   0x5
> diff --git a/arch/arm64/include/asm/kprobes.h 
> b/arch/arm64/include/asm/kprobes.h
> new file mode 100644
> index 000..af31c4d
> --- /dev/null
> +++ b/arch/arm64/include/asm/kprobes.h
> @@ -0,0 +1,62 @@
> +/*
> + * arch/arm64/include/asm/kprobes.h
> + *
> + * Copyright (C) 2013 Linaro Limited
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Softw

Re: [PATCH v7 0/7] arm64: Add kernel probes (kprobes) support

2015-06-29 Thread Steve Capper
On 15 June 2015 at 20:07, David Long  wrote:
> From: "David A. Long" 
>
> This patchset is heavily based on Sandeepa Prabhu's ARM v8 kprobes patches,
> first seen in October 2013. This version attempts to address concerns raised 
> by
> reviewers and also fixes problems discovered during testing.
>
> This patchset adds support for kernel probes(kprobes), jump probes(jprobes)
> and return probes(kretprobes) support for ARM64.
>
> The kprobes mechanism makes use of software breakpoint and single stepping
> support available in the ARM v8 kernel.
>

Hi David,
Thanks for this, and apologies for getting to this late...
I've had a good read through the patches in this series, and have some comments.

Cheers,
--
Steve

> The is patch depends on:
> [PATCH 1/2] Move the pt_regs_offset struct definition from arch to 
> common include file
> [PATCH 2/2] Consolidate redundant register/stack access code
>
> Changes since v2 include:
>
> 1) Removal of NOP padding in kprobe XOL slots. Slots are now exactly one
> instruction long.
> 2) Disabling of interrupts during execution in single-step mode.
> 3) Fixing of numerous problems in instruction simulation code (mostly
> thanks to Will Cohen).
> 4) Support for the HAVE_REGS_AND_STACK_ACCESS_API feature is added, to allow
> access to kprobes through debugfs.
> 5) kprobes is *not* enabled in defconfig.
> 6) Numerous complaints from checkpatch have been cleaned up, although a couple
> remain as removing the function pointer typedefs results in ugly code.
>
> Changes since v3 include:
>
> 1) Remove table-driven instruction parsing and replace with an if statement
> calling out to old and new instruction test functions in insn.c.
> 2) I removed the addition of orig_x0 to ptrace.h.
> 3) Reorder the patches.
> 4) Replace the previous interrupt disabling (from Will Cohen) with
> an improved solution (from Steve Capper).
>
> Changes since v4 include:
>
> 1) Added insn.c functions to detect exception instructions and DAIF
>read/write instructions, and use them to reject probing same.
> 2) Changed adr detect function to also recognize adrp. Reject both.
> 3) Added missing __kprobes for some new functions.
> 4) Added call to kprobes_fault_handler from mm do_page_fault.
> 5) Reject all non-simulated branch/ret instructions, not just those
>that use an immediate offset.
> 6) Moved software breakpoint definitions into debug-monitors.h.
> 7) Removed "!XIP_KERNEL" from Kconfig.
> 8) changed kprobes_condition_check_t and kprobes_prepare_t to probes_*,
>for future sharing with uprobes.
> 9) Removed bogus call to kprobes_restore_local_irqflag() from
>trampoline_probe_handler().
>
> Changes since v5 include:
>
> 1) Replaced installation of breakpoint hook with direct call from the
> handlers in debug-monitors.c, as requested.
> 2) Reject probing of instructions that read the interrupt mask, in
> addition to instructions that set it.
> 3) Cleaned up comments describing usage of Debug Mask.
> 4) Added KPROBE_REENTER case in reenter_kprobe.
> 5) Corrected the ifdef'd definitions for notify_page_fault() to be
> consistent when KPROBES is not configed.
> 6) Changed "cpsr" to "pstate" for HAVE_REGS_AND_STACK_ACCESS_API feature.
> 7) Added back in missing new files in previous patch.
> 8) Changed two instances of pr_warning() to pr_warn().
>
> Note that there seems to be at least a potential issue with kprobes
> on multiple (possibly all) platforms having to do with use of kfree
> inside of the kretprobes trampoline handler.  This has manifested
> occasionally in systemtap testing on arm64.  There does not appear to
> be a simple solution to the problem.
>
> Changes since v6 include:
>
> 1) New trampoline code from Will Cohen fixes the occasional failure seen
> when processing kretprobes by replacing the software breakpoint with
> assembly code to implement the return to the original execution stream.
> 2) Changed ip0, ip1, fp, and lr to plain numbered registers for purposes
> of recognizing them as an ascii string in the stack/reg access code.
> 3) Removed orig_x0.
> 4) Moved ARM_x* defines from arch/arm64/include/uapi/asm/ptrace.h to
> arch/arm64/kernel/ptrace.c.
>
> David A. Long (2):
>   arm64: Add HAVE_REGS_AND_STACK_ACCESS_API feature
>   arm64: Add more test functions to insn.c
>
> Sandeepa Prabhu (4):
>   arm64: Kprobes with single stepping support
>   arm64: kprobes instruction simulation support
>   arm64: Add kernel return probes support (kretprobes)
>   kprobes: Add arm64 case in kprobe example module
>
> William Cohen (1):
>   arm64: Add trampoline code for kretprobes
>
>  arch/arm64/Kconfig   | 

Re: [PATCH v7 1/7] arm64: Add HAVE_REGS_AND_STACK_ACCESS_API feature

2015-06-29 Thread Steve Capper
On 15 June 2015 at 20:07, David Long  wrote:
> From: "David A. Long" 
>
> Add HAVE_REGS_AND_STACK_ACCESS_API feature for arm64.
>
> Signed-off-by: David A. Long 
> ---
>  arch/arm64/Kconfig  |  1 +
>  arch/arm64/include/asm/ptrace.h | 25 +
>  arch/arm64/kernel/ptrace.c  | 77 
> +
>  3 files changed, 103 insertions(+)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 7796af4..966091f 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -68,6 +68,7 @@ config ARM64
> select HAVE_PERF_EVENTS
> select HAVE_PERF_REGS
> select HAVE_PERF_USER_STACK_DUMP
> +   select HAVE_REGS_AND_STACK_ACCESS_API
> select HAVE_RCU_TABLE_FREE
> select HAVE_SYSCALL_TRACEPOINTS
> select IRQ_DOMAIN
> diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h
> index d6dd9fd..8f440e9 100644
> --- a/arch/arm64/include/asm/ptrace.h
> +++ b/arch/arm64/include/asm/ptrace.h
> @@ -118,6 +118,8 @@ struct pt_regs {
> u64 syscallno;
>  };
>
> +#define MAX_REG_OFFSET (sizeof(struct user_pt_regs) - sizeof(u64))
> +
>  #define arch_has_single_step() (1)
>
>  #ifdef CONFIG_COMPAT
> @@ -146,6 +148,29 @@ struct pt_regs {
>  #define user_stack_pointer(regs) \
> (!compat_user_mode(regs) ? (regs)->sp : (regs)->compat_sp)
>
> +/**
> + * regs_get_register() - get register value from its offset
> + * @regs: pt_regs from which register value is gotten
> + * @offset:offset number of the register.
> + *
> + * regs_get_register returns the value of a register whose offset from @regs.
> + * The @offset is the offset of the register in struct pt_regs.
> + * If @offset is bigger than MAX_REG_OFFSET, this returns 0.
> + */
> +static inline u64 regs_get_register(struct pt_regs *regs,
> + unsigned int offset)
> +{
> +   if (unlikely(offset > MAX_REG_OFFSET))
> +   return 0;
> +   return *(u64 *)((u64)regs + offset);

Why not:
return regs->regs[offset];

> +}
> +
> +/* Valid only for Kernel mode traps. */
> +static inline unsigned long kernel_stack_pointer(struct pt_regs *regs)
> +{
> +   return regs->sp;
> +}
> +
>  static inline unsigned long regs_return_value(struct pt_regs *regs)
>  {
> return regs->regs[0];
> diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
> index d882b83..f6199a5 100644
> --- a/arch/arm64/kernel/ptrace.c
> +++ b/arch/arm64/kernel/ptrace.c
> @@ -48,6 +48,83 @@
>  #define CREATE_TRACE_POINTS
>  #include 
>
> +#define ARM_pstate pstate
> +#define ARM_pc pc
> +#define ARM_sp sp
> +#define ARM_x30regs[30]
> +#define ARM_x29regs[29]
> +#define ARM_x28regs[28]
> +#define ARM_x27regs[27]
> +#define ARM_x26regs[26]
> +#define ARM_x25regs[25]
> +#define ARM_x24regs[24]
> +#define ARM_x23regs[23]
> +#define ARM_x22regs[22]
> +#define ARM_x21regs[21]
> +#define ARM_x20regs[20]
> +#define ARM_x19regs[19]
> +#define ARM_x18regs[18]
> +#define ARM_x17regs[17]
> +#define ARM_x16regs[16]
> +#define ARM_x15regs[15]
> +#define ARM_x14regs[14]
> +#define ARM_x13regs[13]
> +#define ARM_x12regs[12]
> +#define ARM_x11regs[11]
> +#define ARM_x10regs[10]
> +#define ARM_x9 regs[9]
> +#define ARM_x8 regs[8]
> +#define ARM_x7 regs[7]
> +#define ARM_x6 regs[6]
> +#define ARM_x5 regs[5]
> +#define ARM_x4 regs[4]
> +#define ARM_x3 regs[3]
> +#define ARM_x2 regs[2]
> +#define ARM_x1 regs[1]
> +#define ARM_x0 regs[0]
> +
> +#define REG_OFFSET_NAME(r) \
> +   {.name = #r, .offset = offsetof(struct pt_regs, ARM_##r)}
> +#define REG_OFFSET_END {.name = NULL, .offset = 0}
> +
> +const struct pt_regs_offset regs_offset_table[] = {
> +   REG_OFFSET_NAME(x0),
> +   REG_OFFSET_NAME(x1),
> +   REG_OFFSET_NAME(x2),
> +   REG_OFFSET_NAME(x3),
> +   REG_OFFSET_NAME(x4),
> +   REG_OFFSET_NAME(x5),
> +   REG_OFFSET_NAME(x6),
> +   REG_OFFSET_NAME(x7),
> +   REG_OFFSET_NAME(x8),
> +   REG_OFFSET_NAME(x9),
> +   REG_OFFSET_NAME(x10),
> +   REG_OFFSET_NAME(x11),
> +   REG_OFFSET_NAME(x12),
> +   REG_OFFSET_NAME(x13),
> +   REG_OFFSET_NAME(x14),
> +   REG_OFFSET_NAME(x15),
> +   REG_OFFSET_NAME(x16),
> +   REG_OFFSET_NAME(x17),
> +   REG_OFFSET_NAME(x18),
> +   REG_OFFSET_NAME(x19),
> +   REG_OFFSET_NAME(x20),
> +   REG_OFFSET_NAME(x21),
> +   REG_OFFSET_NAME(x22),
> +   REG_OFFSET_NAME(x23),
> +   REG_OFFSET_NAME(x24),
> +   REG_OFFSET_NAME(x25),
> +   

Re: [PATCH v7 0/7] arm64: Add kernel probes (kprobes) support

2015-06-29 Thread Steve Capper
On 15 June 2015 at 20:07, David Long dave.l...@linaro.org wrote:
 From: David A. Long dave.l...@linaro.org

 This patchset is heavily based on Sandeepa Prabhu's ARM v8 kprobes patches,
 first seen in October 2013. This version attempts to address concerns raised 
 by
 reviewers and also fixes problems discovered during testing.

 This patchset adds support for kernel probes(kprobes), jump probes(jprobes)
 and return probes(kretprobes) support for ARM64.

 The kprobes mechanism makes use of software breakpoint and single stepping
 support available in the ARM v8 kernel.


Hi David,
Thanks for this, and apologies for getting to this late...
I've had a good read through the patches in this series, and have some comments.

Cheers,
--
Steve

 The is patch depends on:
 [PATCH 1/2] Move the pt_regs_offset struct definition from arch to 
 common include file
 [PATCH 2/2] Consolidate redundant register/stack access code

 Changes since v2 include:

 1) Removal of NOP padding in kprobe XOL slots. Slots are now exactly one
 instruction long.
 2) Disabling of interrupts during execution in single-step mode.
 3) Fixing of numerous problems in instruction simulation code (mostly
 thanks to Will Cohen).
 4) Support for the HAVE_REGS_AND_STACK_ACCESS_API feature is added, to allow
 access to kprobes through debugfs.
 5) kprobes is *not* enabled in defconfig.
 6) Numerous complaints from checkpatch have been cleaned up, although a couple
 remain as removing the function pointer typedefs results in ugly code.

 Changes since v3 include:

 1) Remove table-driven instruction parsing and replace with an if statement
 calling out to old and new instruction test functions in insn.c.
 2) I removed the addition of orig_x0 to ptrace.h.
 3) Reorder the patches.
 4) Replace the previous interrupt disabling (from Will Cohen) with
 an improved solution (from Steve Capper).

 Changes since v4 include:

 1) Added insn.c functions to detect exception instructions and DAIF
read/write instructions, and use them to reject probing same.
 2) Changed adr detect function to also recognize adrp. Reject both.
 3) Added missing __kprobes for some new functions.
 4) Added call to kprobes_fault_handler from mm do_page_fault.
 5) Reject all non-simulated branch/ret instructions, not just those
that use an immediate offset.
 6) Moved software breakpoint definitions into debug-monitors.h.
 7) Removed !XIP_KERNEL from Kconfig.
 8) changed kprobes_condition_check_t and kprobes_prepare_t to probes_*,
for future sharing with uprobes.
 9) Removed bogus call to kprobes_restore_local_irqflag() from
trampoline_probe_handler().

 Changes since v5 include:

 1) Replaced installation of breakpoint hook with direct call from the
 handlers in debug-monitors.c, as requested.
 2) Reject probing of instructions that read the interrupt mask, in
 addition to instructions that set it.
 3) Cleaned up comments describing usage of Debug Mask.
 4) Added KPROBE_REENTER case in reenter_kprobe.
 5) Corrected the ifdef'd definitions for notify_page_fault() to be
 consistent when KPROBES is not configed.
 6) Changed cpsr to pstate for HAVE_REGS_AND_STACK_ACCESS_API feature.
 7) Added back in missing new files in previous patch.
 8) Changed two instances of pr_warning() to pr_warn().

 Note that there seems to be at least a potential issue with kprobes
 on multiple (possibly all) platforms having to do with use of kfree
 inside of the kretprobes trampoline handler.  This has manifested
 occasionally in systemtap testing on arm64.  There does not appear to
 be a simple solution to the problem.

 Changes since v6 include:

 1) New trampoline code from Will Cohen fixes the occasional failure seen
 when processing kretprobes by replacing the software breakpoint with
 assembly code to implement the return to the original execution stream.
 2) Changed ip0, ip1, fp, and lr to plain numbered registers for purposes
 of recognizing them as an ascii string in the stack/reg access code.
 3) Removed orig_x0.
 4) Moved ARM_x* defines from arch/arm64/include/uapi/asm/ptrace.h to
 arch/arm64/kernel/ptrace.c.

 David A. Long (2):
   arm64: Add HAVE_REGS_AND_STACK_ACCESS_API feature
   arm64: Add more test functions to insn.c

 Sandeepa Prabhu (4):
   arm64: Kprobes with single stepping support
   arm64: kprobes instruction simulation support
   arm64: Add kernel return probes support (kretprobes)
   kprobes: Add arm64 case in kprobe example module

 William Cohen (1):
   arm64: Add trampoline code for kretprobes

  arch/arm64/Kconfig   |   3 +
  arch/arm64/include/asm/debug-monitors.h  |   5 +
  arch/arm64/include/asm/insn.h|  18 +
  arch/arm64/include/asm/kprobes.h |  63 +++
  arch/arm64/include/asm/probes.h  |  50 +++
  arch/arm64/include/asm/ptrace.h  |  28 +-
  arch/arm64/kernel/Makefile   |   3 +
  arch/arm64/kernel/debug-monitors.c   |  35 +-
  arch/arm64/kernel/insn.c

Re: [PATCH v7 1/7] arm64: Add HAVE_REGS_AND_STACK_ACCESS_API feature

2015-06-29 Thread Steve Capper
On 15 June 2015 at 20:07, David Long dave.l...@linaro.org wrote:
 From: David A. Long dave.l...@linaro.org

 Add HAVE_REGS_AND_STACK_ACCESS_API feature for arm64.

 Signed-off-by: David A. Long dave.l...@linaro.org
 ---
  arch/arm64/Kconfig  |  1 +
  arch/arm64/include/asm/ptrace.h | 25 +
  arch/arm64/kernel/ptrace.c  | 77 
 +
  3 files changed, 103 insertions(+)

 diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
 index 7796af4..966091f 100644
 --- a/arch/arm64/Kconfig
 +++ b/arch/arm64/Kconfig
 @@ -68,6 +68,7 @@ config ARM64
 select HAVE_PERF_EVENTS
 select HAVE_PERF_REGS
 select HAVE_PERF_USER_STACK_DUMP
 +   select HAVE_REGS_AND_STACK_ACCESS_API
 select HAVE_RCU_TABLE_FREE
 select HAVE_SYSCALL_TRACEPOINTS
 select IRQ_DOMAIN
 diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h
 index d6dd9fd..8f440e9 100644
 --- a/arch/arm64/include/asm/ptrace.h
 +++ b/arch/arm64/include/asm/ptrace.h
 @@ -118,6 +118,8 @@ struct pt_regs {
 u64 syscallno;
  };

 +#define MAX_REG_OFFSET (sizeof(struct user_pt_regs) - sizeof(u64))
 +
  #define arch_has_single_step() (1)

  #ifdef CONFIG_COMPAT
 @@ -146,6 +148,29 @@ struct pt_regs {
  #define user_stack_pointer(regs) \
 (!compat_user_mode(regs) ? (regs)-sp : (regs)-compat_sp)

 +/**
 + * regs_get_register() - get register value from its offset
 + * @regs: pt_regs from which register value is gotten
 + * @offset:offset number of the register.
 + *
 + * regs_get_register returns the value of a register whose offset from @regs.
 + * The @offset is the offset of the register in struct pt_regs.
 + * If @offset is bigger than MAX_REG_OFFSET, this returns 0.
 + */
 +static inline u64 regs_get_register(struct pt_regs *regs,
 + unsigned int offset)
 +{
 +   if (unlikely(offset  MAX_REG_OFFSET))
 +   return 0;
 +   return *(u64 *)((u64)regs + offset);

Why not:
return regs-regs[offset];

 +}
 +
 +/* Valid only for Kernel mode traps. */
 +static inline unsigned long kernel_stack_pointer(struct pt_regs *regs)
 +{
 +   return regs-sp;
 +}
 +
  static inline unsigned long regs_return_value(struct pt_regs *regs)
  {
 return regs-regs[0];
 diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
 index d882b83..f6199a5 100644
 --- a/arch/arm64/kernel/ptrace.c
 +++ b/arch/arm64/kernel/ptrace.c
 @@ -48,6 +48,83 @@
  #define CREATE_TRACE_POINTS
  #include trace/events/syscalls.h

 +#define ARM_pstate pstate
 +#define ARM_pc pc
 +#define ARM_sp sp
 +#define ARM_x30regs[30]
 +#define ARM_x29regs[29]
 +#define ARM_x28regs[28]
 +#define ARM_x27regs[27]
 +#define ARM_x26regs[26]
 +#define ARM_x25regs[25]
 +#define ARM_x24regs[24]
 +#define ARM_x23regs[23]
 +#define ARM_x22regs[22]
 +#define ARM_x21regs[21]
 +#define ARM_x20regs[20]
 +#define ARM_x19regs[19]
 +#define ARM_x18regs[18]
 +#define ARM_x17regs[17]
 +#define ARM_x16regs[16]
 +#define ARM_x15regs[15]
 +#define ARM_x14regs[14]
 +#define ARM_x13regs[13]
 +#define ARM_x12regs[12]
 +#define ARM_x11regs[11]
 +#define ARM_x10regs[10]
 +#define ARM_x9 regs[9]
 +#define ARM_x8 regs[8]
 +#define ARM_x7 regs[7]
 +#define ARM_x6 regs[6]
 +#define ARM_x5 regs[5]
 +#define ARM_x4 regs[4]
 +#define ARM_x3 regs[3]
 +#define ARM_x2 regs[2]
 +#define ARM_x1 regs[1]
 +#define ARM_x0 regs[0]
 +
 +#define REG_OFFSET_NAME(r) \
 +   {.name = #r, .offset = offsetof(struct pt_regs, ARM_##r)}
 +#define REG_OFFSET_END {.name = NULL, .offset = 0}
 +
 +const struct pt_regs_offset regs_offset_table[] = {
 +   REG_OFFSET_NAME(x0),
 +   REG_OFFSET_NAME(x1),
 +   REG_OFFSET_NAME(x2),
 +   REG_OFFSET_NAME(x3),
 +   REG_OFFSET_NAME(x4),
 +   REG_OFFSET_NAME(x5),
 +   REG_OFFSET_NAME(x6),
 +   REG_OFFSET_NAME(x7),
 +   REG_OFFSET_NAME(x8),
 +   REG_OFFSET_NAME(x9),
 +   REG_OFFSET_NAME(x10),
 +   REG_OFFSET_NAME(x11),
 +   REG_OFFSET_NAME(x12),
 +   REG_OFFSET_NAME(x13),
 +   REG_OFFSET_NAME(x14),
 +   REG_OFFSET_NAME(x15),
 +   REG_OFFSET_NAME(x16),
 +   REG_OFFSET_NAME(x17),
 +   REG_OFFSET_NAME(x18),
 +   REG_OFFSET_NAME(x19),
 +   REG_OFFSET_NAME(x20),
 +   REG_OFFSET_NAME(x21),
 +   REG_OFFSET_NAME(x22),
 +   REG_OFFSET_NAME(x23),
 +   REG_OFFSET_NAME(x24),
 +   REG_OFFSET_NAME(x25),
 +   REG_OFFSET_NAME(x26),
 +   REG_OFFSET_NAME(x27),
 +   

Re: [PATCH v7 5/7] arm64: Add trampoline code for kretprobes

2015-06-29 Thread Steve Capper
On 15 June 2015 at 20:07, David Long dave.l...@linaro.org wrote:
 From: William Cohen wco...@redhat.com

 The trampoline code is used by kretprobes to capture a return from a probed
 function.  This is done by saving the registers, calling the handler, and
 restoring the registers.  The code then returns to the original saved caller
 return address.  It is necessary to do this directly instead of using a
 software breakpoint because the code used in processing that breakpoint
 could itself be kprobe'd and cause a problematic reentry into the debug
 exception handler.

 Signed-off-by: William Cohen wco...@redhat.com
 Signed-off-by: David A. Long dave.l...@linaro.org
 ---
  arch/arm64/include/asm/kprobes.h  |  1 +
  arch/arm64/kernel/kprobes-arm64.h | 41 
 +++
  arch/arm64/kernel/kprobes.c   | 26 +
  3 files changed, 68 insertions(+)

 diff --git a/arch/arm64/include/asm/kprobes.h 
 b/arch/arm64/include/asm/kprobes.h
 index af31c4d..d081f49 100644
 --- a/arch/arm64/include/asm/kprobes.h
 +++ b/arch/arm64/include/asm/kprobes.h
 @@ -58,5 +58,6 @@ int kprobe_exceptions_notify(struct notifier_block *self,
  unsigned long val, void *data);
  int kprobe_breakpoint_handler(struct pt_regs *regs, unsigned int esr);
  int kprobe_single_step_handler(struct pt_regs *regs, unsigned int esr);
 +void kretprobe_trampoline(void);

  #endif /* _ARM_KPROBES_H */
 diff --git a/arch/arm64/kernel/kprobes-arm64.h 
 b/arch/arm64/kernel/kprobes-arm64.h
 index ff8a55f..bdcfa62 100644
 --- a/arch/arm64/kernel/kprobes-arm64.h
 +++ b/arch/arm64/kernel/kprobes-arm64.h
 @@ -27,4 +27,45 @@ extern kprobes_pstate_check_t * const 
 kprobe_condition_checks[16];
  enum kprobe_insn __kprobes
  arm_kprobe_decode_insn(kprobe_opcode_t insn, struct arch_specific_insn *asi);

 +#define SAVE_REGS_STRING\
 +  stp x0, x1, [sp, #16 * 0]\n\
 +  stp x2, x3, [sp, #16 * 1]\n\
 +  stp x4, x5, [sp, #16 * 2]\n\
 +  stp x6, x7, [sp, #16 * 3]\n\
 +  stp x8, x9, [sp, #16 * 4]\n\
 +  stp x10, x11, [sp, #16 * 5]\n  \
 +  stp x12, x13, [sp, #16 * 6]\n  \
 +  stp x14, x15, [sp, #16 * 7]\n  \
 +  stp x16, x17, [sp, #16 * 8]\n  \
 +  stp x18, x19, [sp, #16 * 9]\n  \
 +  stp x20, x21, [sp, #16 * 10]\n \
 +  stp x22, x23, [sp, #16 * 11]\n \
 +  stp x24, x25, [sp, #16 * 12]\n \
 +  stp x26, x27, [sp, #16 * 13]\n \
 +  stp x28, x29, [sp, #16 * 14]\n \
 +  str x30,   [sp, #16 * 15]\n\
 +  mrs x0, nzcv\n \
 +  str x0, [sp, #8 * 33]\n
 +
 +
 +#define RESTORE_REGS_STRING\
 +  ldr x0, [sp, #8 * 33]\n\
 +  msr nzcv, x0\n \
 +  ldp x0, x1, [sp, #16 * 0]\n\
 +  ldp x2, x3, [sp, #16 * 1]\n\
 +  ldp x4, x5, [sp, #16 * 2]\n\
 +  ldp x6, x7, [sp, #16 * 3]\n\
 +  ldp x8, x9, [sp, #16 * 4]\n\
 +  ldp x10, x11, [sp, #16 * 5]\n  \
 +  ldp x12, x13, [sp, #16 * 6]\n  \
 +  ldp x14, x15, [sp, #16 * 7]\n  \
 +  ldp x16, x17, [sp, #16 * 8]\n  \
 +  ldp x18, x19, [sp, #16 * 9]\n  \
 +  ldp x20, x21, [sp, #16 * 10]\n \
 +  ldp x22, x23, [sp, #16 * 11]\n \
 +  ldp x24, x25, [sp, #16 * 12]\n \
 +  ldp x26, x27, [sp, #16 * 13]\n \
 +  ldp x28, x29, [sp, #16 * 14]\n \
 +  ldr x30,   [sp, #16 * 15]\n

Do we need to restore x19..x28 as they are callee-saved?

Okay this all matches up with the definitions of the pt_regs struct.
So regs-regs[xn] are all set as is regs-pstate.

The hard coded constant offsets make me nervous though, as does the
uncertain state of the other elements of the pt_regs struct.

 +
  #endif /* _ARM_KERNEL_KPROBES_ARM64_H */
 diff --git a/arch/arm64/kernel/kprobes.c b/arch/arm64/kernel/kprobes.c
 index 6255814..570218c 100644
 --- a/arch/arm64/kernel/kprobes.c
 +++ b/arch/arm64/kernel/kprobes.c
 @@ -560,6 +560,32 @@ int __kprobes longjmp_break_handler(struct kprobe *p, 
 struct pt_regs *regs)
 return 0;
  }

 +/*
 + * When a retprobed function returns, this code saves registers and
 + * calls trampoline_handler(), which calls the kretprobe's handler.
 + */
 +static void __used __kprobes kretprobe_trampoline_holder(void)
 +{
 +   asm volatile (.global kretprobe_trampoline\n
 +   kretprobe_trampoline:\n
 +   sub sp, sp, %0\n
 +   SAVE_REGS_STRING
 +   mov x0, sp\n
 +   bl trampoline_probe_handler\n
 +   /* Replace trampoline address in lr with actual
 +  orig_ret_addr return address. */
 +   str x0, [sp, #16 

Re: [PATCH v7 3/7] arm64: Kprobes with single stepping support

2015-06-29 Thread Steve Capper
On 15 June 2015 at 20:07, David Long dave.l...@linaro.org wrote:
 From: Sandeepa Prabhu sandeepa.pra...@linaro.org

 Add support for basic kernel probes(kprobes) and jump probes
 (jprobes) for ARM64.

 Kprobes utilizes software breakpoint and single step debug
 exceptions supported on ARM v8.

 A software breakpoint is placed at the probe address to trap the
 kernel execution into the kprobe handler.

 ARM v8 supports enabling single stepping before the break exception
 return (ERET), with next PC in exception return address (ELR_EL1). The
 kprobe handler prepares an executable memory slot for out-of-line
 execution with a copy of the original instruction being probed, and
 enables single stepping. The PC is set to the out-of-line slot address
 before the ERET. With this scheme, the instruction is executed with the
 exact same register context except for the PC (and DAIF) registers.

 Debug mask (PSTATE.D) is enabled only when single stepping a recursive
 kprobe, e.g.: during kprobes reenter so that probed instruction can be
 single stepped within the kprobe handler -exception- context.
 The recursion depth of kprobe is always 2, i.e. upon probe re-entry,
 any further re-entry is prevented by not calling handlers and the case
 counted as a missed kprobe).

 Single stepping from the x-o-l slot has a drawback for PC-relative accesses
 like branching and symbolic literals access as the offset from the new PC
 (slot address) may not be ensured to fit in the immediate value of
 the opcode. Such instructions need simulation, so reject
 probing them.

 Instructions generating exceptions or cpu mode change are rejected
 for probing.

 Instructions using Exclusive Monitor are rejected too.

 System instructions are mostly enabled for stepping, except MSR/MRS
 accesses to DAIF flags in PSTATE, which are not safe for
 probing.

 Thanks to Steve Capper and Pratyush Anand for several suggested
 Changes.

 Signed-off-by: Sandeepa Prabhu sandeepa.pra...@linaro.org
 Signed-off-by: Steve Capper steve.cap...@linaro.org
 Signed-off-by: David A. Long dave.l...@linaro.org
 ---
  arch/arm64/Kconfig  |   1 +
  arch/arm64/include/asm/debug-monitors.h |   5 +
  arch/arm64/include/asm/kprobes.h|  62 
  arch/arm64/include/asm/probes.h |  50 +++
  arch/arm64/include/asm/ptrace.h |   3 +-
  arch/arm64/kernel/Makefile  |   1 +
  arch/arm64/kernel/debug-monitors.c  |  35 ++-
  arch/arm64/kernel/kprobes-arm64.c   |  68 
  arch/arm64/kernel/kprobes-arm64.h   |  28 ++
  arch/arm64/kernel/kprobes.c | 537 
 
  arch/arm64/kernel/kprobes.h |  24 ++
  arch/arm64/kernel/vmlinux.lds.S |   1 +
  arch/arm64/mm/fault.c   |  25 ++
  13 files changed, 829 insertions(+), 11 deletions(-)
  create mode 100644 arch/arm64/include/asm/kprobes.h
  create mode 100644 arch/arm64/include/asm/probes.h
  create mode 100644 arch/arm64/kernel/kprobes-arm64.c
  create mode 100644 arch/arm64/kernel/kprobes-arm64.h
  create mode 100644 arch/arm64/kernel/kprobes.c
  create mode 100644 arch/arm64/kernel/kprobes.h

 diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
 index 966091f..45a9bd81 100644
 --- a/arch/arm64/Kconfig
 +++ b/arch/arm64/Kconfig
 @@ -71,6 +71,7 @@ config ARM64
 select HAVE_REGS_AND_STACK_ACCESS_API
 select HAVE_RCU_TABLE_FREE
 select HAVE_SYSCALL_TRACEPOINTS
 +   select HAVE_KPROBES
 select IRQ_DOMAIN
 select MODULES_USE_ELF_RELA
 select NO_BOOTMEM
 diff --git a/arch/arm64/include/asm/debug-monitors.h 
 b/arch/arm64/include/asm/debug-monitors.h
 index 40ec68a..92d7cea 100644
 --- a/arch/arm64/include/asm/debug-monitors.h
 +++ b/arch/arm64/include/asm/debug-monitors.h
 @@ -90,6 +90,11 @@

  #define CACHE_FLUSH_IS_SAFE1

 +/* kprobes BRK opcodes with ESR encoding  */
 +#define BRK64_ESR_MASK 0x
 +#define BRK64_ESR_KPROBES  0x0004
 +#define BRK64_OPCODE_KPROBES   (AARCH64_BREAK_MON | (BRK64_ESR_KPROBES << 5))
 +
  /* AArch32 */
  #define DBG_ESR_EVT_BKPT   0x4
  #define DBG_ESR_EVT_VECC   0x5
 diff --git a/arch/arm64/include/asm/kprobes.h 
 b/arch/arm64/include/asm/kprobes.h
 new file mode 100644
 index 000..af31c4d
 --- /dev/null
 +++ b/arch/arm64/include/asm/kprobes.h
 @@ -0,0 +1,62 @@
 +/*
 + * arch/arm64/include/asm/kprobes.h
 + *
 + * Copyright (C) 2013 Linaro Limited
 + *
 + * This program is free software; you can redistribute it and/or modify
 + * it under the terms of the GNU General Public License version 2 as
 + * published by the Free Software Foundation.
 + *
 + * This program is distributed in the hope that it will be useful,
 + * but WITHOUT ANY WARRANTY; without even the implied warranty of
 + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 + * General Public License for more details.
 + */
 +
 +#ifndef _ARM_KPROBES_H
 +#define _ARM_KPROBES_H
 +
 +#include <linux/types.h>
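
As an aside on the PC-relative point in the commit message above: a quick
user-space sketch (with made-up example addresses) of why an adr cannot
simply be re-executed from an out-of-line slot. Its immediate only reaches
+/- 1 MiB from wherever the instruction actually runs, so such instructions
get simulated instead.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* adr encodes a signed 21-bit byte offset from the PC (+/- 1 MiB) */
static bool adr_offset_reachable(uint64_t pc, uint64_t target)
{
	int64_t off = (int64_t)(target - pc);

	return off >= -(1 << 20) && off < (1 << 20);
}

int main(void)
{
	uint64_t probe_pc = 0xffff000008123000ULL;	/* made-up probe address */
	uint64_t slot_pc  = 0xffff000009a00000ULL;	/* made-up XOL slot      */
	uint64_t target   = probe_pc + 0x400;

	printf("reachable from probe site: %d\n",
	       adr_offset_reachable(probe_pc, target));
	printf("reachable from XOL slot:   %d\n",
	       adr_offset_reachable(slot_pc, target));
	return 0;
}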

Re: [PATCH v7 4/7] arm64: kprobes instruction simulation support

2015-06-29 Thread Steve Capper
On 15 June 2015 at 20:07, David Long dave.l...@linaro.org wrote:
 From: Sandeepa Prabhu sandeepa.pra...@linaro.org

 Kprobes needs simulation of instructions that cannot be stepped
 from different memory location, e.g.: those instructions
 that uses PC-relative addressing. In simulation, the behaviour
 of the instruction is implemented using a copy of pt_regs.

 Following instruction categories are simulated:
  - All branching instructions(conditional, register, and immediate)
  - Literal access instructions(load-literal, adr/adrp)

 Conditional execution is limited to branching instructions in
 ARM v8. If conditions at PSTATE do not match the condition fields
 of opcode, the instruction is effectively NOP. Kprobes considers
 this case as 'miss'.

 Thanks to Will Cohen for assorted suggested changes.

 Signed-off-by: Sandeepa Prabhu sandeepa.pra...@linaro.org
 Signed-off-by: William Cohen wco...@redhat.com
 Signed-off-by: David A. Long dave.l...@linaro.org
 ---
  arch/arm64/kernel/Makefile   |   4 +-
  arch/arm64/kernel/kprobes-arm64.c|  98 +
  arch/arm64/kernel/kprobes-arm64.h|   2 +
  arch/arm64/kernel/kprobes.c  |  35 ++-
  arch/arm64/kernel/probes-condn-check.c   | 122 ++
  arch/arm64/kernel/probes-simulate-insn.c | 174 
 +++
  arch/arm64/kernel/probes-simulate-insn.h |  33 ++
  7 files changed, 464 insertions(+), 4 deletions(-)
  create mode 100644 arch/arm64/kernel/probes-condn-check.c
  create mode 100644 arch/arm64/kernel/probes-simulate-insn.c
  create mode 100644 arch/arm64/kernel/probes-simulate-insn.h

 diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
 index 1319872..5e9d54f 100644
 --- a/arch/arm64/kernel/Makefile
 +++ b/arch/arm64/kernel/Makefile
 @@ -32,7 +32,9 @@ arm64-obj-$(CONFIG_CPU_PM)+= sleep.o suspend.o
  arm64-obj-$(CONFIG_CPU_IDLE)   += cpuidle.o
  arm64-obj-$(CONFIG_JUMP_LABEL) += jump_label.o
  arm64-obj-$(CONFIG_KGDB)   += kgdb.o
 -arm64-obj-$(CONFIG_KPROBES)+= kprobes.o kprobes-arm64.o
 +arm64-obj-$(CONFIG_KPROBES)+= kprobes.o kprobes-arm64.o  
   \
 +  probes-simulate-insn.o 
   \
 +  probes-condn-check.o
  arm64-obj-$(CONFIG_EFI)+= efi.o efi-stub.o 
 efi-entry.o
  arm64-obj-$(CONFIG_PCI)+= pci.o
  arm64-obj-$(CONFIG_ARMV8_DEPRECATED)   += armv8_deprecated.o
 diff --git a/arch/arm64/kernel/kprobes-arm64.c 
 b/arch/arm64/kernel/kprobes-arm64.c
 index f958c52..8a7e6b0 100644
 --- a/arch/arm64/kernel/kprobes-arm64.c
 +++ b/arch/arm64/kernel/kprobes-arm64.c
 @@ -20,6 +20,76 @@
  #include <asm/insn.h>

  #include "kprobes-arm64.h"
 +#include "probes-simulate-insn.h"
 +
 +/*
 + * condition check functions for kprobes simulation
 + */
 +static unsigned long __kprobes
 +__check_pstate(struct kprobe *p, struct pt_regs *regs)
 +{
 +   struct arch_specific_insn *asi = &p->ainsn;
 +   unsigned long pstate = regs->pstate & 0x;
 +
 +   return asi->pstate_cc(pstate);
 +}
 +
 +static unsigned long __kprobes
 +__check_cbz(struct kprobe *p, struct pt_regs *regs)
 +{
 +   return check_cbz((u32)p->opcode, regs);
 +}
 +
 +static unsigned long __kprobes
 +__check_cbnz(struct kprobe *p, struct pt_regs *regs)
 +{
 +   return check_cbnz((u32)p->opcode, regs);
 +}
 +
 +static unsigned long __kprobes
 +__check_tbz(struct kprobe *p, struct pt_regs *regs)
 +{
 +   return check_tbz((u32)p->opcode, regs);
 +}
 +
 +static unsigned long __kprobes
 +__check_tbnz(struct kprobe *p, struct pt_regs *regs)
 +{
 +   return check_tbnz((u32)p->opcode, regs);
 +}
 +
 +/*
 + * prepare functions for instruction simulation
 + */
 +static void __kprobes
 +prepare_none(struct kprobe *p, struct arch_specific_insn *asi)
 +{
 +}
 +
 +static void __kprobes
 +prepare_bcond(struct kprobe *p, struct arch_specific_insn *asi)
 +{
 +   kprobe_opcode_t insn = p->opcode;
 +
 +   asi->check_condn = __check_pstate;
 +   asi->pstate_cc = kprobe_condition_checks[insn & 0xf];
 +}
 +
 +static void __kprobes
 +prepare_cbz_cbnz(struct kprobe *p, struct arch_specific_insn *asi)
 +{
 +   kprobe_opcode_t insn = p->opcode;
 +
 +   asi->check_condn = (insn & (1 << 24)) ? __check_cbnz : __check_cbz;
 +}
 +
 +static void __kprobes
 +prepare_tbz_tbnz(struct kprobe *p, struct arch_specific_insn *asi)
 +{
 +   kprobe_opcode_t insn = p->opcode;
 +
 +   asi->check_condn = (insn & (1 << 24)) ? __check_tbnz : __check_tbz;
 +}

  static bool __kprobes aarch64_insn_is_steppable(u32 insn)
  {
 @@ -63,6 +133,34 @@ arm_kprobe_decode_insn(kprobe_opcode_t insn, struct 
 arch_specific_insn *asi)
  */
 if (aarch64_insn_is_steppable(insn))
 return INSN_GOOD;
 +
 +   asi->prepare = prepare_none;
 +
 +   if (aarch64_insn_is_bcond(insn)) {
 +   
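
For reference, a simplified user-space sketch of what the cbz/cbnz checks
quoted above boil down to (names and the struct are stand-ins, and this
deliberately ignores the sf bit, i.e. the 32-bit W-register forms):

#include <stdbool.h>
#include <stdint.h>

struct fake_regs {
	uint64_t regs[31];
};

/* Rt lives in bits [4:0] of the cbz/cbnz opcode */
static bool sim_check_cbz(uint32_t opcode, struct fake_regs *regs)
{
	return regs->regs[opcode & 0x1f] == 0;
}

static bool sim_check_cbnz(uint32_t opcode, struct fake_regs *regs)
{
	return regs->regs[opcode & 0x1f] != 0;
}

int main(void)
{
	struct fake_regs r = { .regs = { [3] = 0 } };
	uint32_t cbz_x3 = 0xb4000003;	/* cbz x3, . */

	return sim_check_cbz(cbz_x3, &r) ? 0 : 1;
}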

Re: [PATCH v7 2/7] arm64: Add more test functions to insn.c

2015-06-29 Thread Steve Capper
Hi David,
Some comments below.

On 15 June 2015 at 20:07, David Long dave.l...@linaro.org wrote:
 From: David A. Long dave.l...@linaro.org

 Certain instructions are hard to execute correctly out-of-line (as in
 kprobes).  Test functions are added to insn.[hc] to identify these.  The
 instructions include any that use PC-relative addressing, change the PC,
 or change interrupt masking. For efficiency and simplicity test
 functions are also added for small collections of related instructions.

 Signed-off-by: David A. Long dave.l...@linaro.org
 ---
  arch/arm64/include/asm/insn.h | 18 ++
  arch/arm64/kernel/insn.c  | 28 
  2 files changed, 46 insertions(+)

 diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
 index f81b328..1fdd237 100644
 --- a/arch/arm64/include/asm/insn.h
 +++ b/arch/arm64/include/asm/insn.h
 @@ -223,8 +223,13 @@ static __always_inline bool aarch64_insn_is_##abbr(u32 
 code) \
  static __always_inline u32 aarch64_insn_get_##abbr##_value(void) \
  { return (val); }

 +__AARCH64_INSN_FUNCS(adr_adrp, 0x1F00, 0x1000)
 +__AARCH64_INSN_FUNCS(prfm_lit, 0xFF00, 0xD800)
  __AARCH64_INSN_FUNCS(str_reg,  0x3FE0EC00, 0x38206800)
  __AARCH64_INSN_FUNCS(ldr_reg,  0x3FE0EC00, 0x38606800)
 +__AARCH64_INSN_FUNCS(ldr_lit,  0xBF00, 0x1800)
 +__AARCH64_INSN_FUNCS(ldrsw_lit,0xFF00, 0x9800)
 +__AARCH64_INSN_FUNCS(exclusive,0x3F00, 0x0800)

Going one step back, if we're worried about the exclusive monitors
then we'll be worried about instructions in-between the monitor pairs
too?
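
To illustrate: a minimal aarch64-only sketch of a classic exclusive-monitor
retry loop. An exception taken between the ldxr and the stxr (a kprobe BRK
or a single-step on the add, say) will in practice leave the monitor clear
on return, so the stxr keeps failing and the loop can spin forever.

#include <stdint.h>

static inline void inc_exclusive(uint32_t *v)
{
	uint32_t tmp;
	uint32_t failed;

	asm volatile(
	"1:	ldxr	%w0, [%2]\n"		/* load-exclusive sets the monitor  */
	"	add	%w0, %w0, #1\n"		/* trapping here clears the monitor */
	"	stxr	%w1, %w0, [%2]\n"	/* store-exclusive fails if cleared */
	"	cbnz	%w1, 1b\n"
	: "=&r" (tmp), "=&r" (failed)
	: "r" (v)
	: "memory");
}

int main(void)
{
	uint32_t counter = 0;

	inc_exclusive(&counter);
	return counter == 1 ? 0 : 1;
}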


  __AARCH64_INSN_FUNCS(stp_post, 0x7FC0, 0x2880)
  __AARCH64_INSN_FUNCS(ldp_post, 0x7FC0, 0x28C0)
  __AARCH64_INSN_FUNCS(stp_pre,  0x7FC0, 0x2980)
 @@ -264,19 +269,29 @@ __AARCH64_INSN_FUNCS(ands,0x7F20, 
 0x6A00)
  __AARCH64_INSN_FUNCS(bics, 0x7F20, 0x6A20)
  __AARCH64_INSN_FUNCS(b,0xFC00, 0x1400)
  __AARCH64_INSN_FUNCS(bl,   0xFC00, 0x9400)
 +__AARCH64_INSN_FUNCS(b_bl, 0x7C00, 0x1400)
 +__AARCH64_INSN_FUNCS(cb,   0x7E00, 0x3400)
  __AARCH64_INSN_FUNCS(cbz,  0x7F00, 0x3400)
  __AARCH64_INSN_FUNCS(cbnz, 0x7F00, 0x3500)
 +__AARCH64_INSN_FUNCS(tb,   0x7E00, 0x3600)
  __AARCH64_INSN_FUNCS(tbz,  0x7F00, 0x3600)
  __AARCH64_INSN_FUNCS(tbnz, 0x7F00, 0x3700)
 +__AARCH64_INSN_FUNCS(b_bl_cb_tb, 0x5C00, 0x1400)
  __AARCH64_INSN_FUNCS(bcond,0xFF10, 0x5400)
  __AARCH64_INSN_FUNCS(svc,  0xFFE0001F, 0xD401)
  __AARCH64_INSN_FUNCS(hvc,  0xFFE0001F, 0xD402)
  __AARCH64_INSN_FUNCS(smc,  0xFFE0001F, 0xD403)
  __AARCH64_INSN_FUNCS(brk,  0xFFE0001F, 0xD420)
 +__AARCH64_INSN_FUNCS(exception,0xFF00, 0xD400)
  __AARCH64_INSN_FUNCS(hint, 0xF01F, 0xD503201F)
  __AARCH64_INSN_FUNCS(br,   0xFC1F, 0xD61F)
  __AARCH64_INSN_FUNCS(blr,  0xFC1F, 0xD63F)
 +__AARCH64_INSN_FUNCS(br_blr,   0xFFDFFC1F, 0xD61F)
  __AARCH64_INSN_FUNCS(ret,  0xFC1F, 0xD65F)
 +__AARCH64_INSN_FUNCS(msr_imm,  0xFFF8F000, 0xD5004000)

Should this not be:
__AARCH64_INSN_FUNCS(msr_imm,  0xFFF8F01F, 0xD500401F)
As the lower 5 bits of an MSR (immediate) are all 1?
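
A quick user-space check of that point (a sketch only; the encodings below
are worked out by hand, e.g. msr daifset, #2 as 0xd50342df and mrs x0, daif
as 0xd53b4220):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* mask/value classification in the style of __AARCH64_INSN_FUNCS */
static int insn_matches(uint32_t insn, uint32_t mask, uint32_t val)
{
	return (insn & mask) == val;
}

int main(void)
{
	uint32_t msr_daifset_2 = 0xd50342df;	/* msr daifset, #2 */
	uint32_t mrs_x0_daif   = 0xd53b4220;	/* mrs x0, daif    */

	/* MSR (immediate) with the tightened mask: Rt bits forced to 0b11111 */
	assert(insn_matches(msr_daifset_2, 0xfff8f01f, 0xd500401f));
	/* an MRS must not decode as MSR (immediate) */
	assert(!insn_matches(mrs_x0_daif, 0xfff8f01f, 0xd500401f));

	puts("mask/value checks pass");
	return 0;
}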

 +__AARCH64_INSN_FUNCS(msr_reg,  0xFFF0, 0xD510)
 +__AARCH64_INSN_FUNCS(set_clr_daif, 0xF0DF, 0xD50340DF)

Looks good, just an MSR immediate with either DAIFSet or DAIFClr.

 +__AARCH64_INSN_FUNCS(rd_wr_daif, 0xFFDFFFE0, 0xD51B4220)

Looks good, either MRS or MSR (register) where systemreg = DAIF.


  #undef __AARCH64_INSN_FUNCS

 @@ -285,6 +300,9 @@ bool aarch64_insn_is_nop(u32 insn);
  int aarch64_insn_read(void *addr, u32 *insnp);
  int aarch64_insn_write(void *addr, u32 insn);
  enum aarch64_insn_encoding_class aarch64_get_insn_class(u32 insn);
 +bool aarch64_insn_uses_literal(u32 insn);
 +bool aarch64_insn_is_branch(u32 insn);
 +bool aarch64_insn_is_daif_access(u32 insn);
  u64 aarch64_insn_decode_immediate(enum aarch64_insn_imm_type type, u32 insn);
  u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
   u32 insn, u64 imm);
 diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c
 index 9249020..ecd8882 100644
 --- a/arch/arm64/kernel/insn.c
 +++ b/arch/arm64/kernel/insn.c
 @@ -155,6 +155,34 @@ static bool __kprobes __aarch64_insn_hotpatch_safe(u32 
 insn)
 aarch64_insn_is_nop(insn);
  }

 +bool __kprobes aarch64_insn_uses_literal(u32 insn)
 +{
 +   /* ldr/ldrsw (literal), prfm */
 +
 +   return aarch64_insn_is_ldr_lit(insn) ||
 +   aarch64_insn_is_ldrsw_lit(insn) ||
 +   aarch64_insn_is_adr_adrp(insn) ||
 +   aarch64_insn_is_prfm_lit(insn);
 +}
 +
 +bool __kprobes aarch64_insn_is_branch(u32 insn)
 +{
 +   /* b, bl, cb*, tb*, b.cond, br, blr */
 +
 +   return 

Re: [PATCH] ARM: pgtable: Fix typo in the comment

2015-06-09 Thread Steve Capper
On 9 June 2015 at 07:52, Hyuk Myeong  wrote:
> This patch fix a spelling typo in the comment in pgtable-2level.h.
>

Hi Hyuk,

> Signed-off-by: Hyuk Myeong 
> ---
>  arch/arm/include/asm/pgtable-2level.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/arm/include/asm/pgtable-2level.h
> b/arch/arm/include/asm/pgtable-2level.h
> index bfd662e..49f91be 100644
> --- a/arch/arm/include/asm/pgtable-2level.h
> +++ b/arch/arm/include/asm/pgtable-2level.h
> @@ -66,7 +66,7 @@
>   *
>   * However, when the "young" bit is cleared, we deny access to the page
>   * by clearing the hardware PTE.  Currently Linux does not flush the TLB
> - * for us in this case, which means the TLB will retain the transation
> + * for us in this case, which means the TLB will retain the transaction

Don't you mean "translation" rather than "transaction"?

Cheers,
--
Steve

>   * until either the TLB entry is evicted under pressure, or a context
>   * switch which changes the user space mapping occurs.
>   */
> --
> 1.9.3
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] Read-Only THP causes stalls (commit 10359213d)

2015-05-26 Thread Steve Capper
On 26 May 2015 at 15:35, Christoffer Dall  wrote:
> Hi Steve,
>
> On Tue, May 26, 2015 at 03:24:20PM +0100, Steve Capper wrote:
>> >> On Sun, May 24, 2015 at 09:34:04PM +0200, Christoffer Dall wrote:
>> >> > Hi all,
>> >> >
>> >> > I noticed a regression on my arm64 APM X-Gene system a couple
>> >> > of weeks back.  I would occassionally see the system lock up and see RCU
>> >> > stalls during the caching phase of kernbench.  I then wrote a small
>> >> > script that does nothing but cache the files
>> >> > (http://paste.ubuntu.com/11324767/) and ran that in a loop.  On a known
>> >> > bad commit (v4.1-rc2), out of 25 boots, I never saw it get past 21
>> >> > iterations of the loop.  I have since tried to run a bisect from v3.19 
>> >> > to
>> >> > v4.0 using 100 iterations as my criteria for a good commit.
>> >> >
>> >> > This resulted in the following first bad commit:
>> >> >
>> >> > 10359213d05acf804558bda7cc9b8422a828d1cd
>> >> > (mm: incorporate read-only pages into transparent huge pages, 
>> >> > 2015-02-11)
>> >> >
>> >> > Indeed, running the workload on v4.1-rc4 still produced the behavior,
>> >> > but reverting the above commit gets me through 100 iterations of the
>> >> > loop.
>> >> >
>> >> > I have not tried to reproduce on an x86 system.  Turning on a bunch
>> >> > of kernel debugging features *seems* to hide the problem.  My config for
>> >> > the XGene system is defconfig + CONFIG_BRIDGE and
>> >> > CONFIG_POWER_RESET_XGENE.
>> >> >
>> >> > Please let me know if I can help test patches or other things I can
>> >> > do to help.  I'm afraid that by simply reading the patch I didn't see
>> >> > anything obviously wrong with it which would cause this behavior.
>> >>
>> >> As further confirmation, could you try:
>> >>
>> >> echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
>> >
>> > this returns -EINVAL.
>> >
>> > But I'm trying now with:
>> >
>> > echo never > /sys/kernel/mm/transparent_hugepage/enabled
>> >
>> >>
>> >> and verify the problem goes away without having to revert the patch?
>> >
>> > will let you know, so far so good...
>> >
>> >>
>> >> Accordingly you should reproduce much eaiser this way (setting
>> >> $largevalue to 8192 or something, it doesn't matter).
>> >>
>> >> echo $largevalue > 
>> >> /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
>> >> echo 0 > 
>> >> /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
>> >> echo 0 > 
>> >> /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
>> >>
>> >> Then push the system into swap with some memhog -r1000 xG.
>> >
>> > what is memhog?  I couldn't find the utility in Google...
>> >
>> > I did try with the above settings and just push a bunch of data into
>> > ramfs and tmpfs and indeed the sytem died very quickly (on v4.0-rc4).
>> >
>> >>
>> >> The patch just allows readonly anon pages to be collapsed along with
>> >> read-write ones, the vma permissions allows it, so they have to be
>> >> swapcache pages, this is why swap shall be required.
>> >>
>> >> Perhaps there's some arch detail that needs fixing but it'll be easier
>> >> to track it down once you have a way to reproduce fast.
>> >>
>> > Yes, would be great to be able to reproduce quickly.
>> >
>
>> I'm trying to reproduce this on hardware here; but have been unable to
>> thus far with 4.1-rc2 on a Xgene and Seattle systems.
>
> Really?  That's concerning.  I think Andre mentioned he could
> reproduce...
>
> How many iterations have you run the caching loop for?
>
> Are you using defconfig?  I noticed that turning on debugging features
> was hiding the problem.
>
>> Also, I tried the memhog + pages_to_scan suggestion from Andrea.
>
> Any chance you could send me the memhog tool?
>
>>
>> Maybe a silly question, where is your root filesystem located? Is
>> there anything network mounted?
>>
> It's a regular ext4 on the local SATA disk.  Ubuntu Trusty.
>
> Thanks,
> -Christoffer

Sending an email to lakml appears to have been enough to make it hang
on the Xgene :-).
The system is completely frozen, not even the serial port works.

On Seattle, I've hit 100 iterations multiple times without any problems.

Investigating...

Cheers,
--
Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] Read-Only THP causes stalls (commit 10359213d)

2015-05-26 Thread Steve Capper
On 26 May 2015 at 09:08, Christoffer Dall  wrote:
> Hi Andrea,
>
> On Mon, May 25, 2015 at 04:15:25PM +0200, Andrea Arcangeli wrote:
>> Hello Christoffer,
>>
>> On Sun, May 24, 2015 at 09:34:04PM +0200, Christoffer Dall wrote:
>> > Hi all,
>> >
>> > I noticed a regression on my arm64 APM X-Gene system a couple
>> > of weeks back.  I would occassionally see the system lock up and see RCU
>> > stalls during the caching phase of kernbench.  I then wrote a small
>> > script that does nothing but cache the files
>> > (http://paste.ubuntu.com/11324767/) and ran that in a loop.  On a known
>> > bad commit (v4.1-rc2), out of 25 boots, I never saw it get past 21
>> > iterations of the loop.  I have since tried to run a bisect from v3.19 to
>> > v4.0 using 100 iterations as my criteria for a good commit.
>> >
>> > This resulted in the following first bad commit:
>> >
>> > 10359213d05acf804558bda7cc9b8422a828d1cd
>> > (mm: incorporate read-only pages into transparent huge pages, 2015-02-11)
>> >
>> > Indeed, running the workload on v4.1-rc4 still produced the behavior,
>> > but reverting the above commit gets me through 100 iterations of the
>> > loop.
>> >
>> > I have not tried to reproduce on an x86 system.  Turning on a bunch
>> > of kernel debugging features *seems* to hide the problem.  My config for
>> > the XGene system is defconfig + CONFIG_BRIDGE and
>> > CONFIG_POWER_RESET_XGENE.
>> >
>> > Please let me know if I can help test patches or other things I can
>> > do to help.  I'm afraid that by simply reading the patch I didn't see
>> > anything obviously wrong with it which would cause this behavior.
>>
>> As further confirmation, could you try:
>>
>> echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
>
> this returns -EINVAL.
>
> But I'm trying now with:
>
> echo never > /sys/kernel/mm/transparent_hugepage/enabled
>
>>
>> and verify the problem goes away without having to revert the patch?
>
> will let you know, so far so good...
>
>>
>> Accordingly you should reproduce much eaiser this way (setting
>> $largevalue to 8192 or something, it doesn't matter).
>>
>> echo $largevalue > 
>> /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
>> echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
>> echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
>>
>> Then push the system into swap with some memhog -r1000 xG.
>
> what is memhog?  I couldn't find the utility in Google...
>
> I did try with the above settings and just push a bunch of data into
> ramfs and tmpfs and indeed the sytem died very quickly (on v4.0-rc4).
>
>>
>> The patch just allows readonly anon pages to be collapsed along with
>> read-write ones, the vma permissions allows it, so they have to be
>> swapcache pages, this is why swap shall be required.
>>
>> Perhaps there's some arch detail that needs fixing but it'll be easier
>> to track it down once you have a way to reproduce fast.
>>
> Yes, would be great to be able to reproduce quickly.
>
> Thanks,
> -Christoffer
>

Hi Christoffer,
I'm trying to reproduce this on hardware here; but have been unable to
thus far with 4.1-rc2 on a Xgene and Seattle systems.
Also, I tried the memhog + pages_to_scan suggestion from Andrea.

Maybe a silly question, where is your root filesystem located? Is
there anything network mounted?

Cheers,
--
Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
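
(memhog ships with the numactl tools; where it is not to hand, a rough
stand-in that just dirties a given number of GiB of anonymous memory is
usually enough to push a box into swap. This is a hypothetical helper,
much cruder than the real memhog -r1000 xG.)

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	size_t gib = (argc > 1) ? strtoul(argv[1], NULL, 0) : 1;
	size_t len = gib << 30;
	char *p = malloc(len);

	if (!p) {
		perror("malloc");
		return 1;
	}
	/* touch every page so the allocation is actually backed */
	for (size_t off = 0; off < len; off += 4096)
		p[off] = 1;
	printf("dirtied %zu GiB\n", gib);
	pause();	/* keep the memory resident until interrupted */
	return 0;
}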

