Re: [RFC][PATCH] make /proc/pid/pagemap work with huge pages and return page size

2008-02-25 Thread Hans Rosenfeld
On Sat, Feb 23, 2008 at 10:31:01AM -0800, Dave Hansen wrote:
> > > - 4 bits for the page size, with 0 meaning native page size (4k on x86,
> > >   8k on alpha, ...) and values 1-15 being specific to the architecture
> > >   (I used 1 for 2M, 2 for 4M and 3 for 1G for x86)
> 
> "Native page size" probably a bad idea.  ppc64 can use 64k or 4k for its
> "native" page size and has 16MB large pages (as well as some others).
> To make it even more confusing, you can have a 64k kernel page size with
> 4k mmu mappings!
> 
> That said, this is a decent idea as long as we know that nobody will
> ever have more than 16 page sizes.  

Then a better way to encode the page size would be to return the page
shift. This would need 6 bits instead of 4, but it would probably be
enough for any 64-bit architecture.
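
Just to illustrate what I mean (a sketch with made-up names, not part
of any patch), the pseudo-pte would then look like this:

struct ppte_shift {
	uint64_t paddr:56;
	uint64_t pshift:6;	/* log2 of the page size: 12 = 4k, 21 = 2M, 30 = 1G */
	uint64_t swap:1;
	uint64_t present:1;
};

static inline uint64_t ppte_page_size(struct ppte_shift p)
{
	/* recovers any power-of-two page size */
	return 1ULL << p.pshift;
}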


> > This is ok-ish, but I can't say I like it much. Especially the page size
> > field.
> > 
> > But I don't really have many ideas here. Perhaps having a bit saying
> > "this entry is really a continuation of the previous one". Then any page
> > size can be trivially represented. This might also make the code on both
> > sides simpler?

I don't like the idea of parsing thousands of entries just to find out
that I'm using a huge page. It would be much better to get the page
size, one way or another, directly from the first entry one reads.


> > > -static int add_to_pagemap(unsigned long addr, u64 pfn,
> > > +struct ppte {
> > > + uint64_t paddr:58;
> > > + uint64_t psize:4;
> > > + uint64_t swap:1;
> > > + uint64_t present:1;
> > > +};
> 
> It'd be nice to keep the current convention, which is to stay away from
> bitfields.

I like them; they make the code much more readable.
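
For comparison, the mask-and-shift style that the current convention
would call for looks roughly like this (illustrative names of my own;
the masks match the bitfield layout above on a little-endian machine,
where paddr occupies the low bits and present the top bit):

#include <stdint.h>

#define PPTE_PRESENT		(1ULL << 63)
#define PPTE_SWAP		(1ULL << 62)
#define PPTE_PSIZE_SHIFT	58
#define PPTE_PSIZE_MASK		(0xFULL << PPTE_PSIZE_SHIFT)
#define PPTE_PADDR_MASK		((1ULL << PPTE_PSIZE_SHIFT) - 1)

static inline int ppte_present(uint64_t e)
{
	return !!(e & PPTE_PRESENT);		/* top bit */
}

static inline unsigned int ppte_psize(uint64_t e)
{
	return (e & PPTE_PSIZE_MASK) >> PPTE_PSIZE_SHIFT;
}

static inline uint64_t ppte_paddr(uint64_t e)
{
	return e & PPTE_PADDR_MASK;		/* low 58 bits */
}

To my eye the bitfield version says the same thing with less noise.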


> > > +#ifdef CONFIG_X86
> > > + if (pmd_huge(*pmd)) {
> > > + struct ppte ppte = { 
> > > + .paddr = pmd_pfn(*pmd) << PAGE_SHIFT,
> > > + .psize = (HPAGE_SHIFT == 22 ?
> > > +   PM_PSIZE_4M : PM_PSIZE_2M),
> > > + .swap  = 0,
> > > + .present = 1,
> > > + };
> > > +
> > > + for(; addr != end; addr += PAGE_SIZE) {
> > > + err = add_to_pagemap(addr, ppte, pm);
> > > + if (err)
> > > + return err;
> > > + }
> > > + } else
> > > +#endif
> 
> It's great to make this support huge pages, but things like this
> probably need their own function.  Putting an #ifdef in the middle of
> here makes it a lot harder to read.  Just think of when powerpc, ia64
> and x86_64 get their grubby mitts in here. ;)

AFAIK the way huge pages are used on x86 differs a lot from other
architectures. While on x86 the address translation stops at some
upper-level table for a huge page, other architectures encode the page
size in the pte (at least the ones I looked at do). So pmd_huge() (and
soon pud_huge()) are very x86-specific; they just return 0 on other
archs, and this code would be optimized away for them. All that would
be necessary for other archs is to eventually get the page size from
the pte and put it in the psize field.

The #ifdef could go away if pmd_pfn() were defined as 0 for !x86; it
wouldn't make sense to use it there anyway.
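
If the readability concern remains, the huge-pmd case could also be
split into its own helper, along the lines Dave suggests. A rough
sketch (assuming the pmd_pfn()-is-0 stub above, so the compiler can
discard the whole loop on architectures without pmd-level huge pages):

static int pagemap_hugepmd_range(pmd_t *pmd, unsigned long addr,
				 unsigned long end, struct pagemapread *pm)
{
	struct ppte ppte = {
		.paddr	 = pmd_pfn(*pmd) << PAGE_SHIFT,
		.psize	 = (HPAGE_SHIFT == 22 ?
			    PM_PSIZE_4M : PM_PSIZE_2M),
		.swap	 = 0,
		.present = 1,
	};
	int err = 0;

	/* replicate the huge page entry for every small page it covers */
	for (; addr != end; addr += PAGE_SIZE) {
		err = add_to_pagemap(addr, ppte, pm);
		if (err)
			break;
	}
	return err;
}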



-- 
%SYSTEM-F-ANARCHISM, The operating system has been overthrown


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC][PATCH] make /proc/pid/pagemap work with huge pages and return page size

2008-02-20 Thread Hans Rosenfeld
The current code for /proc/pid/pagemap does not work with huge pages (on
x86). It makes no distinction between a normal pmd and a huge page pmd,
and tries to parse the contents of the huge page as ptes. Another
problem is that there is no way to find out what page size a specific
mapping uses.

Also, the current way the "not present" and "swap" bits are encoded in
the returned pfn isn't very clean, especially if this interface is
going to be extended.

I propose to change /proc/pid/pagemap to return a pseudo-pte instead of
just a raw pfn. The pseudo-pte will contain:

- 58 bits for the physical address of the first byte in the page; even
  fewer bits would probably be sufficient for quite a while

- 4 bits for the page size, with 0 meaning native page size (4k on x86,
  8k on alpha, ...) and values 1-15 being specific to the architecture
  (I used 1 for 2M, 2 for 4M and 3 for 1G for x86)

- a "swap" bit indicating that a not present page is paged out, with the
  physical address field containing page file number and block number
  just like before

- a "present" bit just like in a real pte
  
By shortening the field for the physical address, some more interesting
information could be included, like read/write permissions and the like.
The page size could also be returned directly: 6 bits would suffice to
express any page shift on a 64-bit system, but I found the encoded page
size more useful for my specific use case.


The attached patch changes the /proc/pid/pagemap code to use such a
pseudo-pte. The huge page handling is currently limited to 2M/4M pages
on x86; 1G pages will need some more work. To keep the simple mapping of
virtual addresses to the file index intact, any huge page pseudo-pte is
replicated in the user buffer to map the equivalent range of small
pages.
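
To show how a consumer would use this (a minimal userspace sketch, not
part of the patch; it assumes the bitfield layout above and a 4k base
page size):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

struct ppte {
	uint64_t paddr:58;
	uint64_t psize:4;
	uint64_t swap:1;
	uint64_t present:1;
};

int main(void)
{
	unsigned long vaddr = 0x400000;		/* some mapped address */
	struct ppte e;
	int fd = open("/proc/self/pagemap", O_RDONLY);

	if (fd < 0)
		return 1;
	/* one fixed-size entry per small page, indexed by virtual page number */
	if (pread(fd, &e, sizeof(e), (vaddr / 4096) * sizeof(e)) != sizeof(e))
		return 1;
	printf("present=%d swap=%d psize=%d paddr=0x%llx\n",
	       (int)e.present, (int)e.swap, (int)e.psize,
	       (unsigned long long)e.paddr);
	close(fd);
	return 0;
}

Because huge page entries are replicated, this lookup works the same
whether vaddr is backed by a small or a huge page; only the psize
field differs.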

Note that I had to move the pmd_pfn() macro from asm-x86/pgtable_64.h to
asm-x86/pgtable.h; it applies to both 32-bit and 64-bit x86.

Other architectures will probably need other changes to support huge
pages and return the page size.

I think the definition of the pseudo-pte structure and the page size
codes should be made available through a header file, but I haven't
done that yet.

Signed-off-by: Hans Rosenfeld <[EMAIL PROTECTED]>

---
 fs/proc/task_mmu.c           |   68 ++++++++++++++++++++++++++-----------
 include/asm-x86/pgtable.h    |    2 +
 include/asm-x86/pgtable_64.h |    1 -
 3 files changed, 50 insertions(+), 21 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 49958cf..58af588 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -527,16 +527,23 @@ struct pagemapread {
 	char __user *out, *end;
 };
 
-#define PM_ENTRY_BYTES sizeof(u64)
-#define PM_RESERVED_BITS    3
-#define PM_RESERVED_OFFSET  (64 - PM_RESERVED_BITS)
-#define PM_RESERVED_MASK    (((1LL<<PM_RESERVED_BITS)-1) << PM_RESERVED_OFFSET)
-#define PM_SPECIAL(nr)      (((nr) << PM_RESERVED_OFFSET) | PM_RESERVED_MASK)
-#define PM_NOT_PRESENT      PM_SPECIAL(1LL)
-#define PM_SWAP             PM_SPECIAL(2LL)
-#define PM_END_OF_BUFFER    1
+#define PM_ENTRY_BYTES      sizeof(struct ppte)
+#define PM_END_OF_BUFFER    1
+
+#define PM_PSIZE_2M         1
+#define PM_PSIZE_4M         2
+#define PM_PSIZE_1G         3
+
+struct ppte {
+	uint64_t paddr:58;
+	uint64_t psize:4;
+	uint64_t swap:1;
+	uint64_t present:1;
+};
 
-static int add_to_pagemap(unsigned long addr, u64 pfn,
+static int add_to_pagemap(unsigned long addr, struct ppte ppte,
 			  struct pagemapread *pm)
 {
 	if (pm->out + PM_ENTRY_BYTES >= pm->end) {
-		if (copy_to_user(pm->out, &pfn, pm->end - pm->out))
+		if (copy_to_user(pm->out, &ppte, pm->end - pm->out))
 			return -EFAULT;
 		pm->out = pm->end;
 		return PM_END_OF_BUFFER;
 	}
 
-	if (put_user(pfn, pm->out))
+	if (copy_to_user(pm->out, &ppte, sizeof(ppte)))
 		return -EFAULT;
 	pm->out += PM_ENTRY_BYTES;
 	return 0;
@@ -564,7 +571,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
 	unsigned long addr;
 	int err = 0;
 	for (addr = start; addr < end; addr += PAGE_SIZE) {
-		err = add_to_pagemap(addr, PM_NOT_PRESENT, pm);
+		err = add_to_pagemap(addr, (struct ppte) {0, 0, 0, 0}, pm);
 		if (err)
 			break;
 	}
@@ -574,7 +581,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
 u64 swap_pte_to_pagemap_entry(pte_t pte)
 {
 	swp_entry_t e = pte_to_swp_entry(pte);
-	return PM_SWAP | swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
+	return swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
 }
 
 static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
@@ -584,16 +591,37 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	pte_t *pte;
 	int err = 0;
 
+#ifdef CONFIG_X86
+	if (pmd_huge(*pmd)) {
+		struct ppte ppte = {
+			.paddr = pmd_pfn(*pmd) << PAGE_SHIFT,
+			.psize = (HPAGE_SHIFT == 22 ?
+				  PM_PSIZE_4M : PM_PSIZE_2M),
+			.swap  = 0,
+			.present = 1,
+		};
+
+		for(; addr != end; addr += PAGE_SIZE) {
+			err = add_to_pagemap(addr, ppte, pm);
+			if (err)
+				return err;
[PATCH] x86: fix pmd_bad and pud_bad to support huge pages

2008-02-18 Thread Hans Rosenfeld
I recently stumbled upon a problem in the support for huge pages: if a
program using huge pages does not explicitly unmap them, they remain
mapped (and are therefore lost) after the program exits.

I observed that the free huge page count in /proc/meminfo decreased when
running my program, and it did not increase after the program exited.
After running the program a few times, no more huge pages could be
allocated.

The reason for this seems to be that the x86 pmd_bad and pud_bad
consider pmd/pud entries that have the PSE bit set to be invalid. I
think there is nothing wrong with this bit being set; it just indicates
that the lowest level of translation has been reached. This bit has to
be (and is) checked after the basic validity of the entry has been
checked, as in this fragment from follow_page() in mm/memory.c:

	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
		goto no_page_table;

	if (pmd_huge(*pmd)) {
		BUG_ON(flags & FOLL_GET);
		page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
		goto out;
	}

Note that this code currently doesn't work as intended if the pmd
refers to a huge page: pmd_bad() flags the huge page pmd as bad, so the
pmd_huge() check can never be reached.

Extending pmd_bad() (and, for future 1GB page support, pud_bad()) to
allow the PSE bit to be set fixes this. For similar reasons, allowing
the NX bit to be set is necessary, too. I have seen huge pages with the
NX bit set in their pmd entry, which would cause the same problem.
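
To illustrate the effect (a userspace sketch; the bit positions are the
x86 ones, but PTE_MASK is abbreviated and the numbers are for
illustration only):

#include <stdint.h>
#include <stdio.h>

#define _PAGE_PRESENT	(1ULL << 0)
#define _PAGE_RW	(1ULL << 1)
#define _PAGE_USER	(1ULL << 2)
#define _PAGE_ACCESSED	(1ULL << 5)
#define _PAGE_DIRTY	(1ULL << 6)
#define _PAGE_PSE	(1ULL << 7)
#define _PAGE_NX	(1ULL << 63)
#define _KERNPG_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)
#define PTE_MASK	0x000ffffffffff000ULL	/* physical address bits */

static uint64_t pmd_bad_old(uint64_t pmd)
{
	return pmd & ~(PTE_MASK | _KERNPG_TABLE | _PAGE_USER);
}

static uint64_t pmd_bad_new(uint64_t pmd)
{
	return pmd & ~(PTE_MASK | _KERNPG_TABLE | _PAGE_USER |
		       _PAGE_PSE | _PAGE_NX);
}

int main(void)
{
	/* a typical 2M huge page mapping: valid kernel bits plus PSE */
	uint64_t huge = 0x200000ULL | _KERNPG_TABLE | _PAGE_PSE;

	/* prints "old: bad, new: ok" */
	printf("old: %s, new: %s\n",
	       pmd_bad_old(huge) ? "bad" : "ok",
	       pmd_bad_new(huge) ? "bad" : "ok");
	return 0;
}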




Signed-off-by: Hans Rosenfeld <[EMAIL PROTECTED]>

---
 include/asm-x86/pgtable_32.h |    4 +++-
 include/asm-x86/pgtable_64.h |    6 ++++--
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/asm-x86/pgtable_32.h b/include/asm-x86/pgtable_32.h
index a842c72..b478efa 100644
--- a/include/asm-x86/pgtable_32.h
+++ b/include/asm-x86/pgtable_32.h
@@ -91,7 +91,9 @@ extern unsigned long pg0[];
 /* To avoid harmful races, pmd_none(x) should check only the lower when PAE */
 #define pmd_none(x)	(!(unsigned long)pmd_val(x))
 #define pmd_present(x)	(pmd_val(x) & _PAGE_PRESENT)
-#define	pmd_bad(x)	((pmd_val(x) & (~PAGE_MASK & ~_PAGE_USER)) != _KERNPG_TABLE)
+#define	pmd_bad(x)	((pmd_val(x) \
+			  & ~(PAGE_MASK | _PAGE_USER | _PAGE_PSE | _PAGE_NX)) \
+			 != _KERNPG_TABLE)
 
 
 #define pages_to_mb(x) ((x) >> (20-PAGE_SHIFT))
diff --git a/include/asm-x86/pgtable_64.h b/include/asm-x86/pgtable_64.h
index bd4740a..02bd4aa 100644
--- a/include/asm-x86/pgtable_64.h
+++ b/include/asm-x86/pgtable_64.h
@@ -153,12 +153,14 @@ static inline unsigned long pgd_bad(pgd_t pgd)
 
 static inline unsigned long pud_bad(pud_t pud)
 {
-	return pud_val(pud) & ~(PTE_MASK | _KERNPG_TABLE | _PAGE_USER);
+	return pud_val(pud) &
+		~(PTE_MASK | _KERNPG_TABLE | _PAGE_USER | _PAGE_PSE | _PAGE_NX);
 }
 
 static inline unsigned long pmd_bad(pmd_t pmd)
 {
-	return pmd_val(pmd) & ~(PTE_MASK | _KERNPG_TABLE | _PAGE_USER);
+	return pmd_val(pmd) &
+		~(PTE_MASK | _KERNPG_TABLE | _PAGE_USER | _PAGE_PSE | _PAGE_NX);
 }
 
 #define pte_none(x)	(!pte_val(x))
-- 
1.5.3.7

-- 
%SYSTEM-F-ANARCHISM, The operating system has been overthrown


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] /proc/pid/pagemap: fix PM_SPECIAL macro

2008-02-11 Thread Hans Rosenfeld
There seems to be a bug in the PM_SPECIAL macro for /proc/pid/pagemap.
I think masking out those other bits makes more sense than setting all
those mask bits.
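
A quick demonstration of what goes wrong, as I read the macro (a
userspace sketch; the PM_RESERVED_* macros are as in the kernel, the
_OR/_AND variants are my renamings of the old and fixed PM_SPECIAL):
with |, every "special" value collapses to the same bit pattern, so
PM_NOT_PRESENT and PM_SWAP become indistinguishable.

#include <stdio.h>

#define PM_RESERVED_BITS    3
#define PM_RESERVED_OFFSET  (64 - PM_RESERVED_BITS)
#define PM_RESERVED_MASK    (((1LL<<PM_RESERVED_BITS)-1) << PM_RESERVED_OFFSET)

/* current macro: OR unconditionally sets all three reserved bits */
#define PM_SPECIAL_OR(nr)   (((nr) << PM_RESERVED_OFFSET) | PM_RESERVED_MASK)
/* proposed fix: AND keeps only nr's bits within the reserved field */
#define PM_SPECIAL_AND(nr)  (((nr) << PM_RESERVED_OFFSET) & PM_RESERVED_MASK)

int main(void)
{
	/* prints e000000000000000 twice: both special values identical */
	printf("%llx %llx\n",
	       (unsigned long long)PM_SPECIAL_OR(1LL),
	       (unsigned long long)PM_SPECIAL_OR(2LL));
	/* prints 2000000000000000 4000000000000000: distinguishable again */
	printf("%llx %llx\n",
	       (unsigned long long)PM_SPECIAL_AND(1LL),
	       (unsigned long long)PM_SPECIAL_AND(2LL));
	return 0;
}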


Signed-off-by: Hans Rosenfeld <[EMAIL PROTECTED]>

---
 fs/proc/task_mmu.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ae4d3f2..2e8a6fa 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -531,7 +531,7 @@ struct pagemapread {
 #define PM_RESERVED_BITS    3
 #define PM_RESERVED_OFFSET  (64 - PM_RESERVED_BITS)
 #define PM_RESERVED_MASK    (((1LL<<PM_RESERVED_BITS)-1) << PM_RESERVED_OFFSET)
-#define PM_SPECIAL(nr)      (((nr) << PM_RESERVED_OFFSET) | PM_RESERVED_MASK)
+#define PM_SPECIAL(nr)      (((nr) << PM_RESERVED_OFFSET) & PM_RESERVED_MASK)
 #define PM_NOT_PRESENT      PM_SPECIAL(1LL)
 #define PM_SWAP             PM_SPECIAL(2LL)
 #define PM_END_OF_BUFFER    1

-- 
%SYSTEM-F-ANARCHISM, The operating system has been overthrown


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/