Re: [RFC][PATCH] make /proc/pid/pagemap work with huge pages and return page size

2008-02-25 Thread Dave Hansen

On Mon, 2008-02-25 at 13:09 +0100, Hans Rosenfeld wrote:
> On Sat, Feb 23, 2008 at 10:31:01AM -0800, Dave Hansen wrote:
> > > > - 4 bits for the page size, with 0 meaning native page size (4k on x86,
> > > >   8k on alpha, ...) and values 1-15 being specific to the architecture
> > > >   (I used 1 for 2M, 2 for 4M and 3 for 1G for x86)
> > 
> > "Native page size" probably a bad idea.  ppc64 can use 64k or 4k for its
> > "native" page size and has 16MB large pages (as well as some others).
> > To make it even more confusing, you can have a 64k kernel page size with
> > 4k mmu mappings!
> > 
> > That said, this is a decent idea as long as we know that nobody will
> > ever have more than 16 page sizes.  
> 
> Then a better way to encode the page size would be returning the page
> shift. This would need 6 bits instead of 4, but it would probably be
> enough for any 64 bit architecture.

That's a good point.

> > > This is ok-ish, but I can't say I like it much. Especially the page size
> > > field.
> > > 
> > > But I don't really have many ideas here. Perhaps having a bit saying
> > > "this entry is really a continuation of the previous one". Then any page
> > > size can be trivially represented. This might also make the code on both
> > > sides simpler?
> 
> I don't like the idea of parsing thousands of entries just to find out
> that I'm using a huge page. It would be much better to just get the page
> size one way or the other in the first entry one reads.

Did you read my suggestion?  We use one bit in the pte to specify that
it's a large page mapping, then specify a mask to apply to the address to
get the *first* mapping of the large page, where you'll find the actual
physical address.  That keeps us from having to worry about specifying
*both* the page size and the pfn in the same pte.

> > > > +#ifdef CONFIG_X86
> > > > +   if (pmd_huge(*pmd)) {
> > > > +   struct ppte ppte = { 
> > > > +   .paddr = pmd_pfn(*pmd) << PAGE_SHIFT,
> > > > +   .psize = (HPAGE_SHIFT == 22 ?
> > > > + PM_PSIZE_4M : PM_PSIZE_2M),
> > > > +   .swap  = 0,
> > > > +   .present = 1,
> > > > +   };
> > > > +
> > > > +   for(; addr != end; addr += PAGE_SIZE) {
> > > > +   err = add_to_pagemap(addr, ppte, pm);
> > > > +   if (err)
> > > > +   return err;
> > > > +   }
> > > > +   } else
> > > > +#endif
> > 
> > It's great to make this support huge pages, but things like this
> > probably need their own function.  Putting an #ifdef in the middle of
> > here makes it a lot harder to read.  Just think of when powerpc, ia64
> > and x86_64 get their grubby mitts in here. ;)
> 
> AFAIK the way huge pages are used on x86 differs considerably from other
> architectures. While on x86 the address translation stops at some upper
> table for a huge page, other architectures encode the page size in the
> pte (at least the ones I looked at did). So pmd_huge() (and soon
> pud_huge()) are very x86-specific and return just 0 on other archs,

I'm just asking that you please put this in a nice helper function to
hide it from the poor powerpc/ia64/mips... guys that don't want to see
x86 code cluttering up otherwise generic functions.  Yes, the compiler
optimizes it away, but it still has a cost to my eyeballs. :)

I eagerly await your next patch!

-- Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH] make /proc/pid/pagemap work with huge pages and return page size

2008-02-25 Thread Hans Rosenfeld
On Sat, Feb 23, 2008 at 10:31:01AM -0800, Dave Hansen wrote:
> > > - 4 bits for the page size, with 0 meaning native page size (4k on x86,
> > >   8k on alpha, ...) and values 1-15 being specific to the architecture
> > >   (I used 1 for 2M, 2 for 4M and 3 for 1G for x86)
> 
> "Native page size" probably a bad idea.  ppc64 can use 64k or 4k for its
> "native" page size and has 16MB large pages (as well as some others).
> To make it even more confusing, you can have a 64k kernel page size with
> 4k mmu mappings!
> 
> That said, this is a decent idea as long as we know that nobody will
> ever have more than 16 page sizes.  

Then a better way to encode the page size would be returning the page
shift. This would need 6 bits instead of 4, but it would probably be
enough for any 64 bit architecture.


> > This is ok-ish, but I can't say I like it much. Especially the page size
> > field.
> > 
> > But I don't really have many ideas here. Perhaps having a bit saying
> > "this entry is really a continuation of the previous one". Then any page
> > size can be trivially represented. This might also make the code on both
> > sides simpler?

I don't like the idea of parsing thousands of entries just to find out
that I'm using a huge page. It would be much better to just get the page
size one way or the other in the first entry one reads.


> > > -static int add_to_pagemap(unsigned long addr, u64 pfn,
> > > +struct ppte {
> > > + uint64_t paddr:58;
> > > + uint64_t psize:4;
> > > + uint64_t swap:1;
> > > + uint64_t present:1;
> > > +};
> 
> It'd be nice to keep the current convention, which is to stay away from
> bitfields.

I like them; they make the code much more readable.


> > > +#ifdef CONFIG_X86
> > > + if (pmd_huge(*pmd)) {
> > > + struct ppte ppte = { 
> > > + .paddr = pmd_pfn(*pmd) << PAGE_SHIFT,
> > > + .psize = (HPAGE_SHIFT == 22 ?
> > > +   PM_PSIZE_4M : PM_PSIZE_2M),
> > > + .swap  = 0,
> > > + .present = 1,
> > > + };
> > > +
> > > + for(; addr != end; addr += PAGE_SIZE) {
> > > + err = add_to_pagemap(addr, ppte, pm);
> > > + if (err)
> > > + return err;
> > > + }
> > > + } else
> > > +#endif
> 
> It's great to make this support huge pages, but things like this
> probably need their own function.  Putting an #ifdef in the middle of
> here makes it a lot harder to read.  Just think of when powerpc, ia64
> and x86_64 get their grubby mitts in here. ;)

AFAIK the way huge pages are used on x86 differs considerably from other
architectures. While on x86 the address translation stops at some upper
table for a huge page, other architectures encode the page size in the
pte (at least the ones I looked at did). So pmd_huge() (and soon
pud_huge()) are very x86-specific and return just 0 on other archs, and
this code would be optimized away for them. All that would be necessary
for other archs is to eventually get the page size from the pte and put
it in the psize field.

The #ifdef could go away if pmd_pfn() was defined as 0 for !x86; it
wouldn't make sense to use it anyway.



-- 
%SYSTEM-F-ANARCHISM, The operating system has been overthrown





Re: [RFC][PATCH] make /proc/pid/pagemap work with huge pages and return page size

2008-02-23 Thread Dave Hansen
On Sat, 2008-02-23 at 10:18 +0800, Matt Mackall wrote:
> Another
> > problem is that there is no way to get information about the page size a
> > specific mapping uses.

Is this true generically, or just with pagemap?  It seems like we should
have a way to tell that a particular mapping is of large pages.  I'm
cc'ing a few folks who might know.

> > Also, the current way the "not present" and "swap" bits are encoded in
> > the returned pfn isn't very clean, especially not if this interface is
> > going to be extended.
> 
> Fair.

Yup.

> > I propose to change /proc/pid/pagemap to return a pseudo-pte instead of
> > just a raw pfn. The pseudo-pte will contain:
> > 
> > - 58 bits for the physical address of the first byte in the page, even
> >   less bits would probably be sufficient for quite a while

Well, whether we use a physical address of the first byte of the page or
a pfn doesn't really matter.  It just boils down to whether we use low
or high bits for the magic. :)

> > - 4 bits for the page size, with 0 meaning native page size (4k on x86,
> >   8k on alpha, ...) and values 1-15 being specific to the architecture
> >   (I used 1 for 2M, 2 for 4M and 3 for 1G for x86)

"Native page size" probably a bad idea.  ppc64 can use 64k or 4k for its
"native" page size and has 16MB large pages (as well as some others).
To make it even more confusing, you can have a 64k kernel page size with
4k mmu mappings!

That said, this is a decent idea as long as we know that nobody will
ever have more than 16 page sizes.  

> > - a "swap" bit indicating that a not present page is paged out, with the
> >   physical address field containing page file number and block number
> >   just like before
> > 
> > - a "present" bit just like in a real pte
> 
> This is ok-ish, but I can't say I like it much. Especially the page size
> field.
> 
> But I don't really have many ideas here. Perhaps having a bit saying
> "this entry is really a continuation of the previous one". Then any page
> size can be trivially represented. This might also make the code on both
> sides simpler?

Yeah, it could just be a special flag plus a mask or offset showing how
many entries to back up to find the actual mapping.  If each huge page
entry just had something along the lines of:

PAGEMAP_HUGE_PAGE_BIT | HPAGE_MASK

You can see its a huge mapping from the bit, and you can go find the
physical page by applying HPAGE_MASK to your current position in the
pagemap.

> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index 49958cf..58af588 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -527,16 +527,23 @@ struct pagemapread {
> > char __user *out, *end;
> >  };
> >  
> > -#define PM_ENTRY_BYTES sizeof(u64)
> > -#define PM_RESERVED_BITS    3
> > -#define PM_RESERVED_OFFSET  (64 - PM_RESERVED_BITS)
> > -#define PM_RESERVED_MASK    (((1LL<<PM_RESERVED_BITS)-1) << PM_RESERVED_OFFSET)
> > -#define PM_SPECIAL(nr)  (((nr) << PM_RESERVED_OFFSET) | PM_RESERVED_MASK)
> > -#define PM_NOT_PRESENT  PM_SPECIAL(1LL)
> > -#define PM_SWAP PM_SPECIAL(2LL)
> > -#define PM_END_OF_BUFFER 1
> > -
> > -static int add_to_pagemap(unsigned long addr, u64 pfn,
> > +struct ppte {
> > +   uint64_t paddr:58;
> > +   uint64_t psize:4;
> > +   uint64_t swap:1;
> > +   uint64_t present:1;
> > +};

It'd be nice to keep the current convention, which is to stay away from
bitfields.

> > +#ifdef CONFIG_X86
> > +#define PM_PSIZE_1G  3
> > +#define PM_PSIZE_4M  2
> > +#define PM_PSIZE_2M  1
> > +#endif

I do think this may get goofy in the future, especially for those
architectures which don't have page sizes tied to Linux pagetables.
Tomorrow, you might end up with:

> > +#ifdef CONFIG_FUNNYARCH
> > +#define PM_PSIZE_64M 4 
> > +#define PM_PSIZE_1G  3
> > +#define PM_PSIZE_4M  2
> > +#define PM_PSIZE_2M  1
> > +#endif

> > +#define PM_ENTRY_BYTES   sizeof(struct ppte)
> > +#define PM_END_OF_BUFFER 1
> > +
> > +static int add_to_pagemap(unsigned long addr, struct ppte ppte,
> >   struct pagemapread *pm)
> >  {
> > /*
> > @@ -545,13 +552,13 @@ static int add_to_pagemap(unsigned long addr, u64 pfn,
> >  * the pfn.
> >  */
> > if (pm->out + PM_ENTRY_BYTES >= pm->end) {
> > -   if (copy_to_user(pm->out, &pfn, pm->end - pm->out))
> > +   if (copy_to_user(pm->out, &ppte, pm->end - pm->out))
> > return -EFAULT;
> > pm->out = pm->end;
> > return PM_END_OF_BUFFER;
> > }
> >  
> > -   if (put_user(pfn, pm->out))
> > +   if (copy_to_user(pm->out, &ppte, sizeof(ppte)))
> > return -EFAULT;
> > pm->out += PM_ENTRY_BYTES;
> > return 0;
> > @@ -564,7 +571,7 @@ static int pagemap_pte_hole(unsigned long start, 
> > unsigned long end,
> > unsigned long addr;
> > int err = 0;
> > for (addr = start; addr < end; addr += PAGE_SIZE) {
> > -   err = add_to_pagemap(addr, PM_NOT_PRESENT, pm);
> > +   err = add_to_pagemap(addr, (struct ppte) {0, 0, 0, 0}, pm);

Re: [RFC][PATCH] make /proc/pid/pagemap work with huge pages and return page size

2008-02-23 Thread Matt Mackall

On Sat, 2008-02-23 at 00:06 -0800, Andrew Morton wrote:
> On Wed, 20 Feb 2008 14:57:43 +0100 "Hans Rosenfeld" <[EMAIL PROTECTED]> wrote:
> 
> > The current code for /proc/pid/pagemap does not work with huge pages (on
> > x86). The code will make no difference between a normal pmd and a huge
> > page pmd, trying to parse the contents of the huge page as ptes. Another
> > problem is that there is no way to get information about the page size a
> > specific mapping uses.
> > 
> > Also, the current way the "not present" and "swap" bits are encoded in
> > the returned pfn isn't very clean, especially not if this interface is
> > going to be extended.
> > 
> > I propose to change /proc/pid/pagemap to return a pseudo-pte instead of
> > just a raw pfn. The pseudo-pte will contain:
> > 
> > - 58 bits for the physical address of the first byte in the page, even
> >   less bits would probably be sufficient for quite a while
> > 
> > - 4 bits for the page size, with 0 meaning native page size (4k on x86,
> >   8k on alpha, ...) and values 1-15 being specific to the architecture
> >   (I used 1 for 2M, 2 for 4M and 3 for 1G for x86)
> > 
> > - a "swap" bit indicating that a not present page is paged out, with the
> >   physical address field containing page file number and block number
> >   just like before
> > 
> > - a "present" bit just like in a real pte
> >   
> > By shortening the field for the physical address, some more interesting
> > information could be included, like read/write permissions and the like.
> > The page size could also be returned directly, 6 bits could be used to
> > express any page shift in a 64 bit system, but I found the encoded page
> > size more useful for my specific use case.
> > 
> > 
> > The attached patch changes the /proc/pid/pagemap code to use such a
> > pseudo-pte. The huge page handling is currently limited to 2M/4M pages
> > on x86, 1G pages will need some more work. To keep the simple mapping of
> > virtual addresses to file index intact, any huge page pseudo-pte is
> > replicated in the user buffer to map the equivalent range of small
> > pages. 
> > 
> > Note that I had to move the pmd_pfn() macro from asm-x86/pgtable_64.h to
> > asm-x86/pgtable.h, it applies to both 32 bit and 64 bit x86.
> > 
> > Other architectures will probably need other changes to support huge
> > pages and return the page size.
> > 
> > I think that the definition of the pseudo-pte structure and the page
> > size codes should be made available through a header file, but I didn't
> > do this for now.
> > 
> 
> If we're going to do this, we need to do it *fast*.  Once 2.6.25 goes out
> our hands are tied.
> 
> That means talking with the maintainers of other hugepage-capable
> architectures.
> 
> > +struct ppte {
> > +   uint64_t paddr:58;
> > +   uint64_t psize:4;
> > +   uint64_t swap:1;
> > +   uint64_t present:1;
> > +};
> 
> This is part of the exported kernel interface and hence should be in a
> header somewhere, shouldn't it?  The old stuff should have been too.

I think we're better off not using bitfields here.

> u64 is a bit more conventional than uint64_t, and if we move this to a
> userspace-visible header then __u64 is the type to use, I think.  Although
> one would expect uint64_t to be OK as well.
> 
> > +#ifdef CONFIG_X86
> > +#define PM_PSIZE_1G  3
> > +#define PM_PSIZE_4M  2
> > +#define PM_PSIZE_2M  1
> > +#endif
> 
> No, we should factor this correctly and get the CONFIG_X86 stuff out of here.

Perhaps my "continuation bit" idea.

> Matt?  Help?

Did my previous message make it out? This is probably my last message
for 24+ hours.

-- 
Mathematics is the supreme nostalgia of our time.



Re: [RFC][PATCH] make /proc/pid/pagemap work with huge pages and return page size

2008-02-23 Thread Andrew Morton
On Wed, 20 Feb 2008 14:57:43 +0100 "Hans Rosenfeld" <[EMAIL PROTECTED]> wrote:

> The current code for /proc/pid/pagemap does not work with huge pages (on
> x86). The code will make no difference between a normal pmd and a huge
> page pmd, trying to parse the contents of the huge page as ptes. Another
> problem is that there is no way to get information about the page size a
> specific mapping uses.
> 
> Also, the current way the "not present" and "swap" bits are encoded in
> the returned pfn isn't very clean, especially not if this interface is
> going to be extended.
> 
> I propose to change /proc/pid/pagemap to return a pseudo-pte instead of
> just a raw pfn. The pseudo-pte will contain:
> 
> - 58 bits for the physical address of the first byte in the page, even
>   less bits would probably be sufficient for quite a while
> 
> - 4 bits for the page size, with 0 meaning native page size (4k on x86,
>   8k on alpha, ...) and values 1-15 being specific to the architecture
>   (I used 1 for 2M, 2 for 4M and 3 for 1G for x86)
> 
> - a "swap" bit indicating that a not present page is paged out, with the
>   physical address field containing page file number and block number
>   just like before
> 
> - a "present" bit just like in a real pte
>   
> By shortening the field for the physical address, some more interesting
> information could be included, like read/write permissions and the like.
> The page size could also be returned directly, 6 bits could be used to
> express any page shift in a 64 bit system, but I found the encoded page
> size more useful for my specific use case.
> 
> 
> The attached patch changes the /proc/pid/pagemap code to use such a
> pseudo-pte. The huge page handling is currently limited to 2M/4M pages
> on x86, 1G pages will need some more work. To keep the simple mapping of
> virtual addresses to file index intact, any huge page pseudo-pte is
> replicated in the user buffer to map the equivalent range of small
> pages. 
> 
> Note that I had to move the pmd_pfn() macro from asm-x86/pgtable_64.h to
> asm-x86/pgtable.h, it applies to both 32 bit and 64 bit x86.
> 
> Other architectures will probably need other changes to support huge
> pages and return the page size.
> 
> I think that the definition of the pseudo-pte structure and the page
> size codes should be made available through a header file, but I didn't
> do this for now.
> 

If we're going to do this, we need to do it *fast*.  Once 2.6.25 goes out
our hands are tied.

That means talking with the maintainers of other hugepage-capable
architectures.

> +struct ppte {
> + uint64_t paddr:58;
> + uint64_t psize:4;
> + uint64_t swap:1;
> + uint64_t present:1;
> +};

This is part of the exported kernel interface and hence should be in a
header somewhere, shouldn't it?  The old stuff should have been too.

u64 is a bit more conventional than uint64_t, and if we move this to a
userspace-visible header then __u64 is the type to use, I think.  Although
one would expect uint64_t to be OK as well.

> +#ifdef CONFIG_X86
> +#define PM_PSIZE_1G  3
> +#define PM_PSIZE_4M  2
> +#define PM_PSIZE_2M  1
> +#endif

No, we should factor this correctly and get the CONFIG_X86 stuff out of here.

> +#define PM_ENTRY_BYTES   sizeof(struct ppte)
> +#define PM_END_OF_BUFFER 1
> +
> +static int add_to_pagemap(unsigned long addr, struct ppte ppte,
> struct pagemapread *pm)
>  {
>   /*
> @@ -545,13 +552,13 @@ static int add_to_pagemap(unsigned long addr, u64 pfn,
>* the pfn.
>*/
>   if (pm->out + PM_ENTRY_BYTES >= pm->end) {
> - if (copy_to_user(pm->out, &pfn, pm->end - pm->out))
> + if (copy_to_user(pm->out, &ppte, pm->end - pm->out))
>   return -EFAULT;
>   pm->out = pm->end;
>   return PM_END_OF_BUFFER;
>   }
>  
> - if (put_user(pfn, pm->out))
> + if (copy_to_user(pm->out, &ppte, sizeof(ppte)))
>   return -EFAULT;
>   pm->out += PM_ENTRY_BYTES;
>   return 0;
> @@ -564,7 +571,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned 
> long end,
>   unsigned long addr;
>   int err = 0;
>   for (addr = start; addr < end; addr += PAGE_SIZE) {
> - err = add_to_pagemap(addr, PM_NOT_PRESENT, pm);
> + err = add_to_pagemap(addr, (struct ppte) {0, 0, 0, 0}, pm);
>   if (err)
>   break;
>   }
> @@ -574,7 +581,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned 
> long end,
>  u64 swap_pte_to_pagemap_entry(pte_t pte)
>  {
>   swp_entry_t e = pte_to_swp_entry(pte);
> - return PM_SWAP | swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
> + return swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
>  }
>  
>  static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long 
> end,
> @@ -584,16 +591,37 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long 
> addr, unsigned long end,
>   

Re: [RFC][PATCH] make /proc/pid/pagemap work with huge pages and return page size

2008-02-23 Thread Andrew Morton
On Wed, 20 Feb 2008 14:57:43 +0100 Hans Rosenfeld [EMAIL PROTECTED] wrote:

 The current code for /proc/pid/pagemap does not work with huge pages (on
 x86). The code will make no difference between a normal pmd and a huge
 page pmd, trying to parse the contents of the huge page as ptes. Another
 problem is that there is no way to get information about the page size a
 specific mapping uses.
 
 Also, the current way the not present and swap bits are encoded in
 the returned pfn isn't very clean, especially not if this interface is
 going to be extended.
 
 I propose to change /proc/pid/pagemap to return a pseudo-pte instead of
 just a raw pfn. The pseudo-pte will contain:
 
 - 58 bits for the physical address of the first byte in the page, even
   less bits would probably be sufficient for quite a while
 
 - 4 bits for the page size, with 0 meaning native page size (4k on x86,
   8k on alpha, ...) and values 1-15 being specific to the architecture
   (I used 1 for 2M, 2 for 4M and 3 for 1G for x86)
 
 - a swap bit indicating that a not present page is paged out, with the
   physical address field containing page file number and block number
   just like before
 
 - a present bit just like in a real pte
   
 By shortening the field for the physical address, some more interesting
 information could be included, like read/write permissions and the like.
 The page size could also be returned directly, 6 bits could be used to
 express any page shift in a 64 bit system, but I found the encoded page
 size more useful for my specific use case.
 
 
 The attached patch changes the /proc/pid/pagemap code to use such a
 pseudo-pte. The huge page handling is currently limited to 2M/4M pages
 on x86, 1G pages will need some more work. To keep the simple mapping of
 virtual addresses to file index intact, any huge page pseudo-pte is
 replicated in the user buffer to map the equivalent range of small
 pages. 
 
 Note that I had to move the pmd_pfn() macro from asm-x86/pgtable_64.h to
 asm-x86/pgtable.h, it applies to both 32 bit and 64 bit x86.
 
 Other architectures will probably need other changes to support huge
 pages and return the page size.
 
 I think that the definition of the pseudo-pte structure and the page
 size codes should be made available through a header file, but I didn't
 do this for now.
 

If we're going to do this, we need to do it *fast*.  Once 2.6.25 goes out
our hands are tied.

That means talking with the maintainers of other hugepage-capable
architectures.

 +struct ppte {
 + uint64_t paddr:58;
 + uint64_t psize:4;
 + uint64_t swap:1;
 + uint64_t present:1;
 +};

This is part of the exported kernel interface and hence should be in a
header somewhere, shouldn't it?  The old stuff should have been too.

u64 is a bit more conventional than uint64_t, and if we move this to a
userspace-visible header then __u64 is the type to use, I think.  Although
one would expect uint64_t to be OK as well.

 +#ifdef CONFIG_X86
 +#define PM_PSIZE_1G  3
 +#define PM_PSIZE_4M  2
 +#define PM_PSIZE_2M  1
 +#endif

No, we should factor this correctly and get the CONFIG_X86 stuff out of here.

 +#define PM_ENTRY_BYTES   sizeof(struct ppte)
 +#define PM_END_OF_BUFFER 1
 +
 +static int add_to_pagemap(unsigned long addr, struct ppte ppte,
 struct pagemapread *pm)
  {
   /*
 @@ -545,13 +552,13 @@ static int add_to_pagemap(unsigned long addr, u64 pfn,
* the pfn.
*/
   if (pm-out + PM_ENTRY_BYTES = pm-end) {
 - if (copy_to_user(pm-out, pfn, pm-end - pm-out))
 + if (copy_to_user(pm-out, ppte, pm-end - pm-out))
   return -EFAULT;
   pm-out = pm-end;
   return PM_END_OF_BUFFER;
   }
  
 - if (put_user(pfn, pm-out))
 + if (copy_to_user(pm-out, ppte, sizeof(ppte)))
   return -EFAULT;
   pm-out += PM_ENTRY_BYTES;
   return 0;
 @@ -564,7 +571,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned 
 long end,
   unsigned long addr;
   int err = 0;
   for (addr = start; addr  end; addr += PAGE_SIZE) {
 - err = add_to_pagemap(addr, PM_NOT_PRESENT, pm);
 + err = add_to_pagemap(addr, (struct ppte) {0, 0, 0, 0}, pm);
   if (err)
   break;
   }
 @@ -574,7 +581,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned 
 long end,
  u64 swap_pte_to_pagemap_entry(pte_t pte)
  {
   swp_entry_t e = pte_to_swp_entry(pte);
 - return PM_SWAP | swp_type(e) | (swp_offset(e)  MAX_SWAPFILES_SHIFT);
 + return swp_type(e) | (swp_offset(e)  MAX_SWAPFILES_SHIFT);
  }
  
  static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long 
 end,
 @@ -584,16 +591,37 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long 
 addr, unsigned long end,
   pte_t *pte;
   int err = 0;
  
 +#ifdef CONFIG_X86
 + if (pmd_huge(*pmd)) {
 + struct ppte ppte = { 
 + 

Re: [RFC][PATCH] make /proc/pid/pagemap work with huge pages and return page size

2008-02-23 Thread Matt Mackall

On Sat, 2008-02-23 at 00:06 -0800, Andrew Morton wrote:
 On Wed, 20 Feb 2008 14:57:43 +0100 Hans Rosenfeld [EMAIL PROTECTED] wrote:
 
  The current code for /proc/pid/pagemap does not work with huge pages (on
  x86). The code does not distinguish between a normal pmd and a huge
  page pmd, and tries to parse the contents of the huge page as ptes. Another
  problem is that there is no way to get information about the page size a
  specific mapping uses.
  
  Also, the current way the "not present" and "swap" bits are encoded in
  the returned pfn isn't very clean, especially not if this interface is
  going to be extended.
  
  I propose to change /proc/pid/pagemap to return a pseudo-pte instead of
  just a raw pfn. The pseudo-pte will contain:
  
  - 58 bits for the physical address of the first byte in the page, even
   fewer bits would probably be sufficient for quite a while
  
  - 4 bits for the page size, with 0 meaning native page size (4k on x86,
8k on alpha, ...) and values 1-15 being specific to the architecture
(I used 1 for 2M, 2 for 4M and 3 for 1G for x86)
  
  - a "swap" bit indicating that a not present page is paged out, with the
physical address field containing page file number and block number
just like before
  
  - a "present" bit just like in a real pte

  By shortening the field for the physical address, some more interesting
  information could be included, like read/write permissions and the like.
  The page size could also be returned directly, 6 bits could be used to
  express any page shift in a 64 bit system, but I found the encoded page
  size more useful for my specific use case.
  
  
  The attached patch changes the /proc/pid/pagemap code to use such a
  pseudo-pte. The huge page handling is currently limited to 2M/4M pages
  on x86, 1G pages will need some more work. To keep the simple mapping of
  virtual addresses to file index intact, any huge page pseudo-pte is
  replicated in the user buffer to map the equivalent range of small
  pages. 
  
  Note that I had to move the pmd_pfn() macro from asm-x86/pgtable_64.h to
  asm-x86/pgtable.h, it applies to both 32 bit and 64 bit x86.
  
  Other architectures will probably need other changes to support huge
  pages and return the page size.
  
  I think that the definition of the pseudo-pte structure and the page
  size codes should be made available through a header file, but I didn't
  do this for now.
  
 
 If we're going to do this, we need to do it *fast*.  Once 2.6.25 goes out
 our hands are tied.
 
 That means talking with the maintainers of other hugepage-capable
 architectures.
 
  +struct ppte {
  +   uint64_t paddr:58;
  +   uint64_t psize:4;
  +   uint64_t swap:1;
  +   uint64_t present:1;
  +};
 
 This is part of the exported kernel interface and hence should be in a
 header somewhere, shouldn't it?  The old stuff should have been too.

I think we're better off not using bitfields here.

 u64 is a bit more conventional than uint64_t, and if we move this to a
 userspace-visible header then __u64 is the type to use, I think.  Although
 one would expect uint64_t to be OK as well.
 
  +#ifdef CONFIG_X86
  +#define PM_PSIZE_1G  3
  +#define PM_PSIZE_4M  2
  +#define PM_PSIZE_2M  1
  +#endif
 
 No, we should factor this correctly and get the CONFIG_X86 stuff out of here.

Perhaps my continuation bit idea.

 Matt?  Help?

Did my previous message make it out? This is probably my last message
for 24+ hours.

-- 
Mathematics is the supreme nostalgia of our time.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH] make /proc/pid/pagemap work with huge pages and return page size

2008-02-23 Thread Dave Hansen
On Sat, 2008-02-23 at 10:18 +0800, Matt Mackall wrote:
 Another
  problem is that there is no way to get information about the page size a
  specific mapping uses.

Is this true generically, or just with pagemap?  It seems like we should
have a way to tell that a particular mapping is of large pages.  I'm
cc'ing a few folks who might know.

  Also, the current way the "not present" and "swap" bits are encoded in
  the returned pfn isn't very clean, especially not if this interface is
  going to be extended.
 
 Fair.

Yup.

  I propose to change /proc/pid/pagemap to return a pseudo-pte instead of
  just a raw pfn. The pseudo-pte will contain:
  
  - 58 bits for the physical address of the first byte in the page, even
  fewer bits would probably be sufficient for quite a while

Well, whether we use a physical address of the first byte of the page or
a pfn doesn't really matter.  It just boils down to whether we use low
or high bits for the magic. :)

  - 4 bits for the page size, with 0 meaning native page size (4k on x86,
8k on alpha, ...) and values 1-15 being specific to the architecture
(I used 1 for 2M, 2 for 4M and 3 for 1G for x86)

"Native page size" probably a bad idea.  ppc64 can use 64k or 4k for its
"native" page size and has 16MB large pages (as well as some others).
To make it even more confusing, you can have a 64k kernel page size with
4k mmu mappings!

That said, this is a decent idea as long as we know that nobody will
ever have more than 16 page sizes.  

  - a "swap" bit indicating that a not present page is paged out, with the
physical address field containing page file number and block number
just like before
  
  - a "present" bit just like in a real pte
 
 This is ok-ish, but I can't say I like it much. Especially the page size
 field.
 
 But I don't really have many ideas here. Perhaps having a bit saying
 "this entry is really a continuation of the previous one". Then any page
 size can be trivially represented. This might also make the code on both
 sides simpler?

Yeah, it could just be a special flag plus a mask or offset showing how
many entries to back up to find the actual mapping.  If each huge page
entry just had something along the lines of:

PAGEMAP_HUGE_PAGE_BIT | HPAGE_MASK

You can see it's a huge mapping from the bit, and you can go find the
physical page by applying HPAGE_MASK to your current position in the
pagemap.

  diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
  index 49958cf..58af588 100644
  --- a/fs/proc/task_mmu.c
  +++ b/fs/proc/task_mmu.c
  @@ -527,16 +527,23 @@ struct pagemapread {
  char __user *out, *end;
   };
   
  -#define PM_ENTRY_BYTES sizeof(u64)
  -#define PM_RESERVED_BITS    3
  -#define PM_RESERVED_OFFSET  (64 - PM_RESERVED_BITS)
  -#define PM_RESERVED_MASK    (((1LL<<PM_RESERVED_BITS)-1) << PM_RESERVED_OFFSET)
  -#define PM_SPECIAL(nr)      (((nr) << PM_RESERVED_OFFSET) | PM_RESERVED_MASK)
  -#define PM_NOT_PRESENT  PM_SPECIAL(1LL)
  -#define PM_SWAP PM_SPECIAL(2LL)
  -#define PM_END_OF_BUFFER    1
  -
  -static int add_to_pagemap(unsigned long addr, u64 pfn,
  +struct ppte {
  +   uint64_t paddr:58;
  +   uint64_t psize:4;
  +   uint64_t swap:1;
  +   uint64_t present:1;
  +};

It'd be nice to keep the current convention, which is to stay away from
bitfields.

  +#ifdef CONFIG_X86
  +#define PM_PSIZE_1G  3
  +#define PM_PSIZE_4M  2
  +#define PM_PSIZE_2M  1
  +#endif

I do think this may get goofy in the future, especially for those
architectures which don't have page sizes tied to Linux pagetables.
Tomorrow, you might end up with:

  +#ifdef CONFIG_FUNNYARCH
  +#define PM_PSIZE_64M 4 
  +#define PM_PSIZE_1G  3
  +#define PM_PSIZE_4M  2
  +#define PM_PSIZE_2M  1
  +#endif

  +#define PM_ENTRY_BYTES   sizeof(struct ppte)
  +#define PM_END_OF_BUFFER 1
  +
  +static int add_to_pagemap(unsigned long addr, struct ppte ppte,
struct pagemapread *pm)
   {
  /*
  @@ -545,13 +552,13 @@ static int add_to_pagemap(unsigned long addr, u64 pfn,
   * the pfn.
   */
  if (pm->out + PM_ENTRY_BYTES >= pm->end) {
  -   if (copy_to_user(pm->out, &pfn, pm->end - pm->out))
  +   if (copy_to_user(pm->out, &ppte, pm->end - pm->out))
  return -EFAULT;
  pm->out = pm->end;
  return PM_END_OF_BUFFER;
  }
   
  -   if (put_user(pfn, pm->out))
  +   if (copy_to_user(pm->out, &ppte, sizeof(ppte)))
  return -EFAULT;
  pm->out += PM_ENTRY_BYTES;
  return 0;
  @@ -564,7 +571,7 @@ static int pagemap_pte_hole(unsigned long start, 
  unsigned long end,
  unsigned long addr;
  int err = 0;
  for (addr = start; addr < end; addr += PAGE_SIZE) {
  -   err = add_to_pagemap(addr, PM_NOT_PRESENT, pm);
  +   err = add_to_pagemap(addr, (struct ppte) {0, 0, 0, 0}, pm);
  if (err)
  break;
  }
  @@ -574,7 +581,7 @@ static int pagemap_pte_hole(unsigned long 

Re: [RFC][PATCH] make /proc/pid/pagemap work with huge pages and return page size

2008-02-22 Thread Matt Mackall
(sorry for the delay, travelling)

On Wed, 2008-02-20 at 14:57 +0100, Hans Rosenfeld wrote:
> The current code for /proc/pid/pagemap does not work with huge pages (on
> x86). The code does not distinguish between a normal pmd and a huge
> page pmd, and tries to parse the contents of the huge page as ptes. Another
> problem is that there is no way to get information about the page size a
> specific mapping uses.
> 
> Also, the current way the "not present" and "swap" bits are encoded in
> the returned pfn isn't very clean, especially not if this interface is
> going to be extended.

Fair.

> I propose to change /proc/pid/pagemap to return a pseudo-pte instead of
> just a raw pfn. The pseudo-pte will contain:
> 
> - 58 bits for the physical address of the first byte in the page, even
>   less bits would probably be sufficient for quite a while
> 
> - 4 bits for the page size, with 0 meaning native page size (4k on x86,
>   8k on alpha, ...) and values 1-15 being specific to the architecture
>   (I used 1 for 2M, 2 for 4M and 3 for 1G for x86)
> 
> - a "swap" bit indicating that a not present page is paged out, with the
>   physical address field containing page file number and block number
>   just like before
> 
> - a "present" bit just like in a real pte

This is ok-ish, but I can't say I like it much. Especially the page size
field.

But I don't really have many ideas here. Perhaps having a bit saying
"this entry is really a continuation of the previous one". Then any page
size can be trivially represented. This might also make the code on both
sides simpler?
  
> By shortening the field for the physical address, some more interesting
> information could be included, like read/write permissions and the like.
> The page size could also be returned directly, 6 bits could be used to
> express any page shift in a 64 bit system, but I found the encoded page
> size more useful for my specific use case.
> 
> 
> The attached patch changes the /proc/pid/pagemap code to use such a
> pseudo-pte. The huge page handling is currently limited to 2M/4M pages
> on x86, 1G pages will need some more work. To keep the simple mapping of
> virtual addresses to file index intact, any huge page pseudo-pte is
> replicated in the user buffer to map the equivalent range of small
> pages. 
> 
> Note that I had to move the pmd_pfn() macro from asm-x86/pgtable_64.h to
> asm-x86/pgtable.h, it applies to both 32 bit and 64 bit x86.
> 
> Other architectures will probably need other changes to support huge
> pages and return the page size.
> 
> I think that the definition of the pseudo-pte structure and the page
> size codes should be made available through a header file, but I didn't
> do this for now.
> 
> Signed-Off-By: Hans Rosenfeld <[EMAIL PROTECTED]>
> 
> ---
>  fs/proc/task_mmu.c   |   68 +
>  include/asm-x86/pgtable.h|2 +
>  include/asm-x86/pgtable_64.h |1 -
>  3 files changed, 50 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 49958cf..58af588 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -527,16 +527,23 @@ struct pagemapread {
>   char __user *out, *end;
>  };
>  
> -#define PM_ENTRY_BYTES sizeof(u64)
> -#define PM_RESERVED_BITS    3
> -#define PM_RESERVED_OFFSET  (64 - PM_RESERVED_BITS)
> -#define PM_RESERVED_MASK    (((1LL<<PM_RESERVED_BITS)-1) << PM_RESERVED_OFFSET)
> -#define PM_SPECIAL(nr)  (((nr) << PM_RESERVED_OFFSET) | PM_RESERVED_MASK)
> -#define PM_NOT_PRESENT  PM_SPECIAL(1LL)
> -#define PM_SWAP PM_SPECIAL(2LL)
> -#define PM_END_OF_BUFFER1
> -
> -static int add_to_pagemap(unsigned long addr, u64 pfn,
> +struct ppte {
> + uint64_t paddr:58;
> + uint64_t psize:4;
> + uint64_t swap:1;
> + uint64_t present:1;
> +};
> +
> +#ifdef CONFIG_X86
> +#define PM_PSIZE_1G  3
> +#define PM_PSIZE_4M  2
> +#define PM_PSIZE_2M  1
> +#endif
> +
> +#define PM_ENTRY_BYTES   sizeof(struct ppte)
> +#define PM_END_OF_BUFFER 1
> +
> +static int add_to_pagemap(unsigned long addr, struct ppte ppte,
> struct pagemapread *pm)
>  {
>   /*
> @@ -545,13 +552,13 @@ static int add_to_pagemap(unsigned long addr, u64 pfn,
>* the pfn.
>*/
>   if (pm->out + PM_ENTRY_BYTES >= pm->end) {
> - if (copy_to_user(pm->out, &pfn, pm->end - pm->out))
> + if (copy_to_user(pm->out, &ppte, pm->end - pm->out))
>   return -EFAULT;
>   pm->out = pm->end;
>   return PM_END_OF_BUFFER;
>   }
>  
> - if (put_user(pfn, pm->out))
> + if (copy_to_user(pm->out, &ppte, sizeof(ppte)))
>   return -EFAULT;
>   pm->out += PM_ENTRY_BYTES;
>   return 0;
> @@ -564,7 +571,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned 
> long end,
>   unsigned long addr;
>   int err = 0;
>   for (addr = start; addr < end; addr += PAGE_SIZE) {
> - err = add_to_pagemap(addr, 

[RFC][PATCH] make /proc/pid/pagemap work with huge pages and return page size

2008-02-20 Thread Hans Rosenfeld
The current code for /proc/pid/pagemap does not work with huge pages (on
x86). The code does not distinguish between a normal pmd and a huge
page pmd, and tries to parse the contents of the huge page as ptes. Another
problem is that there is no way to get information about the page size a
specific mapping uses.

Also, the current way the "not present" and "swap" bits are encoded in
the returned pfn isn't very clean, especially not if this interface is
going to be extended.

I propose to change /proc/pid/pagemap to return a pseudo-pte instead of
just a raw pfn. The pseudo-pte will contain:

- 58 bits for the physical address of the first byte in the page, even
  fewer bits would probably be sufficient for quite a while

- 4 bits for the page size, with 0 meaning native page size (4k on x86,
  8k on alpha, ...) and values 1-15 being specific to the architecture
  (I used 1 for 2M, 2 for 4M and 3 for 1G for x86)

- a "swap" bit indicating that a not present page is paged out, with the
  physical address field containing page file number and block number
  just like before

- a "present" bit just like in a real pte
  
By shortening the field for the physical address, some more interesting
information could be included, like read/write permissions and the like.
The page size could also be returned directly, 6 bits could be used to
express any page shift in a 64 bit system, but I found the encoded page
size more useful for my specific use case.


The attached patch changes the /proc/pid/pagemap code to use such a
pseudo-pte. The huge page handling is currently limited to 2M/4M pages
on x86, 1G pages will need some more work. To keep the simple mapping of
virtual addresses to file index intact, any huge page pseudo-pte is
replicated in the user buffer to map the equivalent range of small
pages. 

Note that I had to move the pmd_pfn() macro from asm-x86/pgtable_64.h to
asm-x86/pgtable.h, it applies to both 32 bit and 64 bit x86.

Other architectures will probably need other changes to support huge
pages and return the page size.

I think that the definition of the pseudo-pte structure and the page
size codes should be made available through a header file, but I didn't
do this for now.

Signed-Off-By: Hans Rosenfeld <[EMAIL PROTECTED]>

---
 fs/proc/task_mmu.c   |   68 +
 include/asm-x86/pgtable.h|2 +
 include/asm-x86/pgtable_64.h |1 -
 3 files changed, 50 insertions(+), 21 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 49958cf..58af588 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -527,16 +527,23 @@ struct pagemapread {
char __user *out, *end;
 };
 
-#define PM_ENTRY_BYTES sizeof(u64)
-#define PM_RESERVED_BITS    3
-#define PM_RESERVED_OFFSET  (64 - PM_RESERVED_BITS)
-#define PM_RESERVED_MASK    (((1LL<<PM_RESERVED_BITS)-1) << PM_RESERVED_OFFSET)
-#define PM_SPECIAL(nr)      (((nr) << PM_RESERVED_OFFSET) | PM_RESERVED_MASK)
-#define PM_NOT_PRESENT  PM_SPECIAL(1LL)
-#define PM_SWAP PM_SPECIAL(2LL)
-#define PM_END_OF_BUFFER1
-
-static int add_to_pagemap(unsigned long addr, u64 pfn,
+struct ppte {
+   uint64_t paddr:58;
+   uint64_t psize:4;
+   uint64_t swap:1;
+   uint64_t present:1;
+};
+
+#ifdef CONFIG_X86
+#define PM_PSIZE_1G  3
+#define PM_PSIZE_4M  2
+#define PM_PSIZE_2M  1
+#endif
+
+#define PM_ENTRY_BYTES   sizeof(struct ppte)
+#define PM_END_OF_BUFFER 1
+
+static int add_to_pagemap(unsigned long addr, struct ppte ppte,
  struct pagemapread *pm)
 {
/*
@@ -545,13 +552,13 @@ static int add_to_pagemap(unsigned long addr, u64 pfn,
 * the pfn.
 */
if (pm->out + PM_ENTRY_BYTES >= pm->end) {
-   if (copy_to_user(pm->out, &pfn, pm->end - pm->out))
+   if (copy_to_user(pm->out, &ppte, pm->end - pm->out))
return -EFAULT;
pm->out = pm->end;
return PM_END_OF_BUFFER;
}
 
-   if (put_user(pfn, pm->out))
+   if (copy_to_user(pm->out, &ppte, sizeof(ppte)))
return -EFAULT;
pm->out += PM_ENTRY_BYTES;
return 0;
@@ -564,7 +571,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned 
long end,
unsigned long addr;
int err = 0;
for (addr = start; addr < end; addr += PAGE_SIZE) {
-   err = add_to_pagemap(addr, PM_NOT_PRESENT, pm);
+   err = add_to_pagemap(addr, (struct ppte) {0, 0, 0, 0}, pm);
if (err)
break;
}
@@ -574,7 +581,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned 
long end,
 u64 swap_pte_to_pagemap_entry(pte_t pte)
 {
swp_entry_t e = pte_to_swp_entry(pte);
-   return PM_SWAP | swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
+   return swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
 }
 
 static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
@@ -584,16 +591,37 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long 
addr, unsigned long end,
pte_t *pte;
int err = 0;
 
+#ifdef CONFIG_X86
+   if (pmd_huge(*pmd)) {
+   struct ppte ppte = { 
+   .paddr = pmd_pfn(*pmd) << PAGE_SHIFT,
+   .psize = (HPAGE_SHIFT == 22 ?
+ PM_PSIZE_4M : PM_PSIZE_2M),
+   .swap  = 0,
+   .present = 1,
+   };
+
+   for(; addr != end; addr += PAGE_SIZE) {
+   err = add_to_pagemap(addr, ppte, pm);
+   if (err)
+   return err;
+   }
+   } else
+#endif
for (; addr != end; addr += PAGE_SIZE) {
-   u64 pfn = PM_NOT_PRESENT;
+   struct ppte ppte 
