Re: [RFC]x86: clearing access bit don't flush tlb

2013-01-14 Thread Shaohua Li
On Tue, Jan 08, 2013 at 02:03:25AM -0500, Rik van Riel wrote:
> On 01/08/2013 12:09 AM, H. Peter Anvin wrote:
> >On 01/07/2013 09:08 PM, Rik van Riel wrote:
> >>On 01/08/2013 12:03 AM, H. Peter Anvin wrote:
> >>>On 01/07/2013 08:55 PM, Shaohua Li wrote:
> 
> I searched a little bit, the change (doing TLB flush to clear access
> bit) is
> made between 2.6.7 - 2.6.8, I can't find the changelog, but I found a
> patch:
> http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.7-rc2/2.6.7-rc2-mm2/broken-out/mm-flush-tlb-when-clearing-young.patch
> 
> 
> The changelog declaims this is for arm/ppc/ppc64.
> 
> >>>
> >>>Not really.  It says that those have stumbled over it already.  It is
> >>>true in general that this change will make very frequently used pages
> >>>(which stick in the TLB) candidates for eviction.
> >>
> >>That is only true if the pages were to stay in the TLB for a
> >>very very long time.  Probably multiple seconds.
> >>
> >>>x86 would seem to be just as affected, although possibly with a
> >>>different frequency.
> >>>
> >>>Do we have any actual metrics on anything here?
> >>
> >>I suspect that if we do need to force a TLB flush for page
> >>reclaim purposes, it may make sense to do that TLB flush
> >>asynchronously. For example, kswapd could kick off a TLB
> >>flush of every CPU in the system once a second, when the
> >>system is under pageout pressure.
> >>
> >>We would have to do this in a smart way, so the kswapds
> >>from multiple nodes do not duplicate the work.
> >>
> >>If people want that kind of functionality, I would be
> >>happy to cook up an RFC patch.
> >>
> >
> >So it sounds like you're saying that this patch should never have been
> >applied in the first place?
> 
> It made sense at the time.

So you agree the patch is safe, right?
 
> However, with larger SMP systems, we may need a different
> mechanism to get the TLB flushes done after we clear a bunch
> of accessed bits.
> 
> One thing we could do is mark bits in a bitmap, keeping track
> of which CPUs should have their TLB flushed due to accessed bit
> scanning.
> 
> Then we could set a timer for eg. a 1 second timeout, after
> which the TLB flush IPIs get sent. If the timer is already
> pending, we do not start it, but piggyback on the invocation
> that is already scheduled to happen.
> 
> Does something like that make sense?

I don't understand why a larger SMP system matters here. To me, the only thing
that matters is whether the CPU has enough TLB entries. And if the system is
larger, memory is larger too, so the TLB entries will still not be sufficient.
Or are you worried that future, larger SMP systems will have very large TLBs?

Thanks,
Shaohua
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC]x86: clearing access bit don't flush tlb

2013-01-07 Thread Rik van Riel

On 01/08/2013 12:09 AM, H. Peter Anvin wrote:
> On 01/07/2013 09:08 PM, Rik van Riel wrote:
>> On 01/08/2013 12:03 AM, H. Peter Anvin wrote:
>>> On 01/07/2013 08:55 PM, Shaohua Li wrote:
>>>>
>>>> I searched a little bit, the change (doing TLB flush to clear access
>>>> bit) is
>>>> made between 2.6.7 - 2.6.8, I can't find the changelog, but I found a
>>>> patch:
>>>> http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.7-rc2/2.6.7-rc2-mm2/broken-out/mm-flush-tlb-when-clearing-young.patch
>>>>
>>>>
>>>> The changelog declaims this is for arm/ppc/ppc64.
>>>>
>>>
>>> Not really.  It says that those have stumbled over it already.  It is
>>> true in general that this change will make very frequently used pages
>>> (which stick in the TLB) candidates for eviction.
>>
>> That is only true if the pages were to stay in the TLB for a
>> very very long time.  Probably multiple seconds.
>>
>>> x86 would seem to be just as affected, although possibly with a
>>> different frequency.
>>>
>>> Do we have any actual metrics on anything here?
>>
>> I suspect that if we do need to force a TLB flush for page
>> reclaim purposes, it may make sense to do that TLB flush
>> asynchronously. For example, kswapd could kick off a TLB
>> flush of every CPU in the system once a second, when the
>> system is under pageout pressure.
>>
>> We would have to do this in a smart way, so the kswapds
>> from multiple nodes do not duplicate the work.
>>
>> If people want that kind of functionality, I would be
>> happy to cook up an RFC patch.
>>
>
> So it sounds like you're saying that this patch should never have been
> applied in the first place?

It made sense at the time.

However, with larger SMP systems, we may need a different
mechanism to get the TLB flushes done after we clear a bunch
of accessed bits.

One thing we could do is mark bits in a bitmap, keeping track
of which CPUs should have their TLB flushed due to accessed bit
scanning.

Then we could set a timer for e.g. a 1 second timeout, after
which the TLB flush IPIs get sent. If the timer is already
pending, we do not start it, but piggyback on the invocation
that is already scheduled to happen.

Does something like that make sense?
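
As a rough illustration of the bitmap-plus-timer idea above, here is a
minimal userspace sketch in C. It is not kernel code and not a patch from
this thread: the struct, the function names, the 64-CPU limit and the
abstract time units are all assumptions made for the example. It only
shows the bookkeeping: mark the CPU in a bitmap, arm the timer if none is
pending, otherwise piggyback on the one already scheduled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_CPUS 64	/* assumption: fits in one 64-bit mask */

/* Hypothetical deferred-flush state; every name here is illustrative. */
struct toy_deferred_flush {
	uint64_t pending_cpus;	/* bit n set: CPU n needs a TLB flush */
	bool timer_pending;	/* a flush is already scheduled */
	uint64_t expires;	/* time at which the scheduled flush fires */
};

/*
 * Called after clearing accessed bits that CPU `cpu` may still cache.
 * Returns true if this call armed the timer, false if it piggybacked
 * on a timer that was already pending.
 */
static bool toy_note_cleared(struct toy_deferred_flush *s, unsigned int cpu,
			     uint64_t now, uint64_t delay)
{
	assert(cpu < TOY_NR_CPUS);
	s->pending_cpus |= UINT64_C(1) << cpu;
	if (s->timer_pending)
		return false;
	s->timer_pending = true;
	s->expires = now + delay;
	return true;
}

/*
 * Timer check. Returns the mask of CPUs that should receive a flush IPI
 * if the timer has expired, or 0 otherwise. State is reset after firing.
 */
static uint64_t toy_timer_tick(struct toy_deferred_flush *s, uint64_t now)
{
	uint64_t mask;

	if (!s->timer_pending || now < s->expires)
		return 0;
	mask = s->pending_cpus;
	s->pending_cpus = 0;
	s->timer_pending = false;
	return mask;
}
```

A real implementation would need a cpumask rather than a single word, and
locking or atomics around the shared state; both are left out here to keep
the batching logic visible.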

--
All rights reversed


Re: [RFC]x86: clearing access bit don't flush tlb

2013-01-07 Thread H. Peter Anvin
On 01/07/2013 09:08 PM, Rik van Riel wrote:
> On 01/08/2013 12:03 AM, H. Peter Anvin wrote:
>> On 01/07/2013 08:55 PM, Shaohua Li wrote:
>>>
>>> I searched a little bit, the change (doing TLB flush to clear access
>>> bit) is
>>> made between 2.6.7 - 2.6.8, I can't find the changelog, but I found a
>>> patch:
>>> http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.7-rc2/2.6.7-rc2-mm2/broken-out/mm-flush-tlb-when-clearing-young.patch
>>>
>>>
>>> The changelog declaims this is for arm/ppc/ppc64.
>>>
>>
>> Not really.  It says that those have stumbled over it already.  It is
>> true in general that this change will make very frequently used pages
>> (which stick in the TLB) candidates for eviction.
> 
> That is only true if the pages were to stay in the TLB for a
> very very long time.  Probably multiple seconds.
> 
>> x86 would seem to be just as affected, although possibly with a
>> different frequency.
>>
>> Do we have any actual metrics on anything here?
> 
> I suspect that if we do need to force a TLB flush for page
> reclaim purposes, it may make sense to do that TLB flush
> asynchronously. For example, kswapd could kick off a TLB
> flush of every CPU in the system once a second, when the
> system is under pageout pressure.
> 
> We would have to do this in a smart way, so the kswapds
> from multiple nodes do not duplicate the work.
> 
> If people want that kind of functionality, I would be
> happy to cook up an RFC patch.
> 

So it sounds like you're saying that this patch should never have been
applied in the first place?

-hpa



Re: [RFC]x86: clearing access bit don't flush tlb

2013-01-07 Thread Rik van Riel

On 01/08/2013 12:03 AM, H. Peter Anvin wrote:
> On 01/07/2013 08:55 PM, Shaohua Li wrote:
>>
>> I searched a little bit, the change (doing TLB flush to clear access bit) is
>> made between 2.6.7 - 2.6.8, I can't find the changelog, but I found a patch:
>> http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.7-rc2/2.6.7-rc2-mm2/broken-out/mm-flush-tlb-when-clearing-young.patch
>>
>> The changelog declaims this is for arm/ppc/ppc64.
>>
>
> Not really.  It says that those have stumbled over it already.  It is
> true in general that this change will make very frequently used pages
> (which stick in the TLB) candidates for eviction.

That is only true if the pages were to stay in the TLB for a
very very long time.  Probably multiple seconds.

> x86 would seem to be just as affected, although possibly with a
> different frequency.
>
> Do we have any actual metrics on anything here?


I suspect that if we do need to force a TLB flush for page
reclaim purposes, it may make sense to do that TLB flush
asynchronously. For example, kswapd could kick off a TLB
flush of every CPU in the system once a second, when the
system is under pageout pressure.

We would have to do this in a smart way, so the kswapds
from multiple nodes do not duplicate the work.

If people want that kind of functionality, I would be
happy to cook up an RFC patch.
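
One possible way to keep several per-node reclaim threads from duplicating
the periodic flush is to let them race on a shared "last flush" timestamp
with a compare-and-swap, so that only one wins per interval. The sketch
below is a userspace illustration in C11, not kernel code; the variable
and function names and the abstract time units are assumptions made for
the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Time of the last system-wide flush, shared by all (hypothetical)
 * per-node reclaim threads.
 */
static _Atomic uint64_t toy_last_flush;

/*
 * Returns true if the caller won the right to issue the flush for this
 * interval. Callers that arrive within `interval` of the last flush, or
 * that lose the race to update the timestamp, return false and skip.
 * Assumes `now` never goes backwards.
 */
static bool toy_try_claim_flush(uint64_t now, uint64_t interval)
{
	uint64_t last = atomic_load(&toy_last_flush);

	assert(interval > 0);
	if (now - last < interval)
		return false;
	/* Only one thread can move last -> now; the others see a mismatch. */
	return atomic_compare_exchange_strong(&toy_last_flush, &last, now);
}
```

The thread that gets true would then send the flush; everyone else simply
continues scanning.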

--
All rights reversed


Re: [RFC]x86: clearing access bit don't flush tlb

2013-01-07 Thread H. Peter Anvin
On 01/07/2013 08:55 PM, Shaohua Li wrote:
> 
> I searched a little bit, the change (doing TLB flush to clear access bit) is
> made between 2.6.7 - 2.6.8, I can't find the changelog, but I found a patch:
> http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.7-rc2/2.6.7-rc2-mm2/broken-out/mm-flush-tlb-when-clearing-young.patch
> 
> The changelog declaims this is for arm/ppc/ppc64.
> 

Not really.  It says that those have stumbled over it already.  It is
true in general that this change will make very frequently used pages
(which stick in the TLB) candidates for eviction.

x86 would seem to be just as affected, although possibly with a
different frequency.

Do we have any actual metrics on anything here?

-hpa



Re: [RFC]x86: clearing access bit don't flush tlb

2013-01-07 Thread Shaohua Li
On Mon, Jan 07, 2013 at 02:31:21PM -0800, H. Peter Anvin wrote:
> On 01/07/2013 07:14 AM, Rik van Riel wrote:
> > On 01/07/2013 03:12 AM, Shaohua Li wrote:
> >>
> >> We use access bit to age a page at page reclaim. When clearing pte
> >> access bit,
> >> we could skip tlb flush for the virtual address. The side effect is if
> >> the pte
> >> is in tlb and pte access bit is unset, when cpu access the page again,
> >> cpu will
> >> not set pte's access bit. So next time page reclaim can reclaim hot pages
> >> wrongly, but this doesn't corrupt anything. And according to intel
> >> manual, tlb
> >> has less than 1k entries, which coverers < 4M memory. In today's system,
> >> several giga byte memory is normal. After page reclaim clears pte
> >> access bit
> >> and before cpu access the page again, it's quite unlikely this page's
> >> pte is
> >> still in TLB. Skiping the tlb flush for this case sounds ok to me.
> > 
> > Agreed. In current systems, it can take a minute to write
> > all of memory to disk, while context switch (natural TLB
> > flush) times are in the dozens-of-millisecond timeframes.
> > 
> 
> I'm confused.  We used to do this since time immemorial, so if we aren't
> doing that now, that meant something changed somewhere along the line.
> It would be good to figure out if that was an intentional change or
> accidental.

I searched a little bit. The change (doing a TLB flush when clearing the access
bit) was made between 2.6.7 and 2.6.8. I can't find the changelog, but I found a
patch:
http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.7-rc2/2.6.7-rc2-mm2/broken-out/mm-flush-tlb-when-clearing-young.patch

The changelog claims this is for arm/ppc/ppc64.

Thanks,
Shaohua



Re: [RFC]x86: clearing access bit don't flush tlb

2013-01-07 Thread Simon Jeons
Hi Shaohua,

On Mon, 2013-01-07 at 16:12 +0800, Shaohua Li wrote:
> We use access bit to age a page at page reclaim. When clearing pte access bit,

Who sets this flag in the pte? The MMU? The TLB?

> we could skip tlb flush for the virtual address. The side effect is if the pte
> is in tlb and pte access bit is unset, when cpu access the page again, cpu 
> will
> not set pte's access bit. So next time page reclaim can reclaim hot pages
> wrongly, but this doesn't corrupt anything. And according to intel manual, tlb
> has less than 1k entries, which coverers < 4M memory. In today's system,
> several giga byte memory is normal. After page reclaim clears pte access bit
> and before cpu access the page again, it's quite unlikely this page's pte is
> still in TLB. Skiping the tlb flush for this case sounds ok to me.
> 

If one page is accessed more frequently than another before page
reclaim, page reclaim still treats them as equally hot, because the access
flag is only used to age pages at reclaim time. How do we handle this
issue?

> And in some workloads, TLB flush overhead is very heavy. In my simple
> multithread app with a lot of swap to several pcie SSD, removing the tlb flush
> gives about 20% ~ 30% swapout speedup.
> 
> Signed-off-by: Shaohua Li <s...@fusionio.com>
> ---
>  arch/x86/mm/pgtable.c |    7 +------
>  1 file changed, 1 insertion(+), 6 deletions(-)
> 
> Index: linux/arch/x86/mm/pgtable.c
> ===================================================================
> --- linux.orig/arch/x86/mm/pgtable.c	2012-12-17 16:54:37.847770807 +0800
> +++ linux/arch/x86/mm/pgtable.c	2013-01-07 14:59:40.898066357 +0800
> @@ -376,13 +376,8 @@ int pmdp_test_and_clear_young(struct vm_
>  int ptep_clear_flush_young(struct vm_area_struct *vma,
>  			   unsigned long address, pte_t *ptep)
>  {
> -	int young;
>  
> -	young = ptep_test_and_clear_young(vma, address, ptep);
> -	if (young)
> -		flush_tlb_page(vma, address);
> -
> -	return young;
> +	return ptep_test_and_clear_young(vma, address, ptep);
>  }
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> 


Re: [RFC]x86: clearing access bit don't flush tlb

2013-01-07 Thread H. Peter Anvin
On 01/07/2013 07:14 AM, Rik van Riel wrote:
> On 01/07/2013 03:12 AM, Shaohua Li wrote:
>>
>> We use access bit to age a page at page reclaim. When clearing pte
>> access bit,
>> we could skip tlb flush for the virtual address. The side effect is if
>> the pte
>> is in tlb and pte access bit is unset, when cpu access the page again,
>> cpu will
>> not set pte's access bit. So next time page reclaim can reclaim hot pages
>> wrongly, but this doesn't corrupt anything. And according to intel
>> manual, tlb
>> has less than 1k entries, which coverers < 4M memory. In today's system,
>> several giga byte memory is normal. After page reclaim clears pte
>> access bit
>> and before cpu access the page again, it's quite unlikely this page's
>> pte is
>> still in TLB. Skiping the tlb flush for this case sounds ok to me.
> 
> Agreed. In current systems, it can take a minute to write
> all of memory to disk, while context switch (natural TLB
> flush) times are in the dozens-of-millisecond timeframes.
> 

I'm confused.  We used to do this since time immemorial, so if we aren't
doing that now, that meant something changed somewhere along the line.
It would be good to figure out if that was an intentional change or
accidental.

-hpa




Re: [RFC]x86: clearing access bit don't flush tlb

2013-01-07 Thread Rik van Riel

On 01/07/2013 03:12 AM, Shaohua Li wrote:
>
> We use access bit to age a page at page reclaim. When clearing pte access bit,
> we could skip tlb flush for the virtual address. The side effect is if the pte
> is in tlb and pte access bit is unset, when cpu access the page again, cpu will
> not set pte's access bit. So next time page reclaim can reclaim hot pages
> wrongly, but this doesn't corrupt anything. And according to intel manual, tlb
> has less than 1k entries, which coverers < 4M memory. In today's system,
> several giga byte memory is normal. After page reclaim clears pte access bit
> and before cpu access the page again, it's quite unlikely this page's pte is
> still in TLB. Skiping the tlb flush for this case sounds ok to me.

Agreed. In current systems, it can take a minute to write
all of memory to disk, while context switch (natural TLB
flush) times are in the dozens-of-millisecond timeframes.

> And in some workloads, TLB flush overhead is very heavy. In my simple
> multithread app with a lot of swap to several pcie SSD, removing the tlb flush
> gives about 20% ~ 30% swapout speedup.
>
> Signed-off-by: Shaohua Li <s...@fusionio.com>

Reviewed-by: Rik van Riel <r...@redhat.com>


--
All rights reversed


[RFC]x86: clearing access bit don't flush tlb

2013-01-07 Thread Shaohua Li

We use the access bit to age a page during page reclaim. When clearing the pte
access bit, we could skip the TLB flush for the virtual address. The side effect
is that if the translation is still cached in the TLB when the pte access bit is
cleared, the CPU will not set the pte's access bit when it accesses the page
again. So the next page reclaim pass can wrongly reclaim hot pages, but this
doesn't corrupt anything. And according to the Intel manual, the TLB has fewer
than 1k entries, which covers < 4M of memory. In today's systems, several
gigabytes of memory is normal. After page reclaim clears the pte access bit and
before the CPU accesses the page again, it's quite unlikely that this page's pte
is still in the TLB. Skipping the TLB flush for this case sounds OK to me.
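
To make the coverage estimate concrete, here is a small C sketch of the
arithmetic. It assumes 4 KiB pages and takes 1024 entries as the upper
bound mentioned above; the 4 GiB memory size used in the usage note is
only an example of "several gigabytes". The function names are made up
for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of address space covered by a TLB with `entries` entries,
 * each mapping one page of `page_size` bytes.
 */
static uint64_t tlb_reach_bytes(uint64_t entries, uint64_t page_size)
{
	assert(page_size != 0);
	return entries * page_size;
}

/* TLB reach as parts per million of total memory (integer math). */
static uint64_t tlb_reach_ppm(uint64_t entries, uint64_t page_size,
			      uint64_t mem_bytes)
{
	assert(mem_bytes != 0);
	return tlb_reach_bytes(entries, page_size) * UINT64_C(1000000) / mem_bytes;
}
```

With 1024 entries and 4096-byte pages, tlb_reach_bytes() gives 4 MiB. On
a machine with 4 GiB of memory, that is 1/1024 of RAM, or just under 0.1%.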

And in some workloads, TLB flush overhead is very heavy. In my simple
multithreaded app that swaps heavily to several PCIe SSDs, removing the TLB
flush gives about a 20% ~ 30% swapout speedup.

Signed-off-by: Shaohua Li <s...@fusionio.com>
---
 arch/x86/mm/pgtable.c |    7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

Index: linux/arch/x86/mm/pgtable.c
===================================================================
--- linux.orig/arch/x86/mm/pgtable.c	2012-12-17 16:54:37.847770807 +0800
+++ linux/arch/x86/mm/pgtable.c	2013-01-07 14:59:40.898066357 +0800
@@ -376,13 +376,8 @@ int pmdp_test_and_clear_young(struct vm_
 int ptep_clear_flush_young(struct vm_area_struct *vma,
 			   unsigned long address, pte_t *ptep)
 {
-	int young;
 
-	young = ptep_test_and_clear_young(vma, address, ptep);
-	if (young)
-		flush_tlb_page(vma, address);
-
-	return young;
+	return ptep_test_and_clear_young(vma, address, ptep);
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
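
The trade-off in this patch can be illustrated with a toy userspace model
in C. It is a deliberate simplification and not kernel code: one page, one
cached translation, and the assumption that hardware sets the accessed bit
only when it walks the page table on a TLB miss. All names (toy_page,
toy_cpu_access, toy_clear_young, toy_tlb_evict) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of one page: the accessed bit in the in-memory pte, plus
 * whether a translation for the page is currently cached in the TLB.
 */
struct toy_page {
	bool pte_young;		/* accessed bit in the in-memory pte */
	bool tlb_cached;	/* translation currently held in the TLB */
};

/*
 * The CPU touches the page. In this model the accessed bit is set only
 * on a TLB miss, when the page table is walked.
 */
static void toy_cpu_access(struct toy_page *p)
{
	assert(p);
	if (!p->tlb_cached) {
		p->pte_young = true;
		p->tlb_cached = true;
	}
}

/*
 * Reclaim-side aging: test and clear the accessed bit. With flush=true
 * the cached translation is dropped too, as the pre-patch code does.
 */
static bool toy_clear_young(struct toy_page *p, bool flush)
{
	bool young = p->pte_young;

	p->pte_young = false;
	if (young && flush)
		p->tlb_cached = false;
	return young;
}

/* Natural eviction, e.g. a context switch or a capacity miss. */
static void toy_tlb_evict(struct toy_page *p)
{
	p->tlb_cached = false;
}
```

Without the flush, a page that is touched again while its translation is
still cached looks cold on the next toy_clear_young() call. Once the entry
is evicted by toy_tlb_evict(), the next access sets the bit again, which
is the behaviour the patch description relies on.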


Re: [RFC]x86: clearing access bit don't flush tlb

2013-01-07 Thread H. Peter Anvin
On 01/07/2013 08:55 PM, Shaohua Li wrote:
>
> I searched a little bit. The change (doing a TLB flush when clearing the
> access bit) was made between 2.6.7 and 2.6.8. I can't find the changelog,
> but I found a patch:
> http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.7-rc2/2.6.7-rc2-mm2/broken-out/mm-flush-tlb-when-clearing-young.patch
>
> The changelog claims this is for arm/ppc/ppc64.

Not really.  It says that those have stumbled over it already.  It is
true in general that this change will make very frequently used pages
(which stick in the TLB) candidates for eviction.

x86 would seem to be just as affected, although possibly with a
different frequency.

Do we have any actual metrics on anything here?

-hpa



Re: [RFC]x86: clearing access bit don't flush tlb

2013-01-07 Thread Rik van Riel

On 01/08/2013 12:03 AM, H. Peter Anvin wrote:
> On 01/07/2013 08:55 PM, Shaohua Li wrote:
> >
> > I searched a little bit. The change (doing a TLB flush when clearing
> > the access bit) was made between 2.6.7 and 2.6.8. I can't find the
> > changelog, but I found a patch:
> > http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.7-rc2/2.6.7-rc2-mm2/broken-out/mm-flush-tlb-when-clearing-young.patch
> >
> > The changelog claims this is for arm/ppc/ppc64.
>
> Not really.  It says that those have stumbled over it already.  It is
> true in general that this change will make very frequently used pages
> (which stick in the TLB) candidates for eviction.

That is only true if the pages were to stay in the TLB for a
very very long time.  Probably multiple seconds.

> x86 would seem to be just as affected, although possibly with a
> different frequency.
>
> Do we have any actual metrics on anything here?

I suspect that if we do need to force a TLB flush for page
reclaim purposes, it may make sense to do that TLB flush
asynchronously. For example, kswapd could kick off a TLB
flush of every CPU in the system once a second, when the
system is under pageout pressure.

We would have to do this in a smart way, so the kswapds
from multiple nodes do not duplicate the work.

If people want that kind of functionality, I would be
happy to cook up an RFC patch.
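
One way the "do not duplicate the work" requirement could be met is for
every kswapd to race on a shared timestamp, with only the winner issuing
the flush for that interval. The following is a rough userspace sketch
under stated assumptions, not the proposed patch: claim_periodic_flush(),
last_flush_ms and FLUSH_INTERVAL_MS are hypothetical names, the clock is
assumed to be monotonic and in milliseconds, and C11 atomics stand in for
whatever primitive the kernel would actually use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical interval: at most one global flush per second. */
#define FLUSH_INTERVAL_MS 1000u
static_assert(FLUSH_INTERVAL_MS > 0, "interval must be positive");

/* Time of the last claimed flush, shared by all kswapd threads. */
static _Atomic uint64_t last_flush_ms;

/* Returns true if the caller won the right to issue this interval's
 * flush. Callers that lose the race, or that arrive before the interval
 * has elapsed, get false and do nothing. Assumes now_ms never goes
 * backwards. */
static bool claim_periodic_flush(uint64_t now_ms)
{
    uint64_t last = atomic_load(&last_flush_ms);

    if (now_ms - last < FLUSH_INTERVAL_MS)
        return false;       /* someone flushed recently enough */

    /* Only one thread can move the timestamp from 'last' to 'now_ms'. */
    return atomic_compare_exchange_strong(&last_flush_ms, &last, now_ms);
}
```

A kswapd under pageout pressure would call claim_periodic_flush() with the
current time and kick off the flush of every CPU only when it returns true.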

--
All rights reversed


Re: [RFC]x86: clearing access bit don't flush tlb

2013-01-07 Thread H. Peter Anvin
On 01/07/2013 09:08 PM, Rik van Riel wrote:
> On 01/08/2013 12:03 AM, H. Peter Anvin wrote:
> > On 01/07/2013 08:55 PM, Shaohua Li wrote:
> > >
> > > I searched a little bit. The change (doing a TLB flush when clearing
> > > the access bit) was made between 2.6.7 and 2.6.8. I can't find the
> > > changelog, but I found a patch:
> > > http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.7-rc2/2.6.7-rc2-mm2/broken-out/mm-flush-tlb-when-clearing-young.patch
> > >
> > > The changelog claims this is for arm/ppc/ppc64.
> >
> > Not really.  It says that those have stumbled over it already.  It is
> > true in general that this change will make very frequently used pages
> > (which stick in the TLB) candidates for eviction.
>
> That is only true if the pages were to stay in the TLB for a
> very very long time.  Probably multiple seconds.
>
> > x86 would seem to be just as affected, although possibly with a
> > different frequency.
> >
> > Do we have any actual metrics on anything here?
>
> I suspect that if we do need to force a TLB flush for page
> reclaim purposes, it may make sense to do that TLB flush
> asynchronously. For example, kswapd could kick off a TLB
> flush of every CPU in the system once a second, when the
> system is under pageout pressure.
>
> We would have to do this in a smart way, so the kswapds
> from multiple nodes do not duplicate the work.
>
> If people want that kind of functionality, I would be
> happy to cook up an RFC patch.

So it sounds like you're saying that this patch should never have been
applied in the first place?

-hpa



Re: [RFC]x86: clearing access bit don't flush tlb

2013-01-07 Thread Rik van Riel

On 01/08/2013 12:09 AM, H. Peter Anvin wrote:
> On 01/07/2013 09:08 PM, Rik van Riel wrote:
> > On 01/08/2013 12:03 AM, H. Peter Anvin wrote:
> > > On 01/07/2013 08:55 PM, Shaohua Li wrote:
> > > >
> > > > I searched a little bit. The change (doing a TLB flush when
> > > > clearing the access bit) was made between 2.6.7 and 2.6.8. I
> > > > can't find the changelog, but I found a patch:
> > > > http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.7-rc2/2.6.7-rc2-mm2/broken-out/mm-flush-tlb-when-clearing-young.patch
> > > >
> > > > The changelog claims this is for arm/ppc/ppc64.
> > >
> > > Not really.  It says that those have stumbled over it already.  It is
> > > true in general that this change will make very frequently used pages
> > > (which stick in the TLB) candidates for eviction.
> >
> > That is only true if the pages were to stay in the TLB for a
> > very very long time.  Probably multiple seconds.
> >
> > > x86 would seem to be just as affected, although possibly with a
> > > different frequency.
> > >
> > > Do we have any actual metrics on anything here?
> >
> > I suspect that if we do need to force a TLB flush for page
> > reclaim purposes, it may make sense to do that TLB flush
> > asynchronously. For example, kswapd could kick off a TLB
> > flush of every CPU in the system once a second, when the
> > system is under pageout pressure.
> >
> > We would have to do this in a smart way, so the kswapds
> > from multiple nodes do not duplicate the work.
> >
> > If people want that kind of functionality, I would be
> > happy to cook up an RFC patch.
>
> So it sounds like you're saying that this patch should never have been
> applied in the first place?

It made sense at the time.

However, with larger SMP systems, we may need a different
mechanism to get the TLB flushes done after we clear a bunch
of accessed bits.

One thing we could do is mark bits in a bitmap, keeping track
of which CPUs should have their TLB flushed due to accessed bit
scanning.

Then we could set a timer for eg. a 1 second timeout, after
which the TLB flush IPIs get sent. If the timer is already
pending, we do not start it, but piggyback on the invocation
that is already scheduled to happen.

Does something like that make sense?
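
The bitmap-plus-timer idea above might look roughly like the following
single-threaded sketch. Everything here is an assumption made for
illustration: struct deferred_flush, note_young_cleared() and
flush_timer_fire() are hypothetical names, the CPU count is capped at 64 so
the bitmap fits in one word, and a real implementation would need locking
or atomics plus an actual timer and IPI mechanism.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NR_CPUS   64      /* hypothetical cap: one 64-bit bitmap */
#define SKETCH_TIMEOUT_MS 1000u  /* the "e.g. 1 second" timeout */

struct deferred_flush {
    uint64_t pending_cpus;       /* bit n set: CPU n still needs a flush */
    bool     timer_armed;        /* true while a flush is scheduled */
    uint64_t timer_expires_ms;   /* when the scheduled flush will run */
};

/* Called after accessed bits were cleared for pages mapped on 'cpu'.
 * Marks the CPU in the bitmap and arms the timer only if none is pending;
 * otherwise it piggybacks on the already scheduled invocation. */
static void note_young_cleared(struct deferred_flush *df, unsigned cpu,
                               uint64_t now_ms)
{
    assert(cpu < SKETCH_NR_CPUS);
    df->pending_cpus |= UINT64_C(1) << cpu;
    if (!df->timer_armed) {
        df->timer_armed = true;
        df->timer_expires_ms = now_ms + SKETCH_TIMEOUT_MS;
    }
}

/* Timer callback: returns the set of CPUs that should receive a flush IPI
 * and resets the state so the next clear arms a fresh timer. */
static uint64_t flush_timer_fire(struct deferred_flush *df)
{
    uint64_t mask = df->pending_cpus;

    df->pending_cpus = 0;
    df->timer_armed = false;
    return mask;
}
```

With this shape, any number of accessed-bit scans within one timeout window
costs at most one round of flush IPIs, sent only to the CPUs recorded in
the bitmap.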

--
All rights reversed