Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-27 Thread Benjamin Herrenschmidt
On Wed, 2013-08-21 at 09:30 -0700, Linus Torvalds wrote:
> I will be reverting the whole soft-dirty mess. I thought the
> bit-mapping games it played were already too complicated (the patch to
> pgtable-2level.h in commit 41bb3476b361 just makes me want to barf and
> came in very late, so I'm not positive about the whole soft-dirty mess
> in the first place). I really am not at all inclined to want to play
> games in this area any more. It's too damn late in the release window.

Anything that makes me try to scavenge a new PTE bits makes me
scream :-) Dunno if I'll manage to support this on power.

Also, it sort-of duplicates what KVM does for dirty tracking (for
migration, framebuffer updates, etc...). I wonder if KVM could consider
switching to this scheme, but then we end up with a user "break KVM"
file in /proc since the user can clear the refs.

I'd have been happier if the whole thing had essentially used a parallel
set of dirty tracking bitmaps (hooked up with the VMAs maybe). Add the
overhead there for as many "clients" as you want who will use the
facility and leave the PTE mostly alone basically. (I suppose we still
need to play PTE tricks to differenciate soft dirty RO vs. COW RO on
anonymous memory ?)

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-27 Thread Benjamin Herrenschmidt
On Wed, 2013-08-21 at 09:30 -0700, Linus Torvalds wrote:
 I will be reverting the whole soft-dirty mess. I thought the
 bit-mapping games it played were already too complicated (the patch to
 pgtable-2level.h in commit 41bb3476b361 just makes me want to barf and
 came in very late, so I'm not positive about the whole soft-dirty mess
 in the first place). I really am not at all inclined to want to play
 games in this area any more. It's too damn late in the release window.

Anything that makes me try to scavenge a new PTE bits makes me
scream :-) Dunno if I'll manage to support this on power.

Also, it sort-of duplicates what KVM does for dirty tracking (for
migration, framebuffer updates, etc...). I wonder if KVM could consider
switching to this scheme, but then we end up with a user break KVM
file in /proc since the user can clear the refs.

I'd have been happier if the whole thing had essentially used a parallel
set of dirty tracking bitmaps (hooked up with the VMAs maybe). Add the
overhead there for as many clients as you want who will use the
facility and leave the PTE mostly alone basically. (I suppose we still
need to play PTE tricks to differenciate soft dirty RO vs. COW RO on
anonymous memory ?)

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-22 Thread Cyrill Gorcunov
On Thu, Aug 22, 2013 at 03:37:33PM +0900, Minchan Kim wrote:
> > 
> > Look, good thing is that 7th bit also available on the 4level pages
> > (ie x86-64) without additional code modification, that's why I picked
> > it in first place. I prepared the patch locally which doesn't use
> > pse bit for tracking but it only makes code more complex.
> 
> I'm not sure you read this thread.
> http://comments.gmane.org/gmane.linux.kernel.mm/101756
> In summary, I'd like to use it to track sub-ranges of some processes.
> 
> I already had a time to investigate and it enhanced our workload x2 on ARM
> so I'd like to expand the concept for more general purpose.
> 
> For it, arch-specific stuff would be hurdle for port.
> So, I support your non-arch-specific solutions.
> Just FYI. Such future plan shouldn't force you.

Hi Minchan, thanks for info, I'll read the thread (sorry for delay,
managed to miss your email).
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-22 Thread Pavel Emelyanov
On 08/22/2013 03:33 PM, Jan Beulich wrote:

>> From: Cyrill Gorcunov 
>> Subject: [PATCH] mm: Make sure _PAGE_SWP_SOFT_DIRTY bit is not set on 
>> present pte
>>
>> _PAGE_SOFT_DIRTY bit should never be set on present pte so add
>> VM_BUG_ON to catch any potential future abuse.
>>
>> Also add a comment on _PAGE_SWP_SOFT_DIRTY definition explaining
>> scope of its usage.
>>
>> Signed-off-by: Cyrill Gorcunov 
> 
> Acked-by: Jan Beulich 

Acked-by: Pavel Emelyanov 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-22 Thread Jan Beulich
>>> On 22.08.13 at 13:27, Cyrill Gorcunov  wrote:
> On Thu, Aug 22, 2013 at 08:27:45AM +0100, Jan Beulich wrote:
>> >>> On 22.08.13 at 09:03, Cyrill Gorcunov  wrote:
>> > Ok, how about this?
>> > 
>> > static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
>> > {
>> >BUG_ON(pte_present(pte));
>> >return pte_set_flags(pte, _PAGE_SWP_SOFT_DIRTY);
>> > }
>> 
>> Sure, fine with me. Perhaps VM_BUG_ON() or some other similar
>> construct limiting the scope when any extra code gets generated
>> would do too.
> 
> Sorry for delay, the patch is below.
> 
>> 
>> But as said, even better would perhaps be to have it act on a
>> swp_entry_t.
> 
> swp_entry_t is too small already to keep additional status bit,
> unfortunately.
> ---
> From: Cyrill Gorcunov 
> Subject: [PATCH] mm: Make sure _PAGE_SWP_SOFT_DIRTY bit is not set on 
> present pte
> 
> _PAGE_SOFT_DIRTY bit should never be set on present pte so add
> VM_BUG_ON to catch any potential future abuse.
> 
> Also add a comment on _PAGE_SWP_SOFT_DIRTY definition explaining
> scope of its usage.
> 
> Signed-off-by: Cyrill Gorcunov 

Acked-by: Jan Beulich 

Thanks, Jan

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-22 Thread Cyrill Gorcunov
On Thu, Aug 22, 2013 at 08:27:45AM +0100, Jan Beulich wrote:
> >>> On 22.08.13 at 09:03, Cyrill Gorcunov  wrote:
> > Ok, how about this?
> > 
> > static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
> > {
> > BUG_ON(pte_present(pte));
> > return pte_set_flags(pte, _PAGE_SWP_SOFT_DIRTY);
> > }
> 
> Sure, fine with me. Perhaps VM_BUG_ON() or some other similar
> construct limiting the scope when any extra code gets generated
> would do too.

Sorry for delay, the patch is below.

> 
> But as said, even better would perhaps be to have it act on a
> swp_entry_t.

swp_entry_t is too small already to keep additional status bit,
unfortunately.
---
From: Cyrill Gorcunov 
Subject: [PATCH] mm: Make sure _PAGE_SWP_SOFT_DIRTY bit is not set on present 
pte

_PAGE_SOFT_DIRTY bit should never be set on present pte so add
VM_BUG_ON to catch any potential future abuse.

Also add a comment on _PAGE_SWP_SOFT_DIRTY definition explaining
scope of its usage.

Signed-off-by: Cyrill Gorcunov 
Cc: Andy Lutomirski 
Cc: Pavel Emelyanov 
Cc: Andrew Morton 
Cc: Matt Mackall 
Cc: Xiao Guangrong 
Cc: Marcelo Tosatti 
Cc: KOSAKI Motohiro 
Cc: Stephen Rothwell 
Cc: Peter Zijlstra 
Cc: "Aneesh Kumar K.V" 
Cc: David Vrabel 
Cc: Linus Torvalds 
Cc: Konrad Rzeszutek Wilk 
Cc: Boris Ostrovsky 
Cc: Jan Beulich 
---
 arch/x86/include/asm/pgtable.h   |   34 +++---
 arch/x86/include/asm/pgtable_types.h |3 +++
 2 files changed, 22 insertions(+), 15 deletions(-)

Index: linux-2.6.git/arch/x86/include/asm/pgtable.h
===
--- linux-2.6.git.orig/arch/x86/include/asm/pgtable.h
+++ linux-2.6.git/arch/x86/include/asm/pgtable.h
@@ -314,21 +314,6 @@ static inline pmd_t pmd_mksoft_dirty(pmd
return pmd_set_flags(pmd, _PAGE_SOFT_DIRTY);
 }
 
-static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
-{
-   return pte_set_flags(pte, _PAGE_SWP_SOFT_DIRTY);
-}
-
-static inline int pte_swp_soft_dirty(pte_t pte)
-{
-   return pte_flags(pte) & _PAGE_SWP_SOFT_DIRTY;
-}
-
-static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
-{
-   return pte_clear_flags(pte, _PAGE_SWP_SOFT_DIRTY);
-}
-
 static inline pte_t pte_file_clear_soft_dirty(pte_t pte)
 {
return pte_clear_flags(pte, _PAGE_SOFT_DIRTY);
@@ -445,6 +430,7 @@ pte_t *populate_extra_pte(unsigned long
 
 #ifndef __ASSEMBLY__
 #include 
+#include 
 #include 
 
 static inline int pte_none(pte_t pte)
@@ -863,6 +849,24 @@ static inline void update_mmu_cache_pmd(
 {
 }
 
+static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
+{
+   VM_BUG_ON(pte_present(pte));
+   return pte_set_flags(pte, _PAGE_SWP_SOFT_DIRTY);
+}
+
+static inline int pte_swp_soft_dirty(pte_t pte)
+{
+   VM_BUG_ON(pte_present(pte));
+   return pte_flags(pte) & _PAGE_SWP_SOFT_DIRTY;
+}
+
+static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
+{
+   VM_BUG_ON(pte_present(pte));
+   return pte_clear_flags(pte, _PAGE_SWP_SOFT_DIRTY);
+}
+
 #include 
 #endif /* __ASSEMBLY__ */
 
Index: linux-2.6.git/arch/x86/include/asm/pgtable_types.h
===
--- linux-2.6.git.orig/arch/x86/include/asm/pgtable_types.h
+++ linux-2.6.git/arch/x86/include/asm/pgtable_types.h
@@ -75,6 +75,9 @@
  * with swap entry format. On x86 bits 6 and 7 are *not* involved
  * into swap entry computation, but bit 6 is used for nonlinear
  * file mapping, so we borrow bit 7 for soft dirty tracking.
+ *
+ * Please note that this bit must be treated as swap dirty page
+ * mark if and only if the PTE has present bit clear!
  */
 #ifdef CONFIG_MEM_SOFT_DIRTY
 #define _PAGE_SWP_SOFT_DIRTY   _PAGE_PSE
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-22 Thread Pavel Emelyanov
On 08/22/2013 01:32 PM, David Vrabel wrote:
> On 22/08/13 00:04, Linus Torvalds wrote:
>> On Wed, Aug 21, 2013 at 12:03 PM, Cyrill Gorcunov  wrote:
>>>
>>> I personally don't see bug here because
>>>
>>>  - this swapped page soft dirty bit is set for non-present entries only,
>>>never for present ones, just at moment we form swap pte entry
>>>
>>>  - i don't find any code which would test for this bit directly without
>>>is_swap_pte call
>>
>> Ok, having gone through the places that use swp_*soft_dirty(), I have
>> to agree. Afaik, it's only ever used on a swap-entry that has (by
>> definition) the P bit clear. So with or without Xen, I don't see how
>> it can make any difference.
>>
>> David/Konrad - did you actually see any issues, or was this just from
>> (mis)reading the code?
> 
> There are no Xen related bugs in the code, we were misreading it.
> 
> It was my call to raise this as a regression without a repro and clearly
> this was the wrong decision.
> 
> However, having looked at the soft dirty implementation and specifically
> the userspace ABI I think that it is far to closely coupled to the
> current implementation.  I think this will constrain future development
> of the feature should userspace require a more efficient ABI than
> scanning all of /proc//pagemaps.
> 
> Minimal downtime during 'live' checkpointing of a running task needs the
> checkpointer to find and write out dirty pages faster than the task can
> dirty them.

Absolutely, but in "find and write" the "write" component is likely to take
the majority of time -- we can scan PTEs of a mapping MUCH faster, than
transmitting those over even 10Gbit link.

We actually see this IRL -- in CRIU there's an atomic test, that checks
mappings get dumped and restored properly. One of sub-tests is one 512Mb 
mapping.
With it total dump time _minus_ memory dump time (which includes not only 
pagemap
file scan, but also files, registers, process tree, sessions, etc.) is 
fractions 
of one second, while only the memory dump part's time is several seconds.

That said, the super-fast API for getting "what has changed" is not as tempting
to have as faster network/disk.

What is _more_ time consuming in iterative migration in our case is the need to
re-scan the whole /proc tree to get which processes had died and appeared, mess 
with /proc/pid/fd finding out what files were (re-)opened/closed/changed, 
talking
to sock_diag subsystem for sockets information and alike. However, we haven't 
yet
done careful analysis for what the slowest part is, but pagemap scans is 
definitely
not.

> This seems less likely to be possible if every iteration
> all PTEs have to be scanned by the checkpointer instead of (e.g.,)
> accessing a separate list of dirtied pages.

But we don't scan all the x64 virtual address space's PTEs, instead we first
analyze the /proc/pid/maps and scan only PTEs sitting in private mappings.

> David
> .
> 

Thanks,
Pavel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-22 Thread David Vrabel
On 22/08/13 00:04, Linus Torvalds wrote:
> On Wed, Aug 21, 2013 at 12:03 PM, Cyrill Gorcunov  wrote:
>>
>> I personally don't see bug here because
>>
>>  - this swapped page soft dirty bit is set for non-present entries only,
>>never for present ones, just at moment we form swap pte entry
>>
>>  - i don't find any code which would test for this bit directly without
>>is_swap_pte call
> 
> Ok, having gone through the places that use swp_*soft_dirty(), I have
> to agree. Afaik, it's only ever used on a swap-entry that has (by
> definition) the P bit clear. So with or without Xen, I don't see how
> it can make any difference.
> 
> David/Konrad - did you actually see any issues, or was this just from
> (mis)reading the code?

There are no Xen related bugs in the code, we were misreading it.

It was my call to raise this as a regression without a repro and clearly
this was the wrong decision.

However, having looked at the soft dirty implementation and specifically
the userspace ABI I think that it is far to closely coupled to the
current implementation.  I think this will constrain future development
of the feature should userspace require a more efficient ABI than
scanning all of /proc//pagemaps.

Minimal downtime during 'live' checkpointing of a running task needs the
checkpointer to find and write out dirty pages faster than the task can
dirty them.  This seems less likely to be possible if every iteration
all PTEs have to be scanned by the checkpointer instead of (e.g.,)
accessing a separate list of dirtied pages.

David
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-22 Thread Jan Beulich
>>> On 21.08.13 at 19:28, Andy Lutomirski  wrote:
> On Wed, Aug 21, 2013 at 6:48 AM, David Vrabel  wrote:
>> All,
>>
>> 179ef71c (mm: save soft-dirty bits on swapped pages) introduces a new
>> PTE bit on x86 _PTE_SWP_SOFT_DIRTY which has the same value as _PTE_PSE
>> and _PTE_PAT.
>>
>> With a Xen PV guest, the use of the _PTE_PAT will result in the page
>> having unexpected cachability which will introduce a range of subtle
>> performance and correctness issues.  Xen programs the entry 4 in the PAT
>> table with WC so a page that was previously WB will end up as WC.
>>
> 
> Kind of off topic, but do you have a summary of how Xen uses the high
> PAT bits?  I'm the one who wants WT, and if there's already precedent
> for using the high PAT bits, it'll be helpful.

Xen's public headers have a comment explaining this, with the
main information being this table:

 *  The PAT MSR is as follow (it is a 64-bit value, each entry is 8 bits):
 * PAT4 PAT0
 *   +---++++-+++
 *WC | WC | WB | UC | UC- | WC | WB |  <= Linux
 *   +---++++-+++
 *WC | WT | WB | UC | UC- | WT | WB |  <= BIOS (default when machine boots)
 *   +---++++-+++
 *WC | WP | WC | UC | UC- | WT | WB |  <= Xen
 *   +---++++-+++

i.e. Xen is retaining the BIOS (and legacy from non-PAT times)
meaning of the low four entries, putting WC and WP up into
the high half. The fact that the entry 6 is defined to be WC
is perhaps a mistake - it should really be considered reserved
for an eventual future memory type (just like entry 7). It also
seems like entry 6 is documented incorrectly here for Linux and
BIOS.

Jan

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-22 Thread Jan Beulich
>>> On 22.08.13 at 01:04, Linus Torvalds  wrote:
> On Wed, Aug 21, 2013 at 12:03 PM, Cyrill Gorcunov  wrote:
>>
>> I personally don't see bug here because
>>
>>  - this swapped page soft dirty bit is set for non-present entries only,
>>never for present ones, just at moment we form swap pte entry
>>
>>  - i don't find any code which would test for this bit directly without
>>is_swap_pte call
> 
> Ok, having gone through the places that use swp_*soft_dirty(), I have
> to agree. Afaik, it's only ever used on a swap-entry that has (by
> definition) the P bit clear. So with or without Xen, I don't see how
> it can make any difference.
> 
> David/Konrad - did you actually see any issues, or was this just from
> (mis)reading the code?

It was actually me (mis)reading the code - as pointed out to Cyrill
already, setting _PAGE_PAT in a pte_t without even a comment
saying that this can only ever be done with a non-present entry
made me expect problems on Xen, because it's clear that to date
bare metal Linux doesn't care about the state of _PAGE_PAT in
present entries due to the way the PAT MSR gets set (and hence
quite likely no-one would have noticed the supposed problem
while testing).

So a comment either alongside the definition of _PAGE_SWP_SOFT_DIRTY
or directly in pte_swp_{mk,clear_}soft_dirty() would have been
the minimal thing I'd have expected for this sort of re-use of bits.
Ideally even a VM_BUG_ON(pte_present()) or similar. And perhaps
pte_swp_soft_dirty() should be either looking at the present bit
too or similarly asserting that it's clear...

In any event - I'm sorry for the red herring.

Jan

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-22 Thread Jan Beulich
>>> On 22.08.13 at 09:03, Cyrill Gorcunov  wrote:
> Ok, how about this?
> 
> static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
> {
>   BUG_ON(pte_present(pte));
>   return pte_set_flags(pte, _PAGE_SWP_SOFT_DIRTY);
> }

Sure, fine with me. Perhaps VM_BUG_ON() or some other similar
construct limiting the scope when any extra code gets generated
would do too.

But as said, even better would perhaps be to have it act on a
swp_entry_t.

Jan

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-22 Thread Cyrill Gorcunov
On Thu, Aug 22, 2013 at 07:56:26AM +0100, Jan Beulich wrote:
> >>> On 21.08.13 at 18:19, Cyrill Gorcunov  wrote:
> > On Wed, Aug 21, 2013 at 05:03:13PM +0100, Jan Beulich wrote:
> >> > 
> >> > Only to non-present ptes, as far as I know.
> >> 
> >> That's not really any guarantee. And the accessor functions also
> >> don't check that they'd be used on non-present PTEs only.
> > 
> > Wait. This _PAGE_SWP_SOFT_DIRTY bit (which is in real PSE bit) assigned
> > in only one place -- in try_to_unmap_one(). The PTE get non-present then
> > and consists of swap entry format. I don't see any accessor to such entry
> > without testing if it's swap entry or pte-none. What I'm missing?
> 
> Fact is that this
> 
> static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
> {
>   return pte_set_flags(pte, _PAGE_SWP_SOFT_DIRTY);
> }
> 
> has no checking whatsoever that the PTE being modified is a
> non-present one, not even in any of the debugging modes. It
> would be a different thing if the above acted on a swp_entry_t.
> 
> The fact that there currently may be just a single call site (where
> the caller guarantees the non-present state) is no guarantee that
> in the future another one won't appear, and then result in very
> hard to debug problems.

Ok, how about this?

static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
{
BUG_ON(pte_present(pte));
return pte_set_flags(pte, _PAGE_SWP_SOFT_DIRTY);
}
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-22 Thread Jan Beulich
>>> On 21.08.13 at 18:19, Cyrill Gorcunov  wrote:
> On Wed, Aug 21, 2013 at 05:03:13PM +0100, Jan Beulich wrote:
>> > 
>> > Only to non-present ptes, as far as I know.
>> 
>> That's not really any guarantee. And the accessor functions also
>> don't check that they'd be used on non-present PTEs only.
> 
> Wait. This _PAGE_SWP_SOFT_DIRTY bit (which is in real PSE bit) assigned
> in only one place -- in try_to_unmap_one(). The PTE get non-present then
> and consists of swap entry format. I don't see any accessor to such entry
> without testing if it's swap entry or pte-none. What I'm missing?

Fact is that this

static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
{
return pte_set_flags(pte, _PAGE_SWP_SOFT_DIRTY);
}

has no checking whatsoever that the PTE being modified is a
non-present one, not even in any of the debugging modes. It
would be a different thing if the above acted on a swp_entry_t.

The fact that there currently may be just a single call site (where
the caller guarantees the non-present state) is no guarantee that
in the future another one won't appear, and then result in very
hard to debug problems.

Jan

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-22 Thread Pavel Emelyanov
On 08/22/2013 04:51 AM, Dave Jones wrote:
> On Wed, Aug 21, 2013 at 04:04:54PM -0700, Linus Torvalds wrote:
>  > On Wed, Aug 21, 2013 at 12:03 PM, Cyrill Gorcunov  
> wrote:
>  > >
>  > > I personally don't see bug here because
>  > >
>  > >  - this swapped page soft dirty bit is set for non-present entries only,
>  > >never for present ones, just at moment we form swap pte entry
>  > >
>  > >  - i don't find any code which would test for this bit directly without
>  > >is_swap_pte call
>  > 
>  > Ok, having gone through the places that use swp_*soft_dirty(), I have
>  > to agree. Afaik, it's only ever used on a swap-entry that has (by
>  > definition) the P bit clear. So with or without Xen, I don't see how
>  > it can make any difference.
>  > 
>  > David/Konrad - did you actually see any issues, or was this just from
>  > (mis)reading the code?
> 
> Could this explain what I'm seeing in another thread ?
> https://lkml.org/lkml/2013/8/7/27

Was it caught with CONFIG_MEM_SOFT_DIRTY on or off? In the latter case all new 
bits manipulations are no-op and couldn't cause this.

>   Dave

Thanks,
Pavel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-22 Thread Minchan Kim
Hello Cyrill,

On Thu, Aug 22, 2013 at 09:49:19AM +0400, Cyrill Gorcunov wrote:
> On Wed, Aug 21, 2013 at 04:42:53PM -0700, Andi Kleen wrote:
> > Cyrill Gorcunov  writes:
> > >
> > > Hi all, I worked on patch which would not touch PSE bit for dirty page
> > > tracking and the result is not that good:
> > >
> > >  - 2level pages now always page dirty if page is swapped in and out, 
> > > because
> > >there is no space left in PTE (other than PSE bit)
> > 
> > Maybe just don't support soft dirty for 2 level page tables?
> > 
> > 2 level page tables should be really on the way out anyways, as they
> > have severe limits and do not support NX. With 3 levels there is enough
> > space.
> 
> Look, good thing is that 7th bit also available on the 4level pages
> (ie x86-64) without additional code modification, that's why I picked
> it in first place. I prepared the patch locally which doesn't use
> pse bit for tracking but it only makes code more complex.

I'm not sure you read this thread.
http://comments.gmane.org/gmane.linux.kernel.mm/101756
In summary, I'd like to use it to track sub-ranges of some processes.

I already had a time to investigate and it enhanced our workload x2 on ARM
so I'd like to expand the concept for more general purpose.

For it, arch-specific stuff would be hurdle for port.
So, I support your non-arch-specific solutions.
Just FYI. Such future plan shouldn't force you.

Thanks.

-- 
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-22 Thread Minchan Kim
Hello Cyrill,

On Thu, Aug 22, 2013 at 09:49:19AM +0400, Cyrill Gorcunov wrote:
 On Wed, Aug 21, 2013 at 04:42:53PM -0700, Andi Kleen wrote:
  Cyrill Gorcunov gorcu...@gmail.com writes:
  
   Hi all, I worked on patch which would not touch PSE bit for dirty page
   tracking and the result is not that good:
  
- 2level pages now always page dirty if page is swapped in and out, 
   because
  there is no space left in PTE (other than PSE bit)
  
  Maybe just don't support soft dirty for 2 level page tables?
  
  2 level page tables should be really on the way out anyways, as they
  have severe limits and do not support NX. With 3 levels there is enough
  space.
 
 Look, good thing is that 7th bit also available on the 4level pages
 (ie x86-64) without additional code modification, that's why I picked
 it in first place. I prepared the patch locally which doesn't use
 pse bit for tracking but it only makes code more complex.

I'm not sure you read this thread.
http://comments.gmane.org/gmane.linux.kernel.mm/101756
In summary, I'd like to use it to track sub-ranges of some processes.

I already had a time to investigate and it enhanced our workload x2 on ARM
so I'd like to expand the concept for more general purpose.

For it, arch-specific stuff would be hurdle for port.
So, I support your non-arch-specific solutions.
Just FYI. Such future plan shouldn't force you.

Thanks.

-- 
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-22 Thread Pavel Emelyanov
On 08/22/2013 04:51 AM, Dave Jones wrote:
 On Wed, Aug 21, 2013 at 04:04:54PM -0700, Linus Torvalds wrote:
   On Wed, Aug 21, 2013 at 12:03 PM, Cyrill Gorcunov gorcu...@gmail.com 
 wrote:
   
I personally don't see bug here because
   
 - this swapped page soft dirty bit is set for non-present entries only,
   never for present ones, just at moment we form swap pte entry
   
 - i don't find any code which would test for this bit directly without
   is_swap_pte call
   
   Ok, having gone through the places that use swp_*soft_dirty(), I have
   to agree. Afaik, it's only ever used on a swap-entry that has (by
   definition) the P bit clear. So with or without Xen, I don't see how
   it can make any difference.
   
   David/Konrad - did you actually see any issues, or was this just from
   (mis)reading the code?
 
 Could this explain what I'm seeing in another thread ?
 https://lkml.org/lkml/2013/8/7/27

Was it caught with CONFIG_MEM_SOFT_DIRTY on or off? In the latter case all new 
bits manipulations are no-op and couldn't cause this.

   Dave

Thanks,
Pavel
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-22 Thread Jan Beulich
 On 21.08.13 at 18:19, Cyrill Gorcunov gorcu...@gmail.com wrote:
 On Wed, Aug 21, 2013 at 05:03:13PM +0100, Jan Beulich wrote:
  
  Only to non-present ptes, as far as I know.
 
 That's not really any guarantee. And the accessor functions also
 don't check that they'd be used on non-present PTEs only.
 
 Wait. This _PAGE_SWP_SOFT_DIRTY bit (which is in real PSE bit) assigned
 in only one place -- in try_to_unmap_one(). The PTE get non-present then
 and consists of swap entry format. I don't see any accessor to such entry
 without testing if it's swap entry or pte-none. What I'm missing?

Fact is that this

static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
{
return pte_set_flags(pte, _PAGE_SWP_SOFT_DIRTY);
}

has no checking whatsoever that the PTE being modified is a
non-present one, not even in any of the debugging modes. It
would be a different thing if the above acted on a swp_entry_t.

The fact that there currently may be just a single call site (where
the caller guarantees the non-present state) is no guarantee that
in the future another one won't appear, and then result in very
hard to debug problems.

Jan

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-22 Thread Cyrill Gorcunov
On Thu, Aug 22, 2013 at 07:56:26AM +0100, Jan Beulich wrote:
  On 21.08.13 at 18:19, Cyrill Gorcunov gorcu...@gmail.com wrote:
  On Wed, Aug 21, 2013 at 05:03:13PM +0100, Jan Beulich wrote:
   
   Only to non-present ptes, as far as I know.
  
  That's not really any guarantee. And the accessor functions also
  don't check that they'd be used on non-present PTEs only.
  
  Wait. This _PAGE_SWP_SOFT_DIRTY bit (which is in real PSE bit) assigned
  in only one place -- in try_to_unmap_one(). The PTE get non-present then
  and consists of swap entry format. I don't see any accessor to such entry
  without testing if it's swap entry or pte-none. What I'm missing?
 
 Fact is that this
 
 static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
 {
   return pte_set_flags(pte, _PAGE_SWP_SOFT_DIRTY);
 }
 
 has no checking whatsoever that the PTE being modified is a
 non-present one, not even in any of the debugging modes. It
 would be a different thing if the above acted on a swp_entry_t.
 
 The fact that there currently may be just a single call site (where
 the caller guarantees the non-present state) is no guarantee that
 in the future another one won't appear, and then result in very
 hard to debug problems.

Ok, how about this?

static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
{
BUG_ON(pte_present(pte));
return pte_set_flags(pte, _PAGE_SWP_SOFT_DIRTY);
}
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-22 Thread Jan Beulich
 On 22.08.13 at 09:03, Cyrill Gorcunov gorcu...@gmail.com wrote:
 Ok, how about this?
 
 static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
 {
   BUG_ON(pte_present(pte));
   return pte_set_flags(pte, _PAGE_SWP_SOFT_DIRTY);
 }

Sure, fine with me. Perhaps VM_BUG_ON() or some other similar
construct limiting the scope when any extra code gets generated
would do too.

But as said, even better would perhaps be to have it act on a
swp_entry_t.

Jan

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-22 Thread Jan Beulich
 On 22.08.13 at 01:04, Linus Torvalds torva...@linux-foundation.org wrote:
 On Wed, Aug 21, 2013 at 12:03 PM, Cyrill Gorcunov gorcu...@gmail.com wrote:

 I personally don't see bug here because

  - this swapped page soft dirty bit is set for non-present entries only,
never for present ones, just at moment we form swap pte entry

  - i don't find any code which would test for this bit directly without
is_swap_pte call
 
 Ok, having gone through the places that use swp_*soft_dirty(), I have
 to agree. Afaik, it's only ever used on a swap-entry that has (by
 definition) the P bit clear. So with or without Xen, I don't see how
 it can make any difference.
 
 David/Konrad - did you actually see any issues, or was this just from
 (mis)reading the code?

It was actually me (mis)reading the code - as pointed out to Cyrill
already, setting _PAGE_PAT in a pte_t without even a comment
saying that this can only ever be done with a non-present entry
made me expect problems on Xen, because it's clear that to date
bare metal Linux doesn't care about the state of _PAGE_PAT in
present entries due to the way the PAT MSR gets set (and hence
quite likely no-one would have noticed the supposed problem
while testing).

So a comment either alongside the definition of _PAGE_SWP_SOFT_DIRTY
or directly in pte_swp_{mk,clear_}soft_dirty() would have been
the minimal thing I'd have expected for this sort of re-use of bits.
Ideally even a VM_BUG_ON(pte_present()) or similar. And perhaps
pte_swp_soft_dirty() should be either looking at the present bit
too or similarly asserting that it's clear...

In any event - I'm sorry for the red herring.

Jan

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-22 Thread Jan Beulich
 On 21.08.13 at 19:28, Andy Lutomirski l...@amacapital.net wrote:
 On Wed, Aug 21, 2013 at 6:48 AM, David Vrabel david.vra...@citrix.com wrote:
 All,

 179ef71c (mm: save soft-dirty bits on swapped pages) introduces a new
 PTE bit on x86 _PTE_SWP_SOFT_DIRTY which has the same value as _PTE_PSE
 and _PTE_PAT.

 With a Xen PV guest, the use of the _PTE_PAT will result in the page
 having unexpected cachability which will introduce a range of subtle
 performance and correctness issues.  Xen programs the entry 4 in the PAT
 table with WC so a page that was previously WB will end up as WC.

 
 Kind of off topic, but do you have a summary of how Xen uses the high
 PAT bits?  I'm the one who wants WT, and if there's already precedent
 for using the high PAT bits, it'll be helpful.

Xen's public headers have a comment explaining this, with the
main information being this table:

 *  The PAT MSR is as follow (it is a 64-bit value, each entry is 8 bits):
 * PAT4 PAT0
 *   +---++++-+++
 *WC | WC | WB | UC | UC- | WC | WB |  = Linux
 *   +---++++-+++
 *WC | WT | WB | UC | UC- | WT | WB |  = BIOS (default when machine boots)
 *   +---++++-+++
 *WC | WP | WC | UC | UC- | WT | WB |  = Xen
 *   +---++++-+++

i.e. Xen is retaining the BIOS (and legacy from non-PAT times)
meaning of the low four entries, putting WC and WP up into
the high half. The fact that the entry 6 is defined to be WC
is perhaps a mistake - it should really be considered reserved
for an eventual future memory type (just like entry 7). It also
seems like entry 6 is documented incorrectly here for Linux and
BIOS.

Jan

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-22 Thread David Vrabel
On 22/08/13 00:04, Linus Torvalds wrote:
 On Wed, Aug 21, 2013 at 12:03 PM, Cyrill Gorcunov gorcu...@gmail.com wrote:

 I personally don't see bug here because

  - this swapped page soft dirty bit is set for non-present entries only,
never for present ones, just at moment we form swap pte entry

  - i don't find any code which would test for this bit directly without
is_swap_pte call
 
 Ok, having gone through the places that use swp_*soft_dirty(), I have
 to agree. Afaik, it's only ever used on a swap-entry that has (by
 definition) the P bit clear. So with or without Xen, I don't see how
 it can make any difference.
 
 David/Konrad - did you actually see any issues, or was this just from
 (mis)reading the code?

There are no Xen related bugs in the code, we were misreading it.

It was my call to raise this as a regression without a repro and clearly
this was the wrong decision.

However, having looked at the soft dirty implementation and specifically
the userspace ABI I think that it is far to closely coupled to the
current implementation.  I think this will constrain future development
of the feature should userspace require a more efficient ABI than
scanning all of /proc/pid/pagemaps.

Minimal downtime during 'live' checkpointing of a running task needs the
checkpointer to find and write out dirty pages faster than the task can
dirty them.  This seems less likely to be possible if every iteration
all PTEs have to be scanned by the checkpointer instead of (e.g.,)
accessing a separate list of dirtied pages.

David
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-22 Thread Pavel Emelyanov
On 08/22/2013 01:32 PM, David Vrabel wrote:
 On 22/08/13 00:04, Linus Torvalds wrote:
 On Wed, Aug 21, 2013 at 12:03 PM, Cyrill Gorcunov gorcu...@gmail.com wrote:

 I personally don't see bug here because

  - this swapped page soft dirty bit is set for non-present entries only,
never for present ones, just at moment we form swap pte entry

  - i don't find any code which would test for this bit directly without
is_swap_pte call

 Ok, having gone through the places that use swp_*soft_dirty(), I have
 to agree. Afaik, it's only ever used on a swap-entry that has (by
 definition) the P bit clear. So with or without Xen, I don't see how
 it can make any difference.

 David/Konrad - did you actually see any issues, or was this just from
 (mis)reading the code?
 
 There are no Xen related bugs in the code, we were misreading it.
 
 It was my call to raise this as a regression without a repro and clearly
 this was the wrong decision.
 
 However, having looked at the soft dirty implementation and specifically
 the userspace ABI I think that it is far to closely coupled to the
 current implementation.  I think this will constrain future development
 of the feature should userspace require a more efficient ABI than
 scanning all of /proc/pid/pagemaps.
 
 Minimal downtime during 'live' checkpointing of a running task needs the
 checkpointer to find and write out dirty pages faster than the task can
 dirty them.

Absolutely, but in find and write the write component is likely to take
the majority of time -- we can scan PTEs of a mapping MUCH faster, than
transmitting those over even 10Gbit link.

We actually see this IRL -- in CRIU there's an atomic test, that checks
mappings get dumped and restored properly. One of sub-tests is one 512Mb 
mapping.
With it total dump time _minus_ memory dump time (which includes not only 
pagemap
file scan, but also files, registers, process tree, sessions, etc.) is 
fractions 
of one second, while only the memory dump part's time is several seconds.

That said, the super-fast API for getting what has changed is not as tempting
to have as faster network/disk.

What is _more_ time consuming in iterative migration in our case is the need to
re-scan the whole /proc tree to get which processes had died and appeared, mess 
with /proc/pid/fd finding out what files were (re-)opened/closed/changed, 
talking
to sock_diag subsystem for sockets information and alike. However, we haven't 
yet
done careful analysis for what the slowest part is, but pagemap scans is 
definitely
not.

 This seems less likely to be possible if every iteration
 all PTEs have to be scanned by the checkpointer instead of (e.g.,)
 accessing a separate list of dirtied pages.

But we don't scan all the x64 virtual address space's PTEs, instead we first
analyze the /proc/pid/maps and scan only PTEs sitting in private mappings.

 David
 .
 

Thanks,
Pavel
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-22 Thread Cyrill Gorcunov
On Thu, Aug 22, 2013 at 08:27:45AM +0100, Jan Beulich wrote:
  On 22.08.13 at 09:03, Cyrill Gorcunov gorcu...@gmail.com wrote:
  Ok, how about this?
  
  static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
  {
  BUG_ON(pte_present(pte));
  return pte_set_flags(pte, _PAGE_SWP_SOFT_DIRTY);
  }
 
 Sure, fine with me. Perhaps VM_BUG_ON() or some other similar
 construct limiting the scope when any extra code gets generated
 would do too.

Sorry for delay, the patch is below.

 
 But as said, even better would perhaps be to have it act on a
 swp_entry_t.

swp_entry_t is too small already to keep additional status bit,
unfortunately.
---
From: Cyrill Gorcunov gorcu...@gmail.com
Subject: [PATCH] mm: Make sure _PAGE_SWP_SOFT_DIRTY bit is not set on present 
pte

_PAGE_SOFT_DIRTY bit should never be set on present pte so add
VM_BUG_ON to catch any potential future abuse.

Also add a comment on _PAGE_SWP_SOFT_DIRTY definition explaining
scope of its usage.

Signed-off-by: Cyrill Gorcunov gorcu...@openvz.org
Cc: Andy Lutomirski l...@amacapital.net
Cc: Pavel Emelyanov xe...@parallels.com
Cc: Andrew Morton a...@linux-foundation.org
Cc: Matt Mackall m...@selenic.com
Cc: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
Cc: Marcelo Tosatti mtosa...@redhat.com
Cc: KOSAKI Motohiro kosaki.motoh...@gmail.com
Cc: Stephen Rothwell s...@canb.auug.org.au
Cc: Peter Zijlstra pet...@infradead.org
Cc: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Cc: David Vrabel david.vra...@citrix.com
Cc: Linus Torvalds torva...@linux-foundation.org
Cc: Konrad Rzeszutek Wilk konrad.w...@oracle.com
Cc: Boris Ostrovsky boris.ostrov...@oracle.com
Cc: Jan Beulich jbeul...@suse.com
---
 arch/x86/include/asm/pgtable.h   |   34 +++---
 arch/x86/include/asm/pgtable_types.h |3 +++
 2 files changed, 22 insertions(+), 15 deletions(-)

Index: linux-2.6.git/arch/x86/include/asm/pgtable.h
===
--- linux-2.6.git.orig/arch/x86/include/asm/pgtable.h
+++ linux-2.6.git/arch/x86/include/asm/pgtable.h
@@ -314,21 +314,6 @@ static inline pmd_t pmd_mksoft_dirty(pmd
return pmd_set_flags(pmd, _PAGE_SOFT_DIRTY);
 }
 
-static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
-{
-   return pte_set_flags(pte, _PAGE_SWP_SOFT_DIRTY);
-}
-
-static inline int pte_swp_soft_dirty(pte_t pte)
-{
-   return pte_flags(pte)  _PAGE_SWP_SOFT_DIRTY;
-}
-
-static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
-{
-   return pte_clear_flags(pte, _PAGE_SWP_SOFT_DIRTY);
-}
-
 static inline pte_t pte_file_clear_soft_dirty(pte_t pte)
 {
return pte_clear_flags(pte, _PAGE_SOFT_DIRTY);
@@ -445,6 +430,7 @@ pte_t *populate_extra_pte(unsigned long
 
 #ifndef __ASSEMBLY__
 #include linux/mm_types.h
+#include linux/mmdebug.h
 #include linux/log2.h
 
 static inline int pte_none(pte_t pte)
@@ -863,6 +849,24 @@ static inline void update_mmu_cache_pmd(
 {
 }
 
+static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
+{
+   VM_BUG_ON(pte_present(pte));
+   return pte_set_flags(pte, _PAGE_SWP_SOFT_DIRTY);
+}
+
+static inline int pte_swp_soft_dirty(pte_t pte)
+{
+   VM_BUG_ON(pte_present(pte));
+   return pte_flags(pte)  _PAGE_SWP_SOFT_DIRTY;
+}
+
+static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
+{
+   VM_BUG_ON(pte_present(pte));
+   return pte_clear_flags(pte, _PAGE_SWP_SOFT_DIRTY);
+}
+
 #include asm-generic/pgtable.h
 #endif /* __ASSEMBLY__ */
 
Index: linux-2.6.git/arch/x86/include/asm/pgtable_types.h
===
--- linux-2.6.git.orig/arch/x86/include/asm/pgtable_types.h
+++ linux-2.6.git/arch/x86/include/asm/pgtable_types.h
@@ -75,6 +75,9 @@
  * with swap entry format. On x86 bits 6 and 7 are *not* involved
  * into swap entry computation, but bit 6 is used for nonlinear
  * file mapping, so we borrow bit 7 for soft dirty tracking.
+ *
+ * Please note that this bit must be treated as swap dirty page
+ * mark if and only if the PTE has present bit clear!
  */
 #ifdef CONFIG_MEM_SOFT_DIRTY
 #define _PAGE_SWP_SOFT_DIRTY   _PAGE_PSE
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-22 Thread Jan Beulich
 On 22.08.13 at 13:27, Cyrill Gorcunov gorcu...@gmail.com wrote:
 On Thu, Aug 22, 2013 at 08:27:45AM +0100, Jan Beulich wrote:
  On 22.08.13 at 09:03, Cyrill Gorcunov gorcu...@gmail.com wrote:
  Ok, how about this?
  
  static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
  {
 BUG_ON(pte_present(pte));
 return pte_set_flags(pte, _PAGE_SWP_SOFT_DIRTY);
  }
 
 Sure, fine with me. Perhaps VM_BUG_ON() or some other similar
 construct limiting the scope when any extra code gets generated
 would do too.
 
 Sorry for delay, the patch is below.
 
 
 But as said, even better would perhaps be to have it act on a
 swp_entry_t.
 
 swp_entry_t is too small already to keep additional status bit,
 unfortunately.
 ---
 From: Cyrill Gorcunov gorcu...@gmail.com
 Subject: [PATCH] mm: Make sure _PAGE_SWP_SOFT_DIRTY bit is not set on 
 present pte
 
 _PAGE_SOFT_DIRTY bit should never be set on present pte so add
 VM_BUG_ON to catch any potential future abuse.
 
 Also add a comment on _PAGE_SWP_SOFT_DIRTY definition explaining
 scope of its usage.
 
 Signed-off-by: Cyrill Gorcunov gorcu...@openvz.org

Acked-by: Jan Beulich jbeul...@suse.com

Thanks, Jan

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-22 Thread Pavel Emelyanov
On 08/22/2013 03:33 PM, Jan Beulich wrote:

 From: Cyrill Gorcunov gorcu...@gmail.com
 Subject: [PATCH] mm: Make sure _PAGE_SWP_SOFT_DIRTY bit is not set on 
 present pte

 _PAGE_SOFT_DIRTY bit should never be set on present pte so add
 VM_BUG_ON to catch any potential future abuse.

 Also add a comment on _PAGE_SWP_SOFT_DIRTY definition explaining
 scope of its usage.

 Signed-off-by: Cyrill Gorcunov gorcu...@openvz.org
 
 Acked-by: Jan Beulich jbeul...@suse.com

Acked-by: Pavel Emelyanov xe...@parallels.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-22 Thread Cyrill Gorcunov
On Thu, Aug 22, 2013 at 03:37:33PM +0900, Minchan Kim wrote:
  
  Look, good thing is that 7th bit also available on the 4level pages
  (ie x86-64) without additional code modification, that's why I picked
  it in first place. I prepared the patch locally which doesn't use
  pse bit for tracking but it only makes code more complex.
 
 I'm not sure you read this thread.
 http://comments.gmane.org/gmane.linux.kernel.mm/101756
 In summary, I'd like to use it to track sub-ranges of some processes.
 
 I already had a time to investigate and it enhanced our workload x2 on ARM
 so I'd like to expand the concept for more general purpose.
 
 For it, arch-specific stuff would be hurdle for port.
 So, I support your non-arch-specific solutions.
 Just FYI. Such future plan shouldn't force you.

Hi Minchan, thanks for info, I'll read the thread (sorry for delay,
managed to miss your email).
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Cyrill Gorcunov
On Wed, Aug 21, 2013 at 04:42:53PM -0700, Andi Kleen wrote:
> Cyrill Gorcunov  writes:
> >
> > Hi all, I worked on patch which would not touch PSE bit for dirty page
> > tracking and the result is not that good:
> >
> >  - 2level pages now always page dirty if page is swapped in and out, because
> >there is no space left in PTE (other than PSE bit)
> 
> Maybe just don't support soft dirty for 2 level page tables?
> 
> 2 level page tables should be really on the way out anyways, as they
> have severe limits and do not support NX. With 3 levels there is enough
> space.

Look, good thing is that 7th bit also available on the 4level pages
(ie x86-64) without additional code modification, that's why I picked
it in first place. I prepared the patch locally which doesn't use
pse bit for tracking but it only makes code more complex.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Cyrill Gorcunov
On Wed, Aug 21, 2013 at 08:51:15PM -0400, Dave Jones wrote:
> 
> Could this explain what I'm seeing in another thread ?
> https://lkml.org/lkml/2013/8/7/27

Don't think so. This code is merged in -rc6, while your report is saying the
kernel version is -rc4 (also this feature requires CONFIG_MEM_SOFT_DIRTY
to be turned on, do you have it?).
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Dave Jones
On Wed, Aug 21, 2013 at 04:04:54PM -0700, Linus Torvalds wrote:
 > On Wed, Aug 21, 2013 at 12:03 PM, Cyrill Gorcunov  wrote:
 > >
 > > I personally don't see bug here because
 > >
 > >  - this swapped page soft dirty bit is set for non-present entries only,
 > >never for present ones, just at moment we form swap pte entry
 > >
 > >  - i don't find any code which would test for this bit directly without
 > >is_swap_pte call
 > 
 > Ok, having gone through the places that use swp_*soft_dirty(), I have
 > to agree. Afaik, it's only ever used on a swap-entry that has (by
 > definition) the P bit clear. So with or without Xen, I don't see how
 > it can make any difference.
 > 
 > David/Konrad - did you actually see any issues, or was this just from
 > (mis)reading the code?

Could this explain what I'm seeing in another thread ?
https://lkml.org/lkml/2013/8/7/27

Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Andi Kleen
Cyrill Gorcunov  writes:
>
> Hi all, I worked on patch which would not touch PSE bit for dirty page
> tracking and the result is not that good:
>
>  - 2level pages now always page dirty if page is swapped in and out, because
>there is no space left in PTE (other than PSE bit)

Maybe just don't support soft dirty for 2 level page tables?

2 level page tables should be really on the way out anyways, as they
have severe limits and do not support NX. With 3 levels there is enough
space.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Cyrill Gorcunov
On Wed, Aug 21, 2013 at 09:30:03AM -0700, Linus Torvalds wrote:
> Quite frankly, unless I see a patch later today that is
> 
>  (a) obvious
>  (b) explains what is going on
>  (c) tested
> 
> I will be reverting the whole soft-dirty mess. I thought the
> bit-mapping games it played were already too complicated (the patch to
> pgtable-2level.h in commit 41bb3476b361 just makes me want to barf and
> came in very late, so I'm not positive about the whole soft-dirty mess
> in the first place). I really am not at all inclined to want to play
> games in this area any more. It's too damn late in the release window.

Hi all, I worked on patch which would not touch PSE bit for dirty page
tracking and the result is not that good:

 - 2level pages now always page dirty if page is swapped in and out, because
   there is no space left in PTE (other than PSE bit)

 - only 3level pages scheme uses high 32bits to keep offset of swap entry,
   x86-64 shifts offset up to _PAGE_BIT_GLOBAL + 1 bit, thus I need some
   different bit nonunified with anything else for no reason :(

Summarizing all things

 - Using PSE bit for swap entries as indicator of soft dirty page is safe 
because
   swap entries as saved in pte as non-presen and when #pf happens kernel 
generates
   valid pte entry from vma->vm_page_prot

 - __swp_entry() helper is clearing PSE bit explicitly so even without softdirty
   patch it's not saved once page reach swap (with softdirty tracking we simply
   reuse this bit for own needs).

 - Using PSE bit allows to not modify swap encoding on all 3 page schemes 
(2level,
   3level, 4level) because it's a spare bit there not intersected with swap 
format.

Thus I would *_really_* like to save current scheme. Probably I should add 
comment
into header where _PAGE_SWP_SOFT_DIRTY defined that it's valid only when PRESENT
bit clear? Similar to

/* If _PAGE_BIT_PRESENT is clear, we use these: */
/* - if the user mapped it with PROT_NONE; pte_present gives true */
#define _PAGE_BIT_PROTNONE  _PAGE_BIT_GLOBAL
/* - set: nonlinear file mapping, saved PTE; unset:swap */
#define _PAGE_BIT_FILE  _PAGE_BIT_DIRTY

Have I conviced you guys?

The former problem report came from impression that this PSE bit may be touched
(set and clean) on present PTE, but it's not the case for pages being swapped.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Linus Torvalds
On Wed, Aug 21, 2013 at 12:03 PM, Cyrill Gorcunov  wrote:
>
> I personally don't see bug here because
>
>  - this swapped page soft dirty bit is set for non-present entries only,
>never for present ones, just at moment we form swap pte entry
>
>  - i don't find any code which would test for this bit directly without
>is_swap_pte call

Ok, having gone through the places that use swp_*soft_dirty(), I have
to agree. Afaik, it's only ever used on a swap-entry that has (by
definition) the P bit clear. So with or without Xen, I don't see how
it can make any difference.

David/Konrad - did you actually see any issues, or was this just from
(mis)reading the code?

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Pavel Emelyanov
On 08/21/2013 11:07 PM, Andy Lutomirski wrote:
> On Wed, Aug 21, 2013 at 12:03 PM, Cyrill Gorcunov  wrote:
>> On Wed, Aug 21, 2013 at 08:50:26PM +0200, H. Peter Anvin wrote:
>>> But is there a manifest bug or not?  What is the deal with Xen?
>>>
>>
>> I personally don't see bug here because
>>
>>  - this swapped page soft dirty bit is set for non-present entries only,
>>never for present ones, just at moment we form swap pte entry
>>
>>  - i don't find any code which would test for this bit directly without
>>is_swap_pte call
>>
>> but the use of paw bit itself is confusing, so I'm working on patch which
>> won't use it. Again, if someone knows where exactly access to pse bit when
>> pte keeps swap entry may happen (for any purpose other than dirty page
>> tracking) please share.
> 
> I doubt that there are cacheability issues here, since the swap type
> already overlaps with PWT and PCD.
> 
> What happens if the page being swapped is a THP page?  (I have no idea
> how, or even if, this works, but presumably PSE is important.)

Huge-page is splitted into small ones, then the smaller ones get swapped out.

> --Andy
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Cyrill Gorcunov
On Wed, Aug 21, 2013 at 12:07:13PM -0700, Andy Lutomirski wrote:
> On Wed, Aug 21, 2013 at 12:03 PM, Cyrill Gorcunov  wrote:
> > On Wed, Aug 21, 2013 at 08:50:26PM +0200, H. Peter Anvin wrote:
> >> But is there a manifest bug or not?  What is the deal with Xen?
> >>
> >
> > I personally don't see bug here because
> >
> >  - this swapped page soft dirty bit is set for non-present entries only,
> >never for present ones, just at moment we form swap pte entry
> >
> >  - i don't find any code which would test for this bit directly without
> >is_swap_pte call
> >
> > but the use of paw bit itself is confusing, so I'm working on patch which
> > won't use it. Again, if someone knows where exactly access to pse bit when
> > pte keeps swap entry may happen (for any purpose other than dirty page
> > tracking) please share.
> 
> I doubt that there are cacheability issues here, since the swap type
> already overlaps with PWT and PCD.
> 
> What happens if the page being swapped is a THP page?  (I have no idea
> how, or even if, this works, but presumably PSE is important.)

add_to_swap -> split_huge_page_to_list, so thp will be splitted.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Andy Lutomirski
On Wed, Aug 21, 2013 at 12:03 PM, Cyrill Gorcunov  wrote:
> On Wed, Aug 21, 2013 at 08:50:26PM +0200, H. Peter Anvin wrote:
>> But is there a manifest bug or not?  What is the deal with Xen?
>>
>
> I personally don't see bug here because
>
>  - this swapped page soft dirty bit is set for non-present entries only,
>never for present ones, just at moment we form swap pte entry
>
>  - i don't find any code which would test for this bit directly without
>is_swap_pte call
>
> but the use of paw bit itself is confusing, so I'm working on patch which
> won't use it. Again, if someone knows where exactly access to pse bit when
> pte keeps swap entry may happen (for any purpose other than dirty page
> tracking) please share.

I doubt that there are cacheability issues here, since the swap type
already overlaps with PWT and PCD.

What happens if the page being swapped is a THP page?  (I have no idea
how, or even if, this works, but presumably PSE is important.)

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Cyrill Gorcunov
On Wed, Aug 21, 2013 at 08:50:26PM +0200, H. Peter Anvin wrote:
> But is there a manifest bug or not?  What is the deal with Xen?
> 

I personally don't see bug here because

 - this swapped page soft dirty bit is set for non-present entries only,
   never for present ones, just at moment we form swap pte entry

 - i don't find any code which would test for this bit directly without
   is_swap_pte call

but the use of paw bit itself is confusing, so I'm working on patch which
won't use it. Again, if someone knows where exactly access to pse bit when
pte keeps swap entry may happen (for any purpose other than dirty page
tracking) please share.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread H. Peter Anvin
But is there a manifest bug or not?  What is the deal with Xen?

Cyrill Gorcunov  wrote:
>On Wed, Aug 21, 2013 at 09:25:47PM +0400, Cyrill Gorcunov wrote:
>> > 
>> > However, I do find the use of PTE bits in this way somewhat
>fragile.
>> > What other potential corner cases might still remain that will
>require
>> > further games with PTE bits?
>> 
>>   OK, so this is not a bug finally. The problem is that 2 level pte
>is
>> quite small and 7th bit is the only one spare I can use for soft
>dirty
>> tracking when page get swapped out. And swap engine is very depending
>> on pte being non-present, so we are on a safe side.
>
>To make it clear: I'm working on patch that won't use pse bit for dirty
>page tracking, just give some time to cook it and test.

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Cyrill Gorcunov
On Wed, Aug 21, 2013 at 09:25:47PM +0400, Cyrill Gorcunov wrote:
> > 
> > However, I do find the use of PTE bits in this way somewhat fragile.
> > What other potential corner cases might still remain that will require
> > further games with PTE bits?
> 
>   OK, so this is not a bug finally. The problem is that 2 level pte is
> quite small and 7th bit is the only one spare I can use for soft dirty
> tracking when page get swapped out. And swap engine is very depending
> on pte being non-present, so we are on a safe side.

To make it clear: I'm working on patch that won't use pse bit for dirty
page tracking, just give some time to cook it and test.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Andy Lutomirski
On Wed, Aug 21, 2013 at 6:48 AM, David Vrabel  wrote:
> All,
>
> 179ef71c (mm: save soft-dirty bits on swapped pages) introduces a new
> PTE bit on x86 _PTE_SWP_SOFT_DIRTY which has the same value as _PTE_PSE
> and _PTE_PAT.
>
> With a Xen PV guest, the use of the _PTE_PAT will result in the page
> having unexpected cachability which will introduce a range of subtle
> performance and correctness issues.  Xen programs the entry 4 in the PAT
> table with WC so a page that was previously WB will end up as WC.
>

Kind of off topic, but do you have a summary of how Xen uses the high
PAT bits?  I'm the one who wants WT, and if there's already precedent
for using the high PAT bits, it'll be helpful.

(Also, Xen is a little weird, and I'd rather not break it again by
accident like I did with the vsyscall changes a couple years ago.)

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Cyrill Gorcunov
On Wed, Aug 21, 2013 at 05:56:08PM +0100, David Vrabel wrote:
> > 
> > And I asked David to point me how it happens, because I don't
> > understand at which point pse bit get analized when page is
> > not present.
> 
> As Jan said, we're concerned that the bit was being used on present PTEs
> and not just non-present ones.  From a more careful look at this code
> this does not appear to be the case.
> 
> However, I do find the use of PTE bits in this way somewhat fragile.
> What other potential corner cases might still remain that will require
> further games with PTE bits?

  OK, so this is not a bug finally. The problem is that 2 level pte is
quite small and 7th bit is the only one spare I can use for soft dirty
tracking when page get swapped out. And swap engine is very depending
on pte being non-present, so we are on a safe side.

> FWIW, Xen uses a separate dirty log to track which pages have become
> dirty since the log was last cleared.  Such a dirty log seems more
> efficient than having scan all the PTEs looking for the soft dirty bits
> and then having to scan them all again to clear them (particularly if
> you need multiple passes because the task is still running and
> continuing to dirty pages).
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread David Vrabel
On 21/08/13 17:19, Cyrill Gorcunov wrote:
> On Wed, Aug 21, 2013 at 05:03:13PM +0100, Jan Beulich wrote:
>>>
>>> Only to non-present ptes, as far as I know.
>>
>> That's not really any guarantee. And the accessor functions also
>> don't check that they'd be used on non-present PTEs only.
> 
> Wait. This _PAGE_SWP_SOFT_DIRTY bit (which is in real PSE bit) assigned
> in only one place -- in try_to_unmap_one(). The PTE get non-present then
> and consists of swap entry format. I don't see any accessor to such entry
> without testing if it's swap entry or pte-none. What I'm missing?
> 
>>> orig_pte has pse bit set if page has been soft dirty
>>> when it reached swap.
>>
>> "when it reached swap" to me again implies that it could come
>> from a live page table, with the present bit set. So that
>> explanation attempt of yours confuses me more than it
>> clarifies things for me. (And referring to this bit as PSE bit is
> 
> When page swapped out it become non-present in pte entry.
> 
>> sort of wrong here too - there's no PSE bit for 4k PTEs, that
>> bit is the PAT one, and that's what the whole discussion
>> started from.)
> 
> And I asked David to point me how it happens, because I don't
> understand at which point pse bit get analized when page is
> not present.

As Jan said, we're concerned that the bit was being used on present PTEs
and not just non-present ones.  From a more careful look at this code
this does not appear to be the case.

However, I do find the use of PTE bits in this way somewhat fragile.
What other potential corner cases might still remain that will require
further games with PTE bits?

FWIW, Xen uses a separate dirty log to track which pages have become
dirty since the log was last cleared.  Such a dirty log seems more
efficient than having scan all the PTEs looking for the soft dirty bits
and then having to scan them all again to clear them (particularly if
you need multiple passes because the task is still running and
continuing to dirty pages).

David
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Cyrill Gorcunov
On Wed, Aug 21, 2013 at 09:30:03AM -0700, Linus Torvalds wrote:
> On Wed, Aug 21, 2013 at 7:19 AM, Cyrill Gorcunov  wrote:
> > On Wed, Aug 21, 2013 at 04:11:25PM +0200, H. Peter Anvin wrote:
> >>
> >> Eep.  This should be reverted, indeed.  This isn't a manifest bug on !Xen 
> >> but we
> >> have gotten requests for WT support which would mean adding in the PAT but 
> >> again.
> >
> > Please no, letme fix it. That's what I'm having in mind: don't use pse bit 
> > in swap
> > entry but always mark page read from swap as dirty one (it's better than 
> > having
> > no tracker at all and  will be the case on old machines with 32bit ptes 
> > only).
> 
> Quite frankly, unless I see a patch later today that is

I'll do my best to fix it, but I've not yet been shown where the problem
happens. Neither I'm convinced if it is here at all. I'm asking David for 
details.

> 
>  (a) obvious
>  (b) explains what is going on
>  (c) tested
> 
> I will be reverting the whole soft-dirty mess. I thought the
> bit-mapping games it played were already too complicated (the patch to
> pgtable-2level.h in commit 41bb3476b361 just makes me want to barf and
> came in very late, so I'm not positive about the whole soft-dirty mess

This was not my invention on such encoding, I simply had no choise but
follow existing scheme. Still there are cleanup patches in -mm tree which
make code a way more readable (even former one, without soft-dirty bit).

> in the first place). I really am not at all inclined to want to play
> games in this area any more. It's too damn late in the release window.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Linus Torvalds
On Wed, Aug 21, 2013 at 7:19 AM, Cyrill Gorcunov  wrote:
> On Wed, Aug 21, 2013 at 04:11:25PM +0200, H. Peter Anvin wrote:
>>
>> Eep.  This should be reverted, indeed.  This isn't a manifest bug on !Xen 
>> but we
>> have gotten requests for WT support which would mean adding in the PAT but 
>> again.
>
> Please no, letme fix it. That's what I'm having in mind: don't use pse bit in 
> swap
> entry but always mark page read from swap as dirty one (it's better than 
> having
> no tracker at all and  will be the case on old machines with 32bit ptes only).

Quite frankly, unless I see a patch later today that is

 (a) obvious
 (b) explains what is going on
 (c) tested

I will be reverting the whole soft-dirty mess. I thought the
bit-mapping games it played were already too complicated (the patch to
pgtable-2level.h in commit 41bb3476b361 just makes me want to barf and
came in very late, so I'm not positive about the whole soft-dirty mess
in the first place). I really am not at all inclined to want to play
games in this area any more. It's too damn late in the release window.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Cyrill Gorcunov
On Wed, Aug 21, 2013 at 05:03:13PM +0100, Jan Beulich wrote:
> > 
> > Only to non-present ptes, as far as I know.
> 
> That's not really any guarantee. And the accessor functions also
> don't check that they'd be used on non-present PTEs only.

Wait. This _PAGE_SWP_SOFT_DIRTY bit (which is in real PSE bit) assigned
in only one place -- in try_to_unmap_one(). The PTE get non-present then
and consists of swap entry format. I don't see any accessor to such entry
without testing if it's swap entry or pte-none. What I'm missing?

> > orig_pte has pse bit set if page has been soft dirty
> > when it reached swap.
> 
> "when it reached swap" to me again implies that it could come
> from a live page table, with the present bit set. So that
> explanation attempt of yours confuses me more than it
> clarifies things for me. (And referring to this bit as PSE bit is

When page swapped out it become non-present in pte entry.

> sort of wrong here too - there's no PSE bit for 4k PTEs, that
> bit is the PAT one, and that's what the whole discussion
> started from.)

And I asked David to point me how it happens, because I don't
understand at which point pse bit get analized when page is
not present.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Jan Beulich
>>> On 21.08.13 at 17:42, Cyrill Gorcunov  wrote:
> On Wed, Aug 21, 2013 at 03:53:36PM +0100, Jan Beulich wrote:
>> >>> On 21.08.13 at 16:12, Cyrill Gorcunov  wrote:
>> > David, could you please explain, Xen keeps and analyze _PTE_PAT bit
>> > for ptes which are not present?
>> 
>> No, the problem isn't with not-present PTEs (i.e. swap entries),
>> but with present ones - the same bit (7) is being used for both,
>> according to this comment:
>> 
>> /*
>>  * Tracking soft dirty bit when a page goes to a swap is tricky.
>>  * We need a bit which can be stored in pte _and_ not conflict
>>  * with swap entry format. On x86 bits 6 and 7 are *not* involved
>>  * into swap entry computation, but bit 6 is used for nonlinear
>>  * file mapping, so we borrow bit 7 for soft dirty tracking.
>>  */
>> 
>> Or are you telling me that the comment is misleading (at least me),
>> and this applies only to not-present PTEs? And even then - where
>> would the value of the original PAT bit be stored while swapped
>> out (or is it impossible - now and forever - for WC pages to get
>> swapped)?
> 
> Only to non-present ptes, as far as I know.

That's not really any guarantee. And the accessor functions also
don't check that they'd be used on non-present PTEs only.

> do_swap_page
>   ...
>   pte = mk_pte(page, vma->vm_page_prot);
> 
>   /* new pte from vm_page_prot generated */
>   ...
>   set_pte_at(mm, address, page_table, pte);
>   /* and assigned to old place */
> 
> with soft dirty in swap it is somehow more weirdy
> 
>   pte = mk_pte(page, vma->vm_page_prot);
>   ...
>   if (pte_swp_soft_dirty(orig_pte))
>   pte = pte_mksoft_dirty(pte);
>   set_pte_at(mm, address, page_table, pte);
> 
> orig_pte has pse bit set if page has been soft dirty
> when it reached swap.

"when it reached swap" to me again implies that it could come
from a live page table, with the present bit set. So that
explanation attempt of yours confuses me more than it
clarifies things for me. (And referring to this bit as PSE bit is
sort of wrong here too - there's no PSE bit for 4k PTEs, that
bit is the PAT one, and that's what the whole discussion
started from.)

Jan

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Cyrill Gorcunov
On Wed, Aug 21, 2013 at 03:53:36PM +0100, Jan Beulich wrote:
> >>> On 21.08.13 at 16:12, Cyrill Gorcunov  wrote:
> > On Wed, Aug 21, 2013 at 02:48:20PM +0100, David Vrabel wrote:
> >> All,
> >> 
> >> 179ef71c (mm: save soft-dirty bits on swapped pages) introduces a new
> >> PTE bit on x86 _PTE_SWP_SOFT_DIRTY which has the same value as _PTE_PSE
> >> and _PTE_PAT.
> >> 
> >> With a Xen PV guest, the use of the _PTE_PAT will result in the page
> >> having unexpected cachability which will introduce a range of subtle
> >> performance and correctness issues.  Xen programs the entry 4 in the PAT
> >> table with WC so a page that was previously WB will end up as WC.
> >> 
> > 
> > David, could you please explain, Xen keeps and analyze _PTE_PAT bit
> > for ptes which are not present?
> 
> No, the problem isn't with not-present PTEs (i.e. swap entries),
> but with present ones - the same bit (7) is being used for both,
> according to this comment:
> 
> /*
>  * Tracking soft dirty bit when a page goes to a swap is tricky.
>  * We need a bit which can be stored in pte _and_ not conflict
>  * with swap entry format. On x86 bits 6 and 7 are *not* involved
>  * into swap entry computation, but bit 6 is used for nonlinear
>  * file mapping, so we borrow bit 7 for soft dirty tracking.
>  */
> 
> Or are you telling me that the comment is misleading (at least me),
> and this applies only to not-present PTEs? And even then - where
> would the value of the original PAT bit be stored while swapped
> out (or is it impossible - now and forever - for WC pages to get
> swapped)?

Only to non-present ptes, as far as I know.

do_swap_page
...
pte = mk_pte(page, vma->vm_page_prot);

/* new pte from vm_page_prot generated */
...
set_pte_at(mm, address, page_table, pte);
/* and assigned to old place */

with soft dirty in swap it is somehow more weirdy

pte = mk_pte(page, vma->vm_page_prot);
...
if (pte_swp_soft_dirty(orig_pte))
pte = pte_mksoft_dirty(pte);
set_pte_at(mm, address, page_table, pte);

orig_pte has pse bit set if page has been soft dirty
when it reached swap.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread H. Peter Anvin
Only WB pages should be swappable, but even so, the cacheability should be in 
the vma.

Jan Beulich  wrote:
 On 21.08.13 at 16:12, Cyrill Gorcunov  wrote:
>> On Wed, Aug 21, 2013 at 02:48:20PM +0100, David Vrabel wrote:
>>> All,
>>> 
>>> 179ef71c (mm: save soft-dirty bits on swapped pages) introduces a
>new
>>> PTE bit on x86 _PTE_SWP_SOFT_DIRTY which has the same value as
>_PTE_PSE
>>> and _PTE_PAT.
>>> 
>>> With a Xen PV guest, the use of the _PTE_PAT will result in the page
>>> having unexpected cachability which will introduce a range of subtle
>>> performance and correctness issues.  Xen programs the entry 4 in the
>PAT
>>> table with WC so a page that was previously WB will end up as WC.
>>> 
>> 
>> David, could you please explain, Xen keeps and analyze _PTE_PAT bit
>> for ptes which are not present?
>
>No, the problem isn't with not-present PTEs (i.e. swap entries),
>but with present ones - the same bit (7) is being used for both,
>according to this comment:
>
>/*
> * Tracking soft dirty bit when a page goes to a swap is tricky.
> * We need a bit which can be stored in pte _and_ not conflict
> * with swap entry format. On x86 bits 6 and 7 are *not* involved
> * into swap entry computation, but bit 6 is used for nonlinear
> * file mapping, so we borrow bit 7 for soft dirty tracking.
> */
>
>Or are you telling me that the comment is misleading (at least me),
>and this applies only to not-present PTEs? And even then - where
>would the value of the original PAT bit be stored while swapped
>out (or is it impossible - now and forever - for WC pages to get
>swapped)?
>
>Jan

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Jan Beulich
>>> On 21.08.13 at 16:12, Cyrill Gorcunov  wrote:
> On Wed, Aug 21, 2013 at 02:48:20PM +0100, David Vrabel wrote:
>> All,
>> 
>> 179ef71c (mm: save soft-dirty bits on swapped pages) introduces a new
>> PTE bit on x86 _PTE_SWP_SOFT_DIRTY which has the same value as _PTE_PSE
>> and _PTE_PAT.
>> 
>> With a Xen PV guest, the use of the _PTE_PAT will result in the page
>> having unexpected cachability which will introduce a range of subtle
>> performance and correctness issues.  Xen programs the entry 4 in the PAT
>> table with WC so a page that was previously WB will end up as WC.
>> 
> 
> David, could you please explain, Xen keeps and analyze _PTE_PAT bit
> for ptes which are not present?

No, the problem isn't with not-present PTEs (i.e. swap entries),
but with present ones - the same bit (7) is being used for both,
according to this comment:

/*
 * Tracking soft dirty bit when a page goes to a swap is tricky.
 * We need a bit which can be stored in pte _and_ not conflict
 * with swap entry format. On x86 bits 6 and 7 are *not* involved
 * into swap entry computation, but bit 6 is used for nonlinear
 * file mapping, so we borrow bit 7 for soft dirty tracking.
 */

Or are you telling me that the comment is misleading (at least me),
and this applies only to not-present PTEs? And even then - where
would the value of the original PAT bit be stored while swapped
out (or is it impossible - now and forever - for WC pages to get
swapped)?

Jan

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Cyrill Gorcunov
On Wed, Aug 21, 2013 at 04:22:21PM +0200, H. Peter Anvin wrote:
> OK now I'm confused.  I guess I shouldn't comment while on vacation
> and cache cold on everything.

I rather think I'm missing something, that's why I asked David how
this featue affects non present pte.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread H. Peter Anvin
Good question...

Cyrill Gorcunov  wrote:
>On Wed, Aug 21, 2013 at 02:48:20PM +0100, David Vrabel wrote:
>> All,
>> 
>> 179ef71c (mm: save soft-dirty bits on swapped pages) introduces a new
>> PTE bit on x86 _PTE_SWP_SOFT_DIRTY which has the same value as
>_PTE_PSE
>> and _PTE_PAT.
>> 
>> With a Xen PV guest, the use of the _PTE_PAT will result in the page
>> having unexpected cachability which will introduce a range of subtle
>> performance and correctness issues.  Xen programs the entry 4 in the
>PAT
>> table with WC so a page that was previously WB will end up as WC.
>> 
>
>David, could you please explain, Xen keeps and analyze _PTE_PAT bit
>for ptes which are not present?

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread H. Peter Anvin
OK now I'm confused.  I guess I shouldn't comment while on vacation and cache 
cold on everything.

Cyrill Gorcunov  wrote:
>On Wed, Aug 21, 2013 at 04:11:25PM +0200, H. Peter Anvin wrote:
>>
>> Eep.  This should be reverted, indeed.  This isn't a manifest bug on
>!Xen but we
>> have gotten requests for WT support which would mean adding in the
>PAT but again.
>
>Please no, letme fix it. That's what I'm having in mind: don't use pse
>bit in swap
>entry but always mark page read from swap as dirty one (it's better
>than having
>no tracker at all and  will be the case on old machines with 32bit ptes
>only).

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Cyrill Gorcunov
On Wed, Aug 21, 2013 at 04:11:25PM +0200, H. Peter Anvin wrote:
>
> Eep.  This should be reverted, indeed.  This isn't a manifest bug on !Xen but 
> we
> have gotten requests for WT support which would mean adding in the PAT but 
> again.

Please no, letme fix it. That's what I'm having in mind: don't use pse bit in 
swap
entry but always mark page read from swap as dirty one (it's better than having
no tracker at all and  will be the case on old machines with 32bit ptes only).
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread H. Peter Anvin
Eep.  This should be reverted, indeed.  This isn't a manifest bug on !Xen but 
we have gotten requests for WT support which would mean adding in the PAT but 
again.

David Vrabel  wrote:
>All,
>
>179ef71c (mm: save soft-dirty bits on swapped pages) introduces a new
>PTE bit on x86 _PTE_SWP_SOFT_DIRTY which has the same value as _PTE_PSE
>and _PTE_PAT.
>
>With a Xen PV guest, the use of the _PTE_PAT will result in the page
>having unexpected cachability which will introduce a range of subtle
>performance and correctness issues.  Xen programs the entry 4 in the
>PAT
>table with WC so a page that was previously WB will end up as WC.
>
>The use of this bit also appears to preclude the use of (transparent)
>huge pages by the application.  It is not clear if there is something
>else guaranteeing that that there will be no huge pages.
>
>To fix this regression I suggest one or more of:
>
>1. If no other changes are made, at a mimimum, MEM_SOFT_DIRTY must
>require !XEN and possibly !TRANSPARENT_HUGEPAGE and !HUGETLBFS.  This
>would prevent this option being enabled on the majority of standard
>Linux distributions.
>
>2. Find a different PTE bit to (re)use.
>
>3. Avoid clearing the soft dirty bit when repopulating a swapped out
>page.
>
>4. Redesign the soft dirty tracking to not require the use of
>architecture specific PTE bits.  e.g., by using a shadow set of
>structures for the soft dirty bit tracking.
>
>David

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Cyrill Gorcunov
On Wed, Aug 21, 2013 at 02:48:20PM +0100, David Vrabel wrote:
> All,
> 
> 179ef71c (mm: save soft-dirty bits on swapped pages) introduces a new
> PTE bit on x86 _PTE_SWP_SOFT_DIRTY which has the same value as _PTE_PSE
> and _PTE_PAT.
> 
> With a Xen PV guest, the use of the _PTE_PAT will result in the page
> having unexpected cachability which will introduce a range of subtle
> performance and correctness issues.  Xen programs the entry 4 in the PAT
> table with WC so a page that was previously WB will end up as WC.
> 

David, could you please explain, Xen keeps and analyze _PTE_PAT bit
for ptes which are not present?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread konrad wilk


On 8/21/2013 9:48 AM, David Vrabel wrote:

All,

179ef71c (mm: save soft-dirty bits on swapped pages) introduces a new
PTE bit on x86 _PTE_SWP_SOFT_DIRTY which has the same value as _PTE_PSE
and _PTE_PAT.

With a Xen PV guest, the use of the _PTE_PAT will result in the page
having unexpected cachability which will introduce a range of subtle
performance and correctness issues.  Xen programs the entry 4 in the PAT
table with WC so a page that was previously WB will end up as WC.
Especially with filesystems which would end up using those pages (as the 
memory allocator
would recycle them) and with corruption in the filesystem. Took months 
to figure

that out.



The use of this bit also appears to preclude the use of (transparent)
huge pages by the application.  It is not clear if there is something
else guaranteeing that that there will be no huge pages.

To fix this regression I suggest one or more of:

1. If no other changes are made, at a mimimum, MEM_SOFT_DIRTY must
require !XEN and possibly !TRANSPARENT_HUGEPAGE and !HUGETLBFS.  This
would prevent this option being enabled on the majority of standard
Linux distributions.

2. Find a different PTE bit to (re)use.

3. Avoid clearing the soft dirty bit when repopulating a swapped out page.

4. Redesign the soft dirty tracking to not require the use of
architecture specific PTE bits.  e.g., by using a shadow set of
structures for the soft dirty bit tracking.


Or revert this patch and in 3.12 fix it using one of the options above 
or other ones.


David


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread David Vrabel
All,

179ef71c (mm: save soft-dirty bits on swapped pages) introduces a new
PTE bit on x86 _PTE_SWP_SOFT_DIRTY which has the same value as _PTE_PSE
and _PTE_PAT.

With a Xen PV guest, the use of the _PTE_PAT will result in the page
having unexpected cachability which will introduce a range of subtle
performance and correctness issues.  Xen programs the entry 4 in the PAT
table with WC so a page that was previously WB will end up as WC.

The use of this bit also appears to preclude the use of (transparent)
huge pages by the application.  It is not clear if there is something
else guaranteeing that that there will be no huge pages.

To fix this regression I suggest one or more of:

1. If no other changes are made, at a mimimum, MEM_SOFT_DIRTY must
require !XEN and possibly !TRANSPARENT_HUGEPAGE and !HUGETLBFS.  This
would prevent this option being enabled on the majority of standard
Linux distributions.

2. Find a different PTE bit to (re)use.

3. Avoid clearing the soft dirty bit when repopulating a swapped out page.

4. Redesign the soft dirty tracking to not require the use of
architecture specific PTE bits.  e.g., by using a shadow set of
structures for the soft dirty bit tracking.

David
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread David Vrabel
All,

179ef71c (mm: save soft-dirty bits on swapped pages) introduces a new
PTE bit on x86 _PTE_SWP_SOFT_DIRTY which has the same value as _PTE_PSE
and _PTE_PAT.

With a Xen PV guest, the use of the _PTE_PAT will result in the page
having unexpected cachability which will introduce a range of subtle
performance and correctness issues.  Xen programs the entry 4 in the PAT
table with WC so a page that was previously WB will end up as WC.

The use of this bit also appears to preclude the use of (transparent)
huge pages by the application.  It is not clear if there is something
else guaranteeing that that there will be no huge pages.

To fix this regression I suggest one or more of:

1. If no other changes are made, at a mimimum, MEM_SOFT_DIRTY must
require !XEN and possibly !TRANSPARENT_HUGEPAGE and !HUGETLBFS.  This
would prevent this option being enabled on the majority of standard
Linux distributions.

2. Find a different PTE bit to (re)use.

3. Avoid clearing the soft dirty bit when repopulating a swapped out page.

4. Redesign the soft dirty tracking to not require the use of
architecture specific PTE bits.  e.g., by using a shadow set of
structures for the soft dirty bit tracking.

David
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread konrad wilk


On 8/21/2013 9:48 AM, David Vrabel wrote:

All,

179ef71c (mm: save soft-dirty bits on swapped pages) introduces a new
PTE bit on x86 _PTE_SWP_SOFT_DIRTY which has the same value as _PTE_PSE
and _PTE_PAT.

With a Xen PV guest, the use of the _PTE_PAT will result in the page
having unexpected cachability which will introduce a range of subtle
performance and correctness issues.  Xen programs the entry 4 in the PAT
table with WC so a page that was previously WB will end up as WC.
Especially with filesystems which would end up using those pages (as the 
memory allocator
would recycle them) and with corruption in the filesystem. Took months 
to figure

that out.



The use of this bit also appears to preclude the use of (transparent)
huge pages by the application.  It is not clear if there is something
else guaranteeing that that there will be no huge pages.

To fix this regression I suggest one or more of:

1. If no other changes are made, at a mimimum, MEM_SOFT_DIRTY must
require !XEN and possibly !TRANSPARENT_HUGEPAGE and !HUGETLBFS.  This
would prevent this option being enabled on the majority of standard
Linux distributions.

2. Find a different PTE bit to (re)use.

3. Avoid clearing the soft dirty bit when repopulating a swapped out page.

4. Redesign the soft dirty tracking to not require the use of
architecture specific PTE bits.  e.g., by using a shadow set of
structures for the soft dirty bit tracking.


Or revert this patch and in 3.12 fix it using one of the options above 
or other ones.


David


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Cyrill Gorcunov
On Wed, Aug 21, 2013 at 02:48:20PM +0100, David Vrabel wrote:
 All,
 
 179ef71c (mm: save soft-dirty bits on swapped pages) introduces a new
 PTE bit on x86 _PTE_SWP_SOFT_DIRTY which has the same value as _PTE_PSE
 and _PTE_PAT.
 
 With a Xen PV guest, the use of the _PTE_PAT will result in the page
 having unexpected cachability which will introduce a range of subtle
 performance and correctness issues.  Xen programs the entry 4 in the PAT
 table with WC so a page that was previously WB will end up as WC.
 

David, could you please explain, Xen keeps and analyze _PTE_PAT bit
for ptes which are not present?
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread H. Peter Anvin
Eep.  This should be reverted, indeed.  This isn't a manifest bug on !Xen but 
we have gotten requests for WT support which would mean adding in the PAT but 
again.

David Vrabel david.vra...@citrix.com wrote:
All,

179ef71c (mm: save soft-dirty bits on swapped pages) introduces a new
PTE bit on x86 _PTE_SWP_SOFT_DIRTY which has the same value as _PTE_PSE
and _PTE_PAT.

With a Xen PV guest, the use of the _PTE_PAT will result in the page
having unexpected cachability which will introduce a range of subtle
performance and correctness issues.  Xen programs the entry 4 in the
PAT
table with WC so a page that was previously WB will end up as WC.

The use of this bit also appears to preclude the use of (transparent)
huge pages by the application.  It is not clear if there is something
else guaranteeing that that there will be no huge pages.

To fix this regression I suggest one or more of:

1. If no other changes are made, at a mimimum, MEM_SOFT_DIRTY must
require !XEN and possibly !TRANSPARENT_HUGEPAGE and !HUGETLBFS.  This
would prevent this option being enabled on the majority of standard
Linux distributions.

2. Find a different PTE bit to (re)use.

3. Avoid clearing the soft dirty bit when repopulating a swapped out
page.

4. Redesign the soft dirty tracking to not require the use of
architecture specific PTE bits.  e.g., by using a shadow set of
structures for the soft dirty bit tracking.

David

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Cyrill Gorcunov
On Wed, Aug 21, 2013 at 04:11:25PM +0200, H. Peter Anvin wrote:

 Eep.  This should be reverted, indeed.  This isn't a manifest bug on !Xen but 
 we
 have gotten requests for WT support which would mean adding in the PAT but 
 again.

Please no, letme fix it. That's what I'm having in mind: don't use pse bit in 
swap
entry but always mark page read from swap as dirty one (it's better than having
no tracker at all and  will be the case on old machines with 32bit ptes only).
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread H. Peter Anvin
OK now I'm confused.  I guess I shouldn't comment while on vacation and cache 
cold on everything.

Cyrill Gorcunov gorcu...@gmail.com wrote:
On Wed, Aug 21, 2013 at 04:11:25PM +0200, H. Peter Anvin wrote:

 Eep.  This should be reverted, indeed.  This isn't a manifest bug on
!Xen but we
 have gotten requests for WT support which would mean adding in the
PAT but again.

Please no, letme fix it. That's what I'm having in mind: don't use pse
bit in swap
entry but always mark page read from swap as dirty one (it's better
than having
no tracker at all and  will be the case on old machines with 32bit ptes
only).

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread H. Peter Anvin
Good question...

Cyrill Gorcunov gorcu...@gmail.com wrote:
On Wed, Aug 21, 2013 at 02:48:20PM +0100, David Vrabel wrote:
 All,
 
 179ef71c (mm: save soft-dirty bits on swapped pages) introduces a new
 PTE bit on x86 _PTE_SWP_SOFT_DIRTY which has the same value as
_PTE_PSE
 and _PTE_PAT.
 
 With a Xen PV guest, the use of the _PTE_PAT will result in the page
 having unexpected cachability which will introduce a range of subtle
 performance and correctness issues.  Xen programs the entry 4 in the
PAT
 table with WC so a page that was previously WB will end up as WC.
 

David, could you please explain, Xen keeps and analyze _PTE_PAT bit
for ptes which are not present?

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Cyrill Gorcunov
On Wed, Aug 21, 2013 at 04:22:21PM +0200, H. Peter Anvin wrote:
 OK now I'm confused.  I guess I shouldn't comment while on vacation
 and cache cold on everything.

I rather think I'm missing something, that's why I asked David how
this featue affects non present pte.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Jan Beulich
 On 21.08.13 at 16:12, Cyrill Gorcunov gorcu...@gmail.com wrote:
 On Wed, Aug 21, 2013 at 02:48:20PM +0100, David Vrabel wrote:
 All,
 
 179ef71c (mm: save soft-dirty bits on swapped pages) introduces a new
 PTE bit on x86 _PTE_SWP_SOFT_DIRTY which has the same value as _PTE_PSE
 and _PTE_PAT.
 
 With a Xen PV guest, the use of the _PTE_PAT will result in the page
 having unexpected cachability which will introduce a range of subtle
 performance and correctness issues.  Xen programs the entry 4 in the PAT
 table with WC so a page that was previously WB will end up as WC.
 
 
 David, could you please explain, Xen keeps and analyze _PTE_PAT bit
 for ptes which are not present?

No, the problem isn't with not-present PTEs (i.e. swap entries),
but with present ones - the same bit (7) is being used for both,
according to this comment:

/*
 * Tracking soft dirty bit when a page goes to a swap is tricky.
 * We need a bit which can be stored in pte _and_ not conflict
 * with swap entry format. On x86 bits 6 and 7 are *not* involved
 * into swap entry computation, but bit 6 is used for nonlinear
 * file mapping, so we borrow bit 7 for soft dirty tracking.
 */

Or are you telling me that the comment is misleading (at least me),
and this applies only to not-present PTEs? And even then - where
would the value of the original PAT bit be stored while swapped
out (or is it impossible - now and forever - for WC pages to get
swapped)?

Jan

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread H. Peter Anvin
Only WB pages should be swappable, but even so, the cacheability should be in 
the vma.

Jan Beulich jbeul...@suse.com wrote:
 On 21.08.13 at 16:12, Cyrill Gorcunov gorcu...@gmail.com wrote:
 On Wed, Aug 21, 2013 at 02:48:20PM +0100, David Vrabel wrote:
 All,
 
 179ef71c (mm: save soft-dirty bits on swapped pages) introduces a
new
 PTE bit on x86 _PTE_SWP_SOFT_DIRTY which has the same value as
_PTE_PSE
 and _PTE_PAT.
 
 With a Xen PV guest, the use of the _PTE_PAT will result in the page
 having unexpected cachability which will introduce a range of subtle
 performance and correctness issues.  Xen programs the entry 4 in the
PAT
 table with WC so a page that was previously WB will end up as WC.
 
 
 David, could you please explain, Xen keeps and analyze _PTE_PAT bit
 for ptes which are not present?

No, the problem isn't with not-present PTEs (i.e. swap entries),
but with present ones - the same bit (7) is being used for both,
according to this comment:

/*
 * Tracking soft dirty bit when a page goes to a swap is tricky.
 * We need a bit which can be stored in pte _and_ not conflict
 * with swap entry format. On x86 bits 6 and 7 are *not* involved
 * into swap entry computation, but bit 6 is used for nonlinear
 * file mapping, so we borrow bit 7 for soft dirty tracking.
 */

Or are you telling me that the comment is misleading (at least me),
and this applies only to not-present PTEs? And even then - where
would the value of the original PAT bit be stored while swapped
out (or is it impossible - now and forever - for WC pages to get
swapped)?

Jan

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Cyrill Gorcunov
On Wed, Aug 21, 2013 at 03:53:36PM +0100, Jan Beulich wrote:
  On 21.08.13 at 16:12, Cyrill Gorcunov gorcu...@gmail.com wrote:
  On Wed, Aug 21, 2013 at 02:48:20PM +0100, David Vrabel wrote:
  All,
  
  179ef71c (mm: save soft-dirty bits on swapped pages) introduces a new
  PTE bit on x86 _PTE_SWP_SOFT_DIRTY which has the same value as _PTE_PSE
  and _PTE_PAT.
  
  With a Xen PV guest, the use of the _PTE_PAT will result in the page
  having unexpected cachability which will introduce a range of subtle
  performance and correctness issues.  Xen programs the entry 4 in the PAT
  table with WC so a page that was previously WB will end up as WC.
  
  
  David, could you please explain, Xen keeps and analyze _PTE_PAT bit
  for ptes which are not present?
 
 No, the problem isn't with not-present PTEs (i.e. swap entries),
 but with present ones - the same bit (7) is being used for both,
 according to this comment:
 
 /*
  * Tracking soft dirty bit when a page goes to a swap is tricky.
  * We need a bit which can be stored in pte _and_ not conflict
  * with swap entry format. On x86 bits 6 and 7 are *not* involved
  * into swap entry computation, but bit 6 is used for nonlinear
  * file mapping, so we borrow bit 7 for soft dirty tracking.
  */
 
 Or are you telling me that the comment is misleading (at least me),
 and this applies only to not-present PTEs? And even then - where
 would the value of the original PAT bit be stored while swapped
 out (or is it impossible - now and forever - for WC pages to get
 swapped)?

Only to non-present ptes, as far as I know.

do_swap_page
...
pte = mk_pte(page, vma-vm_page_prot);

/* new pte from vm_page_prot generated */
...
set_pte_at(mm, address, page_table, pte);
/* and assigned to old place */

with soft dirty in swap it is somehow more weirdy

pte = mk_pte(page, vma-vm_page_prot);
...
if (pte_swp_soft_dirty(orig_pte))
pte = pte_mksoft_dirty(pte);
set_pte_at(mm, address, page_table, pte);

orig_pte has pse bit set if page has been soft dirty
when it reached swap.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Jan Beulich
 On 21.08.13 at 17:42, Cyrill Gorcunov gorcu...@gmail.com wrote:
 On Wed, Aug 21, 2013 at 03:53:36PM +0100, Jan Beulich wrote:
  On 21.08.13 at 16:12, Cyrill Gorcunov gorcu...@gmail.com wrote:
  David, could you please explain, Xen keeps and analyze _PTE_PAT bit
  for ptes which are not present?
 
 No, the problem isn't with not-present PTEs (i.e. swap entries),
 but with present ones - the same bit (7) is being used for both,
 according to this comment:
 
 /*
  * Tracking soft dirty bit when a page goes to a swap is tricky.
  * We need a bit which can be stored in pte _and_ not conflict
  * with swap entry format. On x86 bits 6 and 7 are *not* involved
  * into swap entry computation, but bit 6 is used for nonlinear
  * file mapping, so we borrow bit 7 for soft dirty tracking.
  */
 
 Or are you telling me that the comment is misleading (at least me),
 and this applies only to not-present PTEs? And even then - where
 would the value of the original PAT bit be stored while swapped
 out (or is it impossible - now and forever - for WC pages to get
 swapped)?
 
 Only to non-present ptes, as far as I know.

That's not really any guarantee. And the accessor functions also
don't check that they'd be used on non-present PTEs only.

 do_swap_page
   ...
   pte = mk_pte(page, vma-vm_page_prot);
 
   /* new pte from vm_page_prot generated */
   ...
   set_pte_at(mm, address, page_table, pte);
   /* and assigned to old place */
 
 with soft dirty in swap it is somehow more weirdy
 
   pte = mk_pte(page, vma-vm_page_prot);
   ...
   if (pte_swp_soft_dirty(orig_pte))
   pte = pte_mksoft_dirty(pte);
   set_pte_at(mm, address, page_table, pte);
 
 orig_pte has pse bit set if page has been soft dirty
 when it reached swap.

when it reached swap to me again implies that it could come
from a live page table, with the present bit set. So that
explanation attempt of yours confuses me more than it
clarifies things for me. (And referring to this bit as PSE bit is
sort of wrong here too - there's no PSE bit for 4k PTEs, that
bit is the PAT one, and that's what the whole discussion
started from.)

Jan

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Cyrill Gorcunov
On Wed, Aug 21, 2013 at 05:03:13PM +0100, Jan Beulich wrote:
  
  Only to non-present ptes, as far as I know.
 
 That's not really any guarantee. And the accessor functions also
 don't check that they'd be used on non-present PTEs only.

Wait. This _PAGE_SWP_SOFT_DIRTY bit (which is in real PSE bit) assigned
in only one place -- in try_to_unmap_one(). The PTE get non-present then
and consists of swap entry format. I don't see any accessor to such entry
without testing if it's swap entry or pte-none. What I'm missing?

  orig_pte has pse bit set if page has been soft dirty
  when it reached swap.
 
 when it reached swap to me again implies that it could come
 from a live page table, with the present bit set. So that
 explanation attempt of yours confuses me more than it
 clarifies things for me. (And referring to this bit as PSE bit is

When page swapped out it become non-present in pte entry.

 sort of wrong here too - there's no PSE bit for 4k PTEs, that
 bit is the PAT one, and that's what the whole discussion
 started from.)

And I asked David to point me how it happens, because I don't
understand at which point pse bit get analized when page is
not present.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Linus Torvalds
On Wed, Aug 21, 2013 at 7:19 AM, Cyrill Gorcunov gorcu...@gmail.com wrote:
 On Wed, Aug 21, 2013 at 04:11:25PM +0200, H. Peter Anvin wrote:

 Eep.  This should be reverted, indeed.  This isn't a manifest bug on !Xen 
 but we
 have gotten requests for WT support which would mean adding in the PAT but 
 again.

 Please no, letme fix it. That's what I'm having in mind: don't use pse bit in 
 swap
 entry but always mark page read from swap as dirty one (it's better than 
 having
 no tracker at all and  will be the case on old machines with 32bit ptes only).

Quite frankly, unless I see a patch later today that is

 (a) obvious
 (b) explains what is going on
 (c) tested

I will be reverting the whole soft-dirty mess. I thought the
bit-mapping games it played were already too complicated (the patch to
pgtable-2level.h in commit 41bb3476b361 just makes me want to barf and
came in very late, so I'm not positive about the whole soft-dirty mess
in the first place). I really am not at all inclined to want to play
games in this area any more. It's too damn late in the release window.

Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Cyrill Gorcunov
On Wed, Aug 21, 2013 at 09:30:03AM -0700, Linus Torvalds wrote:
 On Wed, Aug 21, 2013 at 7:19 AM, Cyrill Gorcunov gorcu...@gmail.com wrote:
  On Wed, Aug 21, 2013 at 04:11:25PM +0200, H. Peter Anvin wrote:
 
  Eep.  This should be reverted, indeed.  This isn't a manifest bug on !Xen 
  but we
  have gotten requests for WT support which would mean adding in the PAT but 
  again.
 
  Please no, letme fix it. That's what I'm having in mind: don't use pse bit 
  in swap
  entry but always mark page read from swap as dirty one (it's better than 
  having
  no tracker at all and  will be the case on old machines with 32bit ptes 
  only).
 
 Quite frankly, unless I see a patch later today that is

I'll do my best to fix it, but I've not yet been shown where the problem
happens. Neither I'm convinced if it is here at all. I'm asking David for 
details.

 
  (a) obvious
  (b) explains what is going on
  (c) tested
 
 I will be reverting the whole soft-dirty mess. I thought the
 bit-mapping games it played were already too complicated (the patch to
 pgtable-2level.h in commit 41bb3476b361 just makes me want to barf and
 came in very late, so I'm not positive about the whole soft-dirty mess

This was not my invention on such encoding, I simply had no choise but
follow existing scheme. Still there are cleanup patches in -mm tree which
make code a way more readable (even former one, without soft-dirty bit).

 in the first place). I really am not at all inclined to want to play
 games in this area any more. It's too damn late in the release window.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread David Vrabel
On 21/08/13 17:19, Cyrill Gorcunov wrote:
 On Wed, Aug 21, 2013 at 05:03:13PM +0100, Jan Beulich wrote:

 Only to non-present ptes, as far as I know.

 That's not really any guarantee. And the accessor functions also
 don't check that they'd be used on non-present PTEs only.
 
 Wait. This _PAGE_SWP_SOFT_DIRTY bit (which is in real PSE bit) assigned
 in only one place -- in try_to_unmap_one(). The PTE get non-present then
 and consists of swap entry format. I don't see any accessor to such entry
 without testing if it's swap entry or pte-none. What I'm missing?
 
 orig_pte has pse bit set if page has been soft dirty
 when it reached swap.

 when it reached swap to me again implies that it could come
 from a live page table, with the present bit set. So that
 explanation attempt of yours confuses me more than it
 clarifies things for me. (And referring to this bit as PSE bit is
 
 When page swapped out it become non-present in pte entry.
 
 sort of wrong here too - there's no PSE bit for 4k PTEs, that
 bit is the PAT one, and that's what the whole discussion
 started from.)
 
 And I asked David to point me how it happens, because I don't
 understand at which point pse bit get analized when page is
 not present.

As Jan said, we're concerned that the bit was being used on present PTEs
and not just non-present ones.  From a more careful look at this code
this does not appear to be the case.

However, I do find the use of PTE bits in this way somewhat fragile.
What other potential corner cases might still remain that will require
further games with PTE bits?

FWIW, Xen uses a separate dirty log to track which pages have become
dirty since the log was last cleared.  Such a dirty log seems more
efficient than having scan all the PTEs looking for the soft dirty bits
and then having to scan them all again to clear them (particularly if
you need multiple passes because the task is still running and
continuing to dirty pages).

David
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Cyrill Gorcunov
On Wed, Aug 21, 2013 at 05:56:08PM +0100, David Vrabel wrote:
  
  And I asked David to point me how it happens, because I don't
  understand at which point pse bit get analized when page is
  not present.
 
 As Jan said, we're concerned that the bit was being used on present PTEs
 and not just non-present ones.  From a more careful look at this code
 this does not appear to be the case.
 
 However, I do find the use of PTE bits in this way somewhat fragile.
 What other potential corner cases might still remain that will require
 further games with PTE bits?

  OK, so this is not a bug finally. The problem is that 2 level pte is
quite small and 7th bit is the only one spare I can use for soft dirty
tracking when page get swapped out. And swap engine is very depending
on pte being non-present, so we are on a safe side.

 FWIW, Xen uses a separate dirty log to track which pages have become
 dirty since the log was last cleared.  Such a dirty log seems more
 efficient than having scan all the PTEs looking for the soft dirty bits
 and then having to scan them all again to clear them (particularly if
 you need multiple passes because the task is still running and
 continuing to dirty pages).
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Andy Lutomirski
On Wed, Aug 21, 2013 at 6:48 AM, David Vrabel david.vra...@citrix.com wrote:
 All,

 179ef71c (mm: save soft-dirty bits on swapped pages) introduces a new
 PTE bit on x86 _PTE_SWP_SOFT_DIRTY which has the same value as _PTE_PSE
 and _PTE_PAT.

 With a Xen PV guest, the use of the _PTE_PAT will result in the page
 having unexpected cachability which will introduce a range of subtle
 performance and correctness issues.  Xen programs the entry 4 in the PAT
 table with WC so a page that was previously WB will end up as WC.


Kind of off topic, but do you have a summary of how Xen uses the high
PAT bits?  I'm the one who wants WT, and if there's already precedent
for using the high PAT bits, it'll be helpful.

(Also, Xen is a little weird, and I'd rather not break it again by
accident like I did with the vsyscall changes a couple years ago.)

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Cyrill Gorcunov
On Wed, Aug 21, 2013 at 09:25:47PM +0400, Cyrill Gorcunov wrote:
  
  However, I do find the use of PTE bits in this way somewhat fragile.
  What other potential corner cases might still remain that will require
  further games with PTE bits?
 
   OK, so this is not a bug finally. The problem is that 2 level pte is
 quite small and 7th bit is the only one spare I can use for soft dirty
 tracking when page get swapped out. And swap engine is very depending
 on pte being non-present, so we are on a safe side.

To make it clear: I'm working on patch that won't use pse bit for dirty
page tracking, just give some time to cook it and test.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread H. Peter Anvin
But is there a manifest bug or not?  What is the deal with Xen?

Cyrill Gorcunov gorcu...@gmail.com wrote:
On Wed, Aug 21, 2013 at 09:25:47PM +0400, Cyrill Gorcunov wrote:
  
  However, I do find the use of PTE bits in this way somewhat
fragile.
  What other potential corner cases might still remain that will
require
  further games with PTE bits?
 
   OK, so this is not a bug finally. The problem is that 2 level pte
is
 quite small and 7th bit is the only one spare I can use for soft
dirty
 tracking when page get swapped out. And swap engine is very depending
 on pte being non-present, so we are on a safe side.

To make it clear: I'm working on patch that won't use pse bit for dirty
page tracking, just give some time to cook it and test.

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Cyrill Gorcunov
On Wed, Aug 21, 2013 at 08:50:26PM +0200, H. Peter Anvin wrote:
 But is there a manifest bug or not?  What is the deal with Xen?
 

I personally don't see bug here because

 - this swapped page soft dirty bit is set for non-present entries only,
   never for present ones, just at moment we form swap pte entry

 - i don't find any code which would test for this bit directly without
   is_swap_pte call

but the use of paw bit itself is confusing, so I'm working on patch which
won't use it. Again, if someone knows where exactly access to pse bit when
pte keeps swap entry may happen (for any purpose other than dirty page
tracking) please share.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Andy Lutomirski
On Wed, Aug 21, 2013 at 12:03 PM, Cyrill Gorcunov gorcu...@gmail.com wrote:
 On Wed, Aug 21, 2013 at 08:50:26PM +0200, H. Peter Anvin wrote:
 But is there a manifest bug or not?  What is the deal with Xen?


 I personally don't see bug here because

  - this swapped page soft dirty bit is set for non-present entries only,
never for present ones, just at moment we form swap pte entry

  - i don't find any code which would test for this bit directly without
is_swap_pte call

 but the use of paw bit itself is confusing, so I'm working on patch which
 won't use it. Again, if someone knows where exactly access to pse bit when
 pte keeps swap entry may happen (for any purpose other than dirty page
 tracking) please share.

I doubt that there are cacheability issues here, since the swap type
already overlaps with PWT and PCD.

What happens if the page being swapped is a THP page?  (I have no idea
how, or even if, this works, but presumably PSE is important.)

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Cyrill Gorcunov
On Wed, Aug 21, 2013 at 12:07:13PM -0700, Andy Lutomirski wrote:
 On Wed, Aug 21, 2013 at 12:03 PM, Cyrill Gorcunov gorcu...@gmail.com wrote:
  On Wed, Aug 21, 2013 at 08:50:26PM +0200, H. Peter Anvin wrote:
  But is there a manifest bug or not?  What is the deal with Xen?
 
 
  I personally don't see bug here because
 
   - this swapped page soft dirty bit is set for non-present entries only,
 never for present ones, just at moment we form swap pte entry
 
   - i don't find any code which would test for this bit directly without
 is_swap_pte call
 
  but the use of paw bit itself is confusing, so I'm working on patch which
  won't use it. Again, if someone knows where exactly access to pse bit when
  pte keeps swap entry may happen (for any purpose other than dirty page
  tracking) please share.
 
 I doubt that there are cacheability issues here, since the swap type
 already overlaps with PWT and PCD.
 
 What happens if the page being swapped is a THP page?  (I have no idea
 how, or even if, this works, but presumably PSE is important.)

add_to_swap - split_huge_page_to_list, so thp will be splitted.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Pavel Emelyanov
On 08/21/2013 11:07 PM, Andy Lutomirski wrote:
 On Wed, Aug 21, 2013 at 12:03 PM, Cyrill Gorcunov gorcu...@gmail.com wrote:
 On Wed, Aug 21, 2013 at 08:50:26PM +0200, H. Peter Anvin wrote:
 But is there a manifest bug or not?  What is the deal with Xen?


 I personally don't see bug here because

  - this swapped page soft dirty bit is set for non-present entries only,
never for present ones, just at moment we form swap pte entry

  - i don't find any code which would test for this bit directly without
is_swap_pte call

 but the use of paw bit itself is confusing, so I'm working on patch which
 won't use it. Again, if someone knows where exactly access to pse bit when
 pte keeps swap entry may happen (for any purpose other than dirty page
 tracking) please share.
 
 I doubt that there are cacheability issues here, since the swap type
 already overlaps with PWT and PCD.
 
 What happens if the page being swapped is a THP page?  (I have no idea
 how, or even if, this works, but presumably PSE is important.)

Huge-page is splitted into small ones, then the smaller ones get swapped out.

 --Andy
 


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Linus Torvalds
On Wed, Aug 21, 2013 at 12:03 PM, Cyrill Gorcunov gorcu...@gmail.com wrote:

 I personally don't see bug here because

  - this swapped page soft dirty bit is set for non-present entries only,
never for present ones, just at moment we form swap pte entry

  - i don't find any code which would test for this bit directly without
is_swap_pte call

Ok, having gone through the places that use swp_*soft_dirty(), I have
to agree. Afaik, it's only ever used on a swap-entry that has (by
definition) the P bit clear. So with or without Xen, I don't see how
it can make any difference.

David/Konrad - did you actually see any issues, or was this just from
(mis)reading the code?

Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Cyrill Gorcunov
On Wed, Aug 21, 2013 at 09:30:03AM -0700, Linus Torvalds wrote:
 Quite frankly, unless I see a patch later today that is
 
  (a) obvious
  (b) explains what is going on
  (c) tested
 
 I will be reverting the whole soft-dirty mess. I thought the
 bit-mapping games it played were already too complicated (the patch to
 pgtable-2level.h in commit 41bb3476b361 just makes me want to barf and
 came in very late, so I'm not positive about the whole soft-dirty mess
 in the first place). I really am not at all inclined to want to play
 games in this area any more. It's too damn late in the release window.

Hi all, I worked on patch which would not touch PSE bit for dirty page
tracking and the result is not that good:

 - 2level pages now always page dirty if page is swapped in and out, because
   there is no space left in PTE (other than PSE bit)

 - only 3level pages scheme uses high 32bits to keep offset of swap entry,
   x86-64 shifts offset up to _PAGE_BIT_GLOBAL + 1 bit, thus I need some
   different bit nonunified with anything else for no reason :(

Summarizing all things

 - Using PSE bit for swap entries as indicator of soft dirty page is safe 
because
   swap entries as saved in pte as non-presen and when #pf happens kernel 
generates
   valid pte entry from vma-vm_page_prot

 - __swp_entry() helper is clearing PSE bit explicitly so even without softdirty
   patch it's not saved once page reach swap (with softdirty tracking we simply
   reuse this bit for own needs).

 - Using PSE bit allows to not modify swap encoding on all 3 page schemes 
(2level,
   3level, 4level) because it's a spare bit there not intersected with swap 
format.

Thus I would *_really_* like to save current scheme. Probably I should add 
comment
into header where _PAGE_SWP_SOFT_DIRTY defined that it's valid only when PRESENT
bit clear? Similar to

/* If _PAGE_BIT_PRESENT is clear, we use these: */
/* - if the user mapped it with PROT_NONE; pte_present gives true */
#define _PAGE_BIT_PROTNONE  _PAGE_BIT_GLOBAL
/* - set: nonlinear file mapping, saved PTE; unset:swap */
#define _PAGE_BIT_FILE  _PAGE_BIT_DIRTY

Have I conviced you guys?

The former problem report came from impression that this PSE bit may be touched
(set and clean) on present PTE, but it's not the case for pages being swapped.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Andi Kleen
Cyrill Gorcunov gorcu...@gmail.com writes:

 Hi all, I worked on patch which would not touch PSE bit for dirty page
 tracking and the result is not that good:

  - 2level pages now always page dirty if page is swapped in and out, because
there is no space left in PTE (other than PSE bit)

Maybe just don't support soft dirty for 2 level page tables?

2 level page tables should be really on the way out anyways, as they
have severe limits and do not support NX. With 3 levels there is enough
space.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Dave Jones
On Wed, Aug 21, 2013 at 04:04:54PM -0700, Linus Torvalds wrote:
  On Wed, Aug 21, 2013 at 12:03 PM, Cyrill Gorcunov gorcu...@gmail.com wrote:
  
   I personally don't see bug here because
  
- this swapped page soft dirty bit is set for non-present entries only,
  never for present ones, just at moment we form swap pte entry
  
- i don't find any code which would test for this bit directly without
  is_swap_pte call
  
  Ok, having gone through the places that use swp_*soft_dirty(), I have
  to agree. Afaik, it's only ever used on a swap-entry that has (by
  definition) the P bit clear. So with or without Xen, I don't see how
  it can make any difference.
  
  David/Konrad - did you actually see any issues, or was this just from
  (mis)reading the code?

Could this explain what I'm seeing in another thread ?
https://lkml.org/lkml/2013/8/7/27

Dave

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Cyrill Gorcunov
On Wed, Aug 21, 2013 at 08:51:15PM -0400, Dave Jones wrote:
 
 Could this explain what I'm seeing in another thread ?
 https://lkml.org/lkml/2013/8/7/27

Don't think so. This code is merged in -rc6, while your report is saying the
kernel version is -rc4 (also this feature requires CONFIG_MEM_SOFT_DIRTY
to be turned on, do you have it?).
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use

2013-08-21 Thread Cyrill Gorcunov
On Wed, Aug 21, 2013 at 04:42:53PM -0700, Andi Kleen wrote:
 Cyrill Gorcunov gorcu...@gmail.com writes:
 
  Hi all, I worked on patch which would not touch PSE bit for dirty page
  tracking and the result is not that good:
 
   - 2level pages now always page dirty if page is swapped in and out, because
 there is no space left in PTE (other than PSE bit)
 
 Maybe just don't support soft dirty for 2 level page tables?
 
 2 level page tables should be really on the way out anyways, as they
 have severe limits and do not support NX. With 3 levels there is enough
 space.

Look, good thing is that 7th bit also available on the 4level pages
(ie x86-64) without additional code modification, that's why I picked
it in first place. I prepared the patch locally which doesn't use
pse bit for tracking but it only makes code more complex.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/