On Fri, 16 Feb 2001, Linus Torvalds wrote:
> This is, actually, a problem that I suspect ends up being _very_ similar
> to the zap_page_range() case. zap_page_range() needs to make sure that
> everything has been updated by the time the page is actually free'd. While
> filemap_sync() needs to
On Fri, 16 Feb 2001, Ben LaHaise wrote:
>
> Actually, in the filemap_sync case, the flush_tlb_page is redundant --
> there's already a call to flush_tlb_range in filemap_sync after the dirty
> bits are cleared.
This is not enough.
If another CPU has started write-out of one of the dirty
On Fri, 16 Feb 2001, Manfred Spraul wrote:
> That leaves msync() - it currently does a flush_tlb_page() for every
> single dirty page.
> Is it possible to integrate that into the mmu gather code?
>
> tlb_transfer_dirty() in addition to tlb_clear_page()?
Actually, in the filemap_sync case, the
On Fri, 16 Feb 2001, Manfred Spraul wrote:
>
> That leaves msync() - it currently does a flush_tlb_page() for every
> single dirty page.
> Is it possible to integrate that into the mmu gather code?
Not even necessary.
The D bit does not have to be coherent. We need to make sure that we flush
Jamie Lokier wrote:
>
> > > Ben, fancy writing a boot-time test?
> > >
> > I'd never rely on such a test - what if the cpu checks in 99% of the
> > cases, but doesn't handle some cases ('rep movd, everything unaligned,
> > ...'.
>
> A good point. The test results are inconclusive.
>
> > And
On Fri, 16 Feb 2001, Ben LaHaise wrote:
> On Fri, 16 Feb 2001, Jamie Lokier wrote:
>
> > It should be fast on known CPUs, correct on unknown ones, and much
> > simpler than "gather" code which may be completely unnecessary and
> > rather difficult to test.
> >
> > If anyone reports the
> > Ben, fancy writing a boot-time test?
> >
> I'd never rely on such a test - what if the cpu checks in 99% of the
> cases, but doesn't handle some cases ('rep movd, everything unaligned,
> ...'.
Manfred Spraul wrote:
> The other cpu writes the dirty bit - we just overwrite it ;-)
> After the ptep_get_and_clear(), before the set_pte().
Ah, I see. The other CPU does an atomic *pte |= _PAGE_DIRTY, without
checking the present bit. ('scuse me for temporary brain failure).
How about a
Jamie Lokier wrote:
>
> And how does that lose a dirty bit?
>
> For the other processor to not write a dirty bit, it must have a dirty
^^^
> TLB entry already which, along with the locked cycle in
> ptep_get_and_clear, means that `entry' will have _PAGE_DIRTY
Jamie Lokier wrote:
Linus Torvalds wrote:
So the only case that ends up being fairly heavy may be a case that is
very uncommon in practice (only for unmapping shared mappings in
threaded programs or the lazy TLB case).
The lazy tlb case is quite fast: lazy tlb thread never write to user
Manfred Spraul wrote:
I can think of one case where performance is considered quite important:
mprotect() is used by several garbage collectors, including threaded
ones. Maybe mprotect() isn't the best primitive for those anyway, but
it's what they have to work with atm.
Does
Jamie Lokier wrote:
/* mprotect.c */
entry = ptep_get_and_clear(pte);
set_pte(pte, pte_modify(entry, newprot));
I.e. the only code with the race condition is code which explicitly
clears the dirty bit, in vmscan.c.
Do you see any possibility of losing a dirty bit here?
Manfred Spraul wrote:
entry = ptep_get_and_clear(pte);
set_pte(pte, pte_modify(entry, newprot));
I.e. the only code with the race condition is code which explicitly
clears the dirty bit, in vmscan.c.
Do you see any possibility of losing a dirty bit here?
Of
Manfred Spraul wrote:
Ok, is there one case where your pragmatic solution is vastly faster?
* mprotect: No. The difference is at most one additional locked
instruction for each pte.
Oh, what instruction is that?
* munmap(anon): No. We must handle delayed accesses anyway (don't call
On Fri, 16 Feb 2001, Jamie Lokier wrote:
It should be fast on known CPUs, correct on unknown ones, and much
simpler than "gather" code which may be completely unnecessary and
rather difficult to test.
If anyone reports the message, _then_ we think about the problem some more.
Ben, fancy
On Fri, 16 Feb 2001, Jamie Lokier wrote:
Manfred Spraul wrote:
Ok, is there one case where your pragmatic solution is vastly faster?
* mprotect: No. The difference is at most one additional locked
instruction for each pte.
Oh, what instruction is that?
The "set_pte()" thing could
On Fri, 16 Feb 2001, Linus Torvalds wrote:
How do you expect to ever see this in practice? Sounds basically
impossible to test for this hardware race. The obvious "try to dirty as
fast as possible on one CPU while doing an atomic get-and-clear on the
other" thing is not valid - it's in fact
On Fri, 16 Feb 2001, Manfred Spraul wrote:
Jamie Lokier wrote:
Linus Torvalds wrote:
So the only case that ends up being fairly heavy may be a case that is
very uncommon in practice (only for unmapping shared mappings in
threaded programs or the lazy TLB case).
The lazy tlb
Manfred Spraul wrote:
A very simple test might be
cpu 1:
cpu 2:
Ben's test uses only one CPU.
Now start with variants:
change to read only instead of not present
a and b in the same way of the tlb, in a different way.
change pte with write, change with lock;
.
.
.
But you'll
On Fri, 16 Feb 2001, Jamie Lokier wrote:
And check the Pentium III errata. There is one with the tlb
that's only triggered if 4 instructions lie in a certain window and all
access memory in the same way of the tlb (EFLAGS incorrect if 'andl
mask,memory_addr' causes a page fault).
Linus wrote:
That second pass is what I had in mind.
* munmap(file): No. Second pass required for correct msync behaviour.
It is?
Not now it isn't. We just do a msync() + fsync() for msync(MS_SYNC). Which
is admittedly not optimal, but it works.
Ok, munmap() will be fixed by
On Fri, 16 Feb 2001, Jamie Lokier wrote:
>
> If you want to take it really far, it _could_ be that the TLB data
> contains both the pointer and the original pte contents. Then "mark
> dirty" becomes
>
> val |= D
> write *ptr
No. This is forbidden by the intel documentation.
Linus Torvalds wrote:
> It _could_ be that the TLB data actually also contains the pointer to
> the place where it was fetched, and a "mark dirty" becomes
>
> read *ptr locked
> val |= D
> write *ptr unlock
If you want to take it really far, it _could_ be that the TLB data
[Added Linus and linux-kernel as I think it's of general interest]
Kanoj Sarcar wrote:
> Whether Jamie was trying to illustrate a different problem, I am not
> sure.
Yes, I was talking about pte_test_and_clear_dirty in the earlier post.
> Look in mm/mprotect.c. Look at the call sequence
Kanoj Sarcar wrote:
Here's the important part: when processor 2 wants to set the pte's dirty
bit, it *rereads* the pte and *rechecks* the permission bits again.
Even though it has a non-dirty TLB entry for that pte.
That is how I read Ben LaHaise's description, and his test program
Kanoj Sarcar wrote:
Okay, I will quote from Intel Architecture Software Developer's Manual
Volume 3: System Programming Guide (1997 print), section 3.7, page 3-27:
"Bus cycles to the page directory and page tables in memory are performed
only when the TLBs do not contain the translation
Manfred Spraul wrote:
Is the sequence
<< lock;
read pte
pte |= dirty
write pte
>> end lock;
or
<< lock;
read pte
if (!present(pte))
	do_page_fault();
pte |= dirty
write pte.
>> end lock;
or more generally
<< lock;
read pte
if (!present(pte) || !writable(pte))
	do_page_fault();
On Thu, 15 Feb 2001, Kanoj Sarcar wrote:
No. All architectures do not have this problem. For example, if the
Linux "dirty" (not the pte dirty) bit is managed by software, a fault
will actually be taken when processor 2 tries to do the write. The fault
is solely to make sure that the Linux
In article [EMAIL PROTECTED],
Kanoj Sarcar [EMAIL PROTECTED] wrote:
Will you please go off and prove that this "problem" exists on some x86
processor before continuing this rant? None of the PII, PIII, Athlon,
And will you please stop behaving like this is not an issue?
This is
In article [EMAIL PROTECTED],
Jamie Lokier [EMAIL PROTECTED] wrote:
<< lock;
read pte
if (!present(pte))
	do_page_fault();
pte |= dirty
write pte.
>> end lock;
No, it is a little more complicated. You also have to include the
tlb state in this algorithm. Since that is what
Manfred Spraul wrote:
I just benchmarked a single flush_tlb_page().
Pentium II 350: ~ 2000 cpu ticks.
Pentium III 850: ~ 3000 cpu ticks.
I forgot the important part:
SMP, including a smp_call_function() IPI.
IIRC Ingo wrote that a local 'invlpg' is around 100 ticks.
--
Manfred
Linus Torvalds wrote:
It _could_ be that the TLB data actually also contains the pointer to
the place where it was fetched, and a "mark dirty" becomes
read *ptr locked
val |= D
write *ptr unlock
If you want to take it really far, it _could_ be that the TLB data
contains
On Fri, 16 Feb 2001, Jamie Lokier wrote:
If you want to take it really far, it _could_ be that the TLB data
contains both the pointer and the original pte contents. Then "mark
dirty" becomes
val |= D
write *ptr
No. This is forbidden by the intel documentation. First
On Thu, 15 Feb 2001, Manfred Spraul wrote:
Now, I will agree that I suspect most x86 _implementations_ will not do
this. TLB's are too timing-critical, and nobody tends to want to make
them bigger than necessary - so saving off the source address is
unlikely. Also, setting the D bit