Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation

2016-01-07 Thread alvise rigo
On Thu, Jan 7, 2016 at 11:22 AM, Peter Maydell  wrote:
> On 7 January 2016 at 10:21, alvise rigo  wrote:
>> Hi,
>>
>> On Wed, Jan 6, 2016 at 7:00 PM, Andrew Baumann
>>  wrote:
>>> As a heads up, we just added support for alignment checks in LDREX:
>>> https://github.com/qemu/qemu/commit/30901475b91ef1f46304404ab4bfe89097f61b96
>
>> It should be, if we add an aligned variant for each of the exclusive helpers.
>> BTW, why don't we also make the check for the STREX instruction?
>
> Andrew's patch only changed the bits Windows cares about, I think.
> We should indeed extend this to cover STREX and the A64 instructions
> as well, I think.

The alignment check is easily doable in general. The only tricky part
I found is the A64 STXP instruction, which requires quadword (16-byte)
alignment for the 64-bit paired access.
In that case, the translation of the instruction will rely on an
aarch64-only helper. The alternative solution would be to extend
softmmu_template.h to generate 128-bit accesses, but I don't believe
this is the right way to go.
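For illustration, the alignment constraint above can be sketched as a couple of small predicates (all names here are hypothetical, not the actual QEMU helpers): a paired 64-bit STXP access must be aligned to the size of the whole 128-bit pair, while single exclusive accesses only need natural alignment for their own size.

```c
#include <stdint.h>
#include <stdbool.h>

/* Natural alignment check for a power-of-two access size. */
static bool addr_is_aligned(uint64_t addr, unsigned size_bytes)
{
    return (addr & (size_bytes - 1)) == 0;
}

/* A64 STXP with two 64-bit registers stores a 128-bit pair, so the
 * effective address must be quadword (16-byte) aligned. */
static bool stxp_64bit_pair_aligned(uint64_t addr)
{
    return addr_is_aligned(addr, 16);
}
```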

Regards,
alvise

>
> thanks
> -- PMM



Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation

2016-01-07 Thread alvise rigo
Hi,

On Wed, Jan 6, 2016 at 7:00 PM, Andrew Baumann
 wrote:
>
> Hi,
>
> > From: qemu-devel-bounces+andrew.baumann=microsoft@nongnu.org
> > [mailto:qemu-devel-
> > bounces+andrew.baumann=microsoft@nongnu.org] On Behalf Of
> > Alvise Rigo
> > Sent: Monday, 14 December 2015 00:41
> >
> > This is the sixth iteration of the patch series which applies to the
> > upstream branch of QEMU (v2.5.0-rc3).
> >
> > Changes versus previous versions are at the bottom of this cover letter.
> >
> > The code is also available at the following repository:
> > https://git.virtualopensystems.com/dev/qemu-mt.git
> > branch:
> > slowpath-for-atomic-v6-no-mttcg
> >
> > This patch series provides an infrastructure for atomic instruction
> > implementation in QEMU, thus offering a 'legacy' solution for
> > translating guest atomic instructions. Moreover, it can be considered as
> > a first step toward a multi-thread TCG.
> >
> > The underlying idea is to provide new TCG helpers (sort of softmmu
> > helpers) that guarantee atomicity to some memory accesses or in general
> > a way to define memory transactions.
> >
> > More specifically, the new softmmu helpers behave as LoadLink and
> > StoreConditional instructions, and are called from TCG code by means of
> > target specific helpers. This work includes the implementation for all
> > the ARM atomic instructions, see target-arm/op_helper.c.
>
> As a heads up, we just added support for alignment checks in LDREX:
> https://github.com/qemu/qemu/commit/30901475b91ef1f46304404ab4bfe89097f61b96

Thank you for the update.

>
> Hopefully it is an easy change to ensure that the same check happens for the 
> relevant loads when CONFIG_TCG_USE_LDST_EXCL is enabled?

It should be, if we add an aligned variant for each of the exclusive helpers.
BTW, why don't we also make the check for the STREX instruction?

Regards,
alvise

>
> Thanks,
> Andrew



Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation

2016-01-06 Thread Andrew Baumann
Hi,

> From: qemu-devel-bounces+andrew.baumann=microsoft@nongnu.org
> [mailto:qemu-devel-
> bounces+andrew.baumann=microsoft@nongnu.org] On Behalf Of
> Alvise Rigo
> Sent: Monday, 14 December 2015 00:41
> 
> This is the sixth iteration of the patch series which applies to the
> upstream branch of QEMU (v2.5.0-rc3).
> 
> Changes versus previous versions are at the bottom of this cover letter.
> 
> The code is also available at the following repository:
> https://git.virtualopensystems.com/dev/qemu-mt.git
> branch:
> slowpath-for-atomic-v6-no-mttcg
> 
> This patch series provides an infrastructure for atomic instruction
> implementation in QEMU, thus offering a 'legacy' solution for
> translating guest atomic instructions. Moreover, it can be considered as
> a first step toward a multi-thread TCG.
> 
> The underlying idea is to provide new TCG helpers (sort of softmmu
> helpers) that guarantee atomicity to some memory accesses or in general
> a way to define memory transactions.
> 
> More specifically, the new softmmu helpers behave as LoadLink and
> StoreConditional instructions, and are called from TCG code by means of
> target specific helpers. This work includes the implementation for all
> the ARM atomic instructions, see target-arm/op_helper.c.

As a heads up, we just added support for alignment checks in LDREX:
https://github.com/qemu/qemu/commit/30901475b91ef1f46304404ab4bfe89097f61b96

Hopefully it is an easy change to ensure that the same check happens for the 
relevant loads when CONFIG_TCG_USE_LDST_EXCL is enabled?

Thanks,
Andrew


Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation

2015-12-17 Thread alvise rigo
Hi Alex,

On Thu, Dec 17, 2015 at 5:06 PM, Alex Bennée  wrote:

>
> Alvise Rigo  writes:
>
> > This is the sixth iteration of the patch series which applies to the
> > upstream branch of QEMU (v2.5.0-rc3).
> >
> > Changes versus previous versions are at the bottom of this cover letter.
> >
> > The code is also available at the following repository:
> > https://git.virtualopensystems.com/dev/qemu-mt.git
> > branch:
> > slowpath-for-atomic-v6-no-mttcg
>
> I'm starting to look through this now. However one problem that
>

Thank you for this.


> immediately comes up is the aarch64 breakage. Because there is an
> intrinsic link between a lot of the arm and aarch64 code it breaks the
> other targets.
>
> You could fix this by ensuring that CONFIG_TCG_USE_LDST_EXCL doesn't get
> passed to the aarch64 build (tricky as aarch64-softmmu.mak includes
> arm-softmmu.mak) or bite the bullet now and add the 64 bit helpers that
> will be needed to convert the aarch64 exclusive equivalents.
>

This is what I'm doing right now :)

Best regards,
alvise


>
> >
> > This patch series provides an infrastructure for atomic instruction
> > implementation in QEMU, thus offering a 'legacy' solution for
> > translating guest atomic instructions. Moreover, it can be considered as
> > a first step toward a multi-thread TCG.
> >
> > The underlying idea is to provide new TCG helpers (sort of softmmu
> > helpers) that guarantee atomicity to some memory accesses or in general
> > a way to define memory transactions.
> >
> > More specifically, the new softmmu helpers behave as LoadLink and
> > StoreConditional instructions, and are called from TCG code by means of
> > target specific helpers. This work includes the implementation for all
> > the ARM atomic instructions, see target-arm/op_helper.c.
> >
> > The implementation heavily uses the software TLB together with a new
> > bitmap that has been added to the ram_list structure which flags, on a
> > per-CPU basis, all the memory pages that are in the middle of a LoadLink
> > (LL)/StoreConditional (SC) operation.  Since all these pages can be
> > accessed directly through the fast-path and alter a vCPU's linked value,
> > the new bitmap has been coupled with a new TLB flag for the TLB virtual
> > address which forces the slow-path execution for all the accesses to a
> > page containing a linked address.
> >
> > The new slow-path is implemented such that:
> > - the LL behaves as a normal load slow-path, except for clearing the
> >   dirty flag in the bitmap.  The cputlb.c code, while generating a TLB
> >   entry, checks whether at least one vCPU has the bit cleared
> >   in the exclusive bitmap; in that case the TLB entry will have the EXCL
> >   flag set, thus forcing the slow-path.  In order to ensure that all the
> >   vCPUs will follow the slow-path for that page, we flush the TLB cache
> >   of all the other vCPUs.
> >
> >   The LL will also set the linked address and size of the access in a
> >   vCPU's private variable. After the corresponding SC, this address will
> >   be set to a reset value.
> >
> > - the SC can fail, returning 1, or succeed, returning 0.  It must
> >   always come after an LL and access the same address 'linked' by the
> >   previous LL; otherwise it will fail. If, in the time window delimited
> >   by a legit pair of LL/SC operations, another write access happens to
> >   the linked address, the SC will fail.
> >
> > In theory, the provided implementation of TCG LoadLink/StoreConditional
> > can be used to properly handle atomic instructions on any architecture.
> >
> > The code has been tested with bare-metal test cases and by booting Linux.
> >
> > * Performance considerations
> > The new slow-path adds some overhead to the translation of the ARM
> > atomic instructions, since their emulation no longer happens only
> > in the guest (by means of pure TCG-generated code), but requires the
> > execution of two helper functions. Despite this, the additional time
> > required to boot an ARM Linux kernel on an i7 clocked at 2.5GHz is
> > negligible.
> > Instead, on an LL/SC-bound test scenario - like:
> > https://git.virtualopensystems.com/dev/tcg_baremetal_tests.git - this
> > solution requires 30% (1 million iterations) and 70% (10 million
> > iterations) additional time for the test to complete.
> >
> > Changes from v5:
> > - The exclusive memory region is now set through a CPUClass hook,
> >   allowing any architecture to decide the memory area that will be
> >   protected during a LL/SC operation [PATCH 3]
> > - The runtime helpers dropped any target dependency and are now in a
> >   common file [PATCH 5]
> > - Improved the way we restore a guest page as non-exclusive [PATCH 9]
> > - Included MMIO memory as a possible target of LL/SC
> >   instructions. This also required somewhat simplifying the
> >   helper_*_st_name helpers in softmmu_template.h [PATCH 8-14]
> >
> > Changes from v4:
> > - Reworked the exclusive bitmap to be of fixed size (8 b

Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation

2015-12-17 Thread Alex Bennée

Alvise Rigo  writes:

> This is the sixth iteration of the patch series which applies to the
> upstream branch of QEMU (v2.5.0-rc3).
>
> Changes versus previous versions are at the bottom of this cover letter.
>
> The code is also available at the following repository:
> https://git.virtualopensystems.com/dev/qemu-mt.git
> branch:
> slowpath-for-atomic-v6-no-mttcg

I'm starting to look through this now. However, one problem that
immediately comes up is the aarch64 breakage. Because there is an
intrinsic link between a lot of the arm and aarch64 code, it breaks the
other targets.

You could fix this by ensuring that CONFIG_TCG_USE_LDST_EXCL doesn't get
passed to the aarch64 build (tricky as aarch64-softmmu.mak includes
arm-softmmu.mak) or bite the bullet now and add the 64-bit helpers that
will be needed to convert the aarch64 exclusive equivalents.

>
> This patch series provides an infrastructure for atomic instruction
> implementation in QEMU, thus offering a 'legacy' solution for
> translating guest atomic instructions. Moreover, it can be considered as
> a first step toward a multi-thread TCG.
>
> The underlying idea is to provide new TCG helpers (sort of softmmu
> helpers) that guarantee atomicity to some memory accesses or in general
> a way to define memory transactions.
>
> More specifically, the new softmmu helpers behave as LoadLink and
> StoreConditional instructions, and are called from TCG code by means of
> target specific helpers. This work includes the implementation for all
> the ARM atomic instructions, see target-arm/op_helper.c.
>
> The implementation heavily uses the software TLB together with a new
> bitmap that has been added to the ram_list structure which flags, on a
> per-CPU basis, all the memory pages that are in the middle of a LoadLink
> (LL)/StoreConditional (SC) operation.  Since all these pages can be
> accessed directly through the fast-path and alter a vCPU's linked value,
> the new bitmap has been coupled with a new TLB flag for the TLB virtual
> address which forces the slow-path execution for all the accesses to a
> page containing a linked address.
>
> The new slow-path is implemented such that:
> - the LL behaves as a normal load slow-path, except for clearing the
>   dirty flag in the bitmap.  The cputlb.c code, while generating a TLB
>   entry, checks whether at least one vCPU has the bit cleared
>   in the exclusive bitmap; in that case the TLB entry will have the EXCL
>   flag set, thus forcing the slow-path.  In order to ensure that all the
>   vCPUs will follow the slow-path for that page, we flush the TLB cache
>   of all the other vCPUs.
>
>   The LL will also set the linked address and size of the access in a
>   vCPU's private variable. After the corresponding SC, this address will
>   be set to a reset value.
>
> - the SC can fail, returning 1, or succeed, returning 0.  It must
>   always come after an LL and access the same address 'linked' by the
>   previous LL; otherwise it will fail. If, in the time window delimited
>   by a legit pair of LL/SC operations, another write access happens to
>   the linked address, the SC will fail.
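The LL/SC slow-path semantics quoted above can be sketched, very schematically, as a single-threaded toy model (all names here are illustrative, not the actual QEMU helpers; real QEMU must also handle the TLB, MMIO, and cross-vCPU interactions the cover letter describes):

```c
#include <stdint.h>
#include <stddef.h>

#define EXCL_RESET ((uint64_t)-1)   /* "no address linked" reset value */

typedef struct {
    uint64_t excl_addr;   /* address linked by the last LL */
    unsigned excl_size;   /* size of the linked access */
} vcpu_t;

/* LL: load the value and record the linked address and size. */
static uint64_t ll_load(vcpu_t *cpu, const uint64_t *mem, uint64_t addr)
{
    cpu->excl_addr = addr;
    cpu->excl_size = sizeof(*mem);
    return mem[addr / sizeof(*mem)];
}

/* Any other write to the linked address breaks the link. */
static void notify_write(vcpu_t *cpu, uint64_t addr)
{
    if (addr == cpu->excl_addr) {
        cpu->excl_addr = EXCL_RESET;
    }
}

/* SC: succeeds (returns 0) only if the link is still intact and it
 * targets the same address; the link is consumed either way. */
static int sc_store(vcpu_t *cpu, uint64_t *mem, uint64_t addr, uint64_t val)
{
    int fail = (cpu->excl_addr != addr);
    cpu->excl_addr = EXCL_RESET;
    if (!fail) {
        mem[addr / sizeof(*mem)] = val;
    }
    return fail;
}
```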
>
> In theory, the provided implementation of TCG LoadLink/StoreConditional
> can be used to properly handle atomic instructions on any architecture.
>
> The code has been tested with bare-metal test cases and by booting Linux.
>
> * Performance considerations
> The new slow-path adds some overhead to the translation of the ARM
> atomic instructions, since their emulation no longer happens only
> in the guest (by means of pure TCG-generated code), but requires the
> execution of two helper functions. Despite this, the additional time
> required to boot an ARM Linux kernel on an i7 clocked at 2.5GHz is
> negligible.
> Instead, on an LL/SC-bound test scenario - like:
> https://git.virtualopensystems.com/dev/tcg_baremetal_tests.git - this
> solution requires 30% (1 million iterations) and 70% (10 million
> iterations) additional time for the test to complete.
>
> Changes from v5:
> - The exclusive memory region is now set through a CPUClass hook,
>   allowing any architecture to decide the memory area that will be
>   protected during a LL/SC operation [PATCH 3]
> - The runtime helpers dropped any target dependency and are now in a
>   common file [PATCH 5]
> - Improved the way we restore a guest page as non-exclusive [PATCH 9]
> - Included MMIO memory as a possible target of LL/SC
>   instructions. This also required somewhat simplifying the
>   helper_*_st_name helpers in softmmu_template.h [PATCH 8-14]
>
> Changes from v4:
> - Reworked the exclusive bitmap to be of fixed size (8 bits per address)
> - The slow-path is now TCG backend independent, no need to touch
>   tcg/* anymore as suggested by Aurelien Jarno.
>
> Changes from v3:
> - based on upstream QEMU
> - addressed comments from Alex Bennée
> - the slow path can be enabled by the user with:
>   ./configure --enable-tcg-ldst-excl only if the backend support

Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation

2015-12-15 Thread alvise rigo
On Tue, Dec 15, 2015 at 3:18 PM, Paolo Bonzini  wrote:
>
>
> On 15/12/2015 14:59, alvise rigo wrote:
>>> > If we have two CPUs, with CPU 0 executing an LL and CPU 1 executing a
>>> > store, you can model this as a consensus problem.  For example, CPU 0
>>> > could propose that the subsequent SC succeeds, while CPU 1 proposes that
>>> > it fails.  The outcome of the SC instruction depends on who wins.
>> I see your point. This, as you wrote, holds only when we attempt to
>> make the fast path wait-free.
>> However, the implementation I proposed is not wait-free and somehow
>> serializes the accesses made to the shared resources (that will
>> determine if the access was successful or not) by means of a mutex.
>> The assumption I made - and somehow verified - is that the "colliding
>> fast accesses" are rare.
>
> Isn't the fast path (where TLB_EXCL is not set) wait-free?

There is no such fast path if we force every CPU to exit the TB and
flush the TLB.
I thought that by "fast path" you were referring to a slow path
forced through TLB_EXCL, sorry.

alvise

>
> This is enough to mess up the theory, though in practice it works.
>
>> I guess you also agree on this, otherwise how
>> could a wait-free implementation possibly work without being coupled
>> with primitives of an appropriate consensus number?
>
> It couldn't. :)
>
> Paolo



Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation

2015-12-15 Thread Paolo Bonzini


On 15/12/2015 14:59, alvise rigo wrote:
>> > If we have two CPUs, with CPU 0 executing an LL and CPU 1 executing a
>> > store, you can model this as a consensus problem.  For example, CPU 0
>> > could propose that the subsequent SC succeeds, while CPU 1 proposes that
>> > it fails.  The outcome of the SC instruction depends on who wins.
> I see your point. This, as you wrote, holds only when we attempt to
> make the fast path wait-free.
> However, the implementation I proposed is not wait-free and somehow
> serializes the accesses made to the shared resources (that will
> determine if the access was successful or not) by means of a mutex.
> The assumption I made - and somehow verified - is that the "colliding
> fast accesses" are rare.

Isn't the fast path (where TLB_EXCL is not set) wait-free?

This is enough to mess up the theory, though in practice it works.

> I guess you also agree on this, otherwise how
> could a wait-free implementation possibly work without being coupled
> with primitives of an appropriate consensus number?

It couldn't. :)

Paolo



Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation

2015-12-15 Thread alvise rigo
Hi Paolo,

On Mon, Dec 14, 2015 at 11:17 AM, Paolo Bonzini  wrote:
>
>
> On 14/12/2015 11:04, alvise rigo wrote:
>> In any case, what I proposed in the mttcg-based v5 was:
>> - An LL ensures that the TLB_EXCL flag is set in all the CPUs' TLBs.
>> This is done by requesting a TLB flush from all (not exactly all...) the
>> CPUs. To be 100% safe, we should probably also wait until the flush has
>> actually been performed
>> - A set TLB_EXCL flag always forces the slow-path, allowing the CPUs
>> to check for a possible collision with an "exclusive memory region"
>>
>> Now, why would requesting the flush (and possibly ensuring that
>> it has actually been done) not be enough?
>
> There will always be a race where the normal store fails.  While I
> haven't studied your code enough to do a constructive proof, it's enough
> to prove the impossibility of what you're trying to do.  Mind, I also
> believed for a long time that it was possible to do it!
>
> If we have two CPUs, with CPU 0 executing an LL and CPU 1 executing a
> store, you can model this as a consensus problem.  For example, CPU 0
> could propose that the subsequent SC succeeds, while CPU 1 proposes that
> it fails.  The outcome of the SC instruction depends on who wins.

I see your point. This, as you wrote, holds only when we attempt to
make the fast path wait-free.
However, the implementation I proposed is not wait-free and somehow
serializes the accesses made to the shared resources (that will
determine if the access was successful or not) by means of a mutex.
The assumption I made - and somehow verified - is that the "colliding
fast accesses" are rare. I guess you also agree on this, otherwise how
could a wait-free implementation possibly work without being coupled
with primitives of an appropriate consensus number?

Thank you,
alvise

>
> Therefore, implementing LL/SC requires---on both CPU 0 and CPU
> 1, and hence for both LL/SC and normal stores---an atomic primitive with
> consensus number >= 2.  Other than LL/SC itself, the commonly-available
> operations satisfying this requirement are test-and-set (consensus
> number 2) and compare-and-swap (infinite consensus number).  Normal
> memory reads and writes (called "atomic registers" in multi-processing
> research lingo) have consensus number 1; it's not enough.
>
> If the host had LL/SC, CPU 1 could in principle delegate its side of the
> consensus problem to the processor; but even that is not a solution
> because processors constrain the instructions that can appear between
> the load and the store, and this could cause an infinite sequence of
> spurious failed SCs.  Another option is transactional memory, but it's
> also too slow for normal stores.
>
> The simplest solution is not to implement full LL/SC semantics; instead,
> similar to linux-user, a SC operation can perform a cmpxchg from the
> value fetched by LL to the argument of SC.  This bypasses the issue
> because stores do not have to be instrumented at all, but it does mean
> that the emulation suffers from the ABA problem.
>
> TLB_EXCL is also a middle-ground, a little bit stronger than cmpxchg.
> It's more complex and more accurate, but also not perfect.  Which is
> okay, but has to be documented.
>
> Paolo



Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation

2015-12-15 Thread alvise rigo
Hi Andreas,

On Mon, Dec 14, 2015 at 11:09 PM, Andreas Tobler  wrote:
> Alvise,
>
> On 14.12.15 09:41, Alvise Rigo wrote:
>>
>> This is the sixth iteration of the patch series which applies to the
>> upstream branch of QEMU (v2.5.0-rc3).
>>
>> Changes versus previous versions are at the bottom of this cover letter.
>>
>> The code is also available at the following repository:
>> https://git.virtualopensystems.com/dev/qemu-mt.git
>> branch:
>> slowpath-for-atomic-v6-no-mttcg
>
>
> Thank you very much for this work. I tried to rebase myself, but it was over
> my head.
>
> I'm looking for a QEMU solution that lets me use my host cores.
>
> My use case is porting GCC to aarch64-*-freebsd*. I think it doesn't
> matter which OS. This architecture does not yet have enough affordable real
> hardware on the market, so I was looking at your solution. Claudio gave me
> a hint about it.
>
> Your recent merge/rebase only covers arm itself, not aarch64, right?

Indeed, only arm. Keep in mind that this patch series applies to the
upstream version of QEMU, not to the mttcg branch.
In other words, the repo includes a version of QEMU which is
single-threaded, with some changes to the atomic instruction handling
in anticipation of multi-threaded emulation.

>
> Linking fails with unreferenced cpu_exclusive_addr stuff in
> target-arm/translate-a64.c

Even though aarch64 is not supported, this error should not happen. My
fault; I will fix it in the coming version.

>
> Are you working on this already? Or Claudio?

As soon as the mttcg branch is updated, I will rebase this patch
series on top of the new branch, and will possibly also cover the
aarch64 architecture.

Thank you,
alvise

>
>> This work has been sponsored by Huawei Technologies Duesseldorf GmbH.
>
>
> ...
>
> Thank you!
> Andreas
>



Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation

2015-12-14 Thread Andreas Tobler

Alvise,

On 14.12.15 09:41, Alvise Rigo wrote:

This is the sixth iteration of the patch series which applies to the
upstream branch of QEMU (v2.5.0-rc3).

Changes versus previous versions are at the bottom of this cover letter.

The code is also available at the following repository:
https://git.virtualopensystems.com/dev/qemu-mt.git
branch:
slowpath-for-atomic-v6-no-mttcg


Thank you very much for this work. I tried to rebase myself, but it was 
over my head.


I'm looking for a QEMU solution that lets me use my host cores.

My use case is porting GCC to aarch64-*-freebsd*. I think it
doesn't matter which OS. This architecture does not yet have enough
affordable real hardware on the market, so I was looking at your
solution. Claudio gave me a hint about it.


Your recent merge/rebase only covers arm itself, not aarch64, right?

Linking fails with unreferenced cpu_exclusive_addr stuff in 
target-arm/translate-a64.c


Are you working on this already? Or Claudio?


This work has been sponsored by Huawei Technologies Duesseldorf GmbH.


...

Thank you!
Andreas




Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation

2015-12-14 Thread Paolo Bonzini


On 14/12/2015 11:04, alvise rigo wrote:
> In any case, what I proposed in the mttcg-based v5 was:
> - An LL ensures that the TLB_EXCL flag is set in all the CPUs' TLBs.
> This is done by requesting a TLB flush from all (not exactly all...) the
> CPUs. To be 100% safe, we should probably also wait until the flush has
> actually been performed
> - A set TLB_EXCL flag always forces the slow-path, allowing the CPUs
> to check for a possible collision with an "exclusive memory region"
> 
> Now, why would requesting the flush (and possibly ensuring that
> it has actually been done) not be enough?

There will always be a race where the normal store fails.  While I
haven't studied your code enough to do a constructive proof, it's enough
to prove the impossibility of what you're trying to do.  Mind, I also
believed for a long time that it was possible to do it!

If we have two CPUs, with CPU 0 executing an LL and CPU 1 executing a
store, you can model this as a consensus problem.  For example, CPU 0
could propose that the subsequent SC succeeds, while CPU 1 proposes that
it fails.  The outcome of the SC instruction depends on who wins.

Therefore, implementing LL/SC requires---on both CPU 0 and CPU
1, and hence for both LL/SC and normal stores---an atomic primitive with
consensus number >= 2.  Other than LL/SC itself, the commonly-available
operations satisfying this requirement are test-and-set (consensus
number 2) and compare-and-swap (infinite consensus number).  Normal
memory reads and writes (called "atomic registers" in multi-processing
research lingo) have consensus number 1; it's not enough.

If the host had LL/SC, CPU 1 could in principle delegate its side of the
consensus problem to the processor; but even that is not a solution
because processors constrain the instructions that can appear between
the load and the store, and this could cause an infinite sequence of
spurious failed SCs.  Another option is transactional memory, but it's
also too slow for normal stores.

The simplest solution is not to implement full LL/SC semantics; instead,
similar to linux-user, a SC operation can perform a cmpxchg from the
value fetched by LL to the argument of SC.  This bypasses the issue
because stores do not have to be instrumented at all, but it does mean
that the emulation suffers from the ABA problem.
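As a rough illustration of this cmpxchg-based middle ground and of the ABA weakness it carries, here is a hypothetical sketch using C11 atomics (the `link_t`, `ll`, and `sc` names are illustrative, not linux-user or QEMU API):

```c
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    uint64_t ll_value;          /* value observed by the last LL */
    _Atomic uint64_t *ll_addr;  /* location it was loaded from */
} link_t;

/* LL: remember the address and the value read. */
static uint64_t ll(link_t *l, _Atomic uint64_t *addr)
{
    l->ll_addr = addr;
    l->ll_value = atomic_load(addr);
    return l->ll_value;
}

/* SC as a cmpxchg from the LL value to the new value.
 * Returns 0 on success, 1 on failure (guest convention). */
static int sc(link_t *l, uint64_t newval)
{
    uint64_t expected = l->ll_value;
    return atomic_compare_exchange_strong(l->ll_addr, &expected, newval)
           ? 0 : 1;
}
```

The test below shows the ABA problem: the location is modified twice between LL and SC, yet the SC still succeeds because the value happens to match again.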

TLB_EXCL is also a middle-ground, a little bit stronger than cmpxchg.
It's more complex and more accurate, but also not perfect.  Which is
okay, but has to be documented.

Paolo



Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation

2015-12-14 Thread alvise rigo
Hi Paolo,


Thank you for your feedback.

On Mon, Dec 14, 2015 at 10:33 AM, Paolo Bonzini  wrote:
>
>
>
> On 14/12/2015 09:41, Alvise Rigo wrote:
> > In theory, the provided implementation of TCG LoadLink/StoreConditional
> > can be used to properly handle atomic instructions on any architecture.
>
> No, _in theory_ this implementation is wrong.  If a normal store can
> make a concurrent LL-SC pair fail, it's provably _impossible_ to handle
> LL/SC with a wait-free fast path for normal stores.
>
> If we decide that it's "good enough", because the race is incredibly
> rare and doesn't happen anyway for spinlocks, then fine.  But it should
> be represented correctly in the commit messages.


I have not yet commented extensively on this issue, since this is still the
"single-threaded" version of the patch series.
As soon as the next version of mttcg is released, I will rebase
this series on top of the multi-threaded code.

In any case, what I proposed in the mttcg-based v5 was:
- An LL ensures that the TLB_EXCL flag is set in all the CPUs' TLBs.
This is done by requesting a TLB flush from all (not exactly all...) the
CPUs. To be 100% safe, we should probably also wait until the flush has
actually been performed
- A set TLB_EXCL flag always forces the slow-path, allowing the CPUs
to check for a possible collision with an "exclusive memory region"

Now, why would requesting the flush (and possibly ensuring that
it has actually been done) not be enough?

Thank you,
alvise

>
>
> Paolo



Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation

2015-12-14 Thread Paolo Bonzini


On 14/12/2015 09:41, Alvise Rigo wrote:
> In theory, the provided implementation of TCG LoadLink/StoreConditional
> can be used to properly handle atomic instructions on any architecture.

No, _in theory_ this implementation is wrong.  If a normal store can
make a concurrent LL-SC pair fail, it's provably _impossible_ to handle
LL/SC with a wait-free fast path for normal stores.

If we decide that it's "good enough", because the race is incredibly
rare and doesn't happen anyway for spinlocks, then fine.  But it should
be represented correctly in the commit messages.

Paolo