Cc:ing tech-kern, to get wider feedback. Thread started here: http://mail-index.netbsd.org/source-changes-d/2011/08/21/msg003897.html
>>>>> "JM" == Jean-Yves Migeon <jeanyves.mig...@free.fr> writes: JM> On Mon, 22 Aug 2011 12:47:40 +0200, Manuel Bouyer wrote: >>> This is slightly more complicated than it appears. Some of the >>> "ops" in a per-cpu queue may have ordering dependencies with >>> other cpu queues, and I think this would be hard to express >>> trivially. (an example would be a pte update on one queue, and >>> reading the same pte read on another queue - these cases are >>> quite analogous (although completely unrelated) >> Hi, So I had a better look at this - implemented per-cpu queues and messed with locking a bit: >> read don't go through the xpq queue, don't they ? JM> Nope, PTE are directly obtained from the recursive mappings JM> (vtopte/kvtopte). Let's call this "out of band" reads. But see below for "in-band" reads. JM> Content is "obviously" only writable by hypervisor (so it can JM> keep control of his mapping alone). >> I think this is similar to a tlb flush but the other way round, I >> guess we could use a IPI for this too. JM> IIRC that's what the current native x86 code does: it uses an JM> IPI to signal other processors that a shootdown is necessary. Xen's TLB_FLUSH operation is synchronous, and doesn't require an IPI (within the domain), which makes the queue ordering even more important (to make sure that stale ptes are not reloaded before the per-cpu queue has made progress). Yes, we can implement a roundabout ipi driven queueflush + tlbflush scheme(described below), but that would be performance sensitive, and the basic issue won't go away, imho. Let's stick to the xpq ops for a second, ignoring "out-of-band" reads (for which I agree that your assertion, that locking needs to be done at a higher level, holds true). The question here, really is, what are the global ordering requirements of per-cpu memory op queues, given the following basic "ops": i) write memory (via MMU_NORMAL_PT_UPDATE, MMU_MACHPHYS_UPDATE) ii) read memory via: MMUEXT_PIN_L1_TABLE MMUEXT_PIN_L2_TABLE MMUEXT_PIN_L3_TABLE MMUEXT_PIN_L4_TABLE MMUEXT_UNPIN_TABLE MMUEXT_NEW_BASEPTR MMUEXT_TLB_FLUSH_LOCAL MMUEXT_INVLPG_LOCAL MMUEXT_TLB_FLUSH_MULTI MMUEXT_INVLPG_MULTI MMUEXT_TLB_FLUSH_ALL MMUEXT_INVLPG_ALL MMUEXT_FLUSH_CACHE MMUEXT_NEW_USER_BASEPTR (ie; anything that will cause the processor to re-read data updated via another cpu (via, for eg: pte update with i), above) There's two ways I can think of fixing this: a) *before* queueing a local read-op, synchronously flush queues on all other CPUs via ipis. This is slightly racy, but can be done, I think. An optimisation for invlpg could be to implement a scoreboard that watches mem. locations that have been queued for update on any cpu. Scan through the scoreboard for the memory range we're invlpg-ing. If it's not there, there's no need to flush any queues on other cpus. b) read-ops wait on a global condvar. If it's set, a write-op that needs flushing is pending. Wait (with optional timeout and ipi-"nudge") until the remote queue is flushed. When flushing a queue, send a cv_broadcast to any waiters. Option "b)" is slightly better than my current scheme which is to lock any and all mmu-ops and operate the queues synchronously (via XENDEBUG_SYNC). I cannot think of anything else, other than ad-hoc locking + queue flushing, which could be hard to maintain and debug in the long run. 
 >>> I'm thinking that it might be easier and more justifiable to
 >>> nuke the current queue scheme and implement shadow page tables,
 >>> which would fit more naturally and efficiently with CAS pte
 >>> updates, etc.

 >> I'm not sure this would completely fix the issue: with shadow
 >> page tables you can't use a CAS to assure atomic operation with
 >> the hardware TLB, as this is, precisely, a shadow PT and not the
 >> one used by hardware.

Definitely worth looking into, imho. I'm not very comfortable with
the queue-based scheme for MP.

The CAS doesn't provide any guarantees with respect to the TLB on
native h/w either, afaict. If you do a CAS pte update and the update
succeeds, it's a good idea to invalidate + shoot down anyway (even on
baremetal).

Do let me know your thoughts.

Cheers,
-- Cherry
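PS: to illustrate the last point, this is roughly the ordering I mean
by "CAS + invalidate + shootdown" on baremetal. Untested sketch; I'm
assuming pt_entry_t fits in a u_long (i.e. amd64 - PAE would want the
64-bit cas), and pmap_tlb_shootdown()'s arguments are from memory
rather than the exact interface:

#include <sys/types.h>
#include <sys/atomic.h>
#include <machine/pmap.h>

/*
 * Update a pte with CAS, then invalidate locally and shoot down the
 * other cpus even though the CAS succeeded - the old translation may
 * still be sitting in a TLB somewhere.
 */
static bool
pte_cas_update(struct pmap *pmap, vaddr_t va, pt_entry_t *ptep,
    pt_entry_t opte, pt_entry_t npte)
{
	if (atomic_cas_ulong((volatile u_long *)ptep, opte, npte) != opte)
		return false;	/* lost the race; caller re-reads and retries */

	pmap_update_pg(va);			/* local invlpg */
	pmap_tlb_shootdown(pmap, va, opte);	/* IPI other cpus (args from memory) */
	return true;
}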