On Mon, 2024-02-19 at 12:22 +0100, Jan Beulich wrote:
> On 15.02.2024 14:41, Oleksii wrote:
> > > > +        : "=r" (ret), "+A" (*ptr) \
> > > > +        : "r" (new) \
> > > > +        : "memory" ); \
> > > > +})
> > > > +
> > > > +#define emulate_xchg_1_2(ptr, new, ret, release_barrier,
> > > > acquire_barrier) \
> > > > +({ \
> > > > +    uint32_t *ptr_32b_aligned = (uint32_t
> > > > *)ALIGN_DOWN((unsigned
> > > > long)ptr, 4); \
> > > 
> > > You now appear to assume that this macro is only used with inputs
> > > not
> > > crossing word boundaries. That's okay as long as suitably
> > > guaranteed
> > > at the use sites, but imo wants saying in a comment.
> > > 
> > > > +    uint8_t mask_l = ((unsigned long)(ptr) & (0x8 -
> > > > sizeof(*ptr)))
> > > > * BITS_PER_BYTE; \
> > > 
> > > Why 0x8 (i.e. spanning 64 bits), not 4 (matching the uint32_t use
> > > above)?
> > The idea to read 8 bytes was to deal with crossing word boundary.
> > So if
> > our address is 0x3 and we have to xchg() 2 bytes, what will cross 4
> > byte boundary. Instead we align add 0x3, so it will become 0x0 and
> > then
> > just always work with 8 bytes.
> 
> Then what if my 2-byte access crosses a dword boundary? A cache line
> one? A page one?
Everything looks okay to me, except in the case of a page boundary.

In the scenario of a dword boundary:

0 1 2 3 4 5 6 7 8 9 ...
X X X X X X X 1 1 X

Assuming a variable starts at address 7, 4-byte alignment will be
enforced, and 8 bytes will be processed starting from address 4.

Concerning a cache line, it should still work, with potential
performance issues arising only if a part of the variable is cached
while another part is not.

Regarding page crossing, I acknowledge that it could be problematic if
the variable is entirely located at the end of a page, as there is no
guarantee that the next page exists. In this case, it would be
preferable to consistently read 4 bytes with 4-byte alignment:

X 4094 4095 4096?
X  1    1    ?

However, if the variable spans two pages, proper page mapping should be
ensured.

It appears sensible to reconsider the macros and implement 4-byte
alignment and 4-byte access, but then this is not clear how better to
deal with first case ( dword boundary ). Panic ? or use the macros
twice for address 4, and address 8?

> 
> > > > +    unsigned long new_ = (unsigned long)(new) << mask_l; \
> > > > +    unsigned long ret_; \
> > > > +    unsigned long rc; \
> > > 
> > > Similarly, why unsigned long here?
> > sizeof(unsigned long) is 8 bytes and it was chosen as we are
> > working
> > with lc/sc.d which are working with 8 bytes.
> > 
> > > 
> > > I also wonder about the mix of underscore suffixed (or not)
> > > variable
> > > names here.
> > If the question about ret_, then the same as before size of ret
> > argument of the macros will be 1 or 2, but {lc/sc}.d expected to
> > work
> > with 8 bytes.
> 
> Then what's the uint32_t * about?
Agree, then it should be also unsigned long.
> > 

> > > > +    __typeof__(*(ptr)) new__ = (new); \
> > > > +    __typeof__(*(ptr)) ret__; \
> > > > +    switch (size) \
> > > > +    { \
> > > > +    case 1: \
> > > > +    case 2: \
> > > > +        emulate_xchg_1_2(ptr__, new__, ret__, release_barrier,
> > > > acquire_barrier); \
> > > 
> > > ... this would become
> > > 
> > >         ret__ = emulate_xchg_1_2(ptr__, new__, release_barrier,
> > > acquire_barrier); \
> > > 
> > > But, unlike assumed above, there's no enforcement here that a 2-
> > > byte
> > > quantity won't cross a word, double-word, cache line, or even
> > > page
> > > boundary. That might be okay if then the code would simply crash
> > > (like
> > > the AMO insns emitted further down would), but aiui silent
> > > misbehavior
> > > would result.
> > As I mentioned above with 4-byte alignment and then reading and
> > working
> > with 8-byte then crossing a word or double-word boundary shouldn't
> > be
> > an issue.
> > 
> > I am not sure that I know how to check that we are crossing cache
> > line
> > boundary.
> > 
> > Regarding page boundary, if the next page is mapped then all should
> > work fine, otherwise it will be an exception.
> 
> Are you sure lr.d / sc.d are happy to access across such a boundary,
> when both pages are mapped?
If they are mapped, my expectation that lr.d and sc.d should be happy.

> 
> To me it seems pretty clear that for atomic accesses you want to
> demand natural alignment, i.e. 2-byte alignment for 2-byte accesses.
> This way you can be sure no potentially problematic boundaries will
> be crossed.
It makes sense, but I am not sure that I can guarantee that a user of
macros will always have 2-byte alignment (except during a panic) in the
future.

Even now, I am uncertain that everyone will be willing to add
__alignment(...) to struct vcpu->is_urgent
(xen/include/xen/sched.h:218) and other possible cases to accommodate
RISC-V requirements.

> 
> > > > +        break; \
> > > > +    case 4: \
> > > > +        __amoswap_generic(ptr__, new__, ret__,\
> > > > +                          ".w" sfx,  release_barrier,
> > > > acquire_barrier); \
> > > > +        break; \
> > > > +    case 8: \
> > > > +        __amoswap_generic(ptr__, new__, ret__,\
> > > > +                          ".d" sfx,  release_barrier,
> > > > acquire_barrier); \
> > > > +        break; \
> > > > +    default: \
> > > > +        STATIC_ASSERT_UNREACHABLE(); \
> > > > +    } \
> > > > +    ret__; \
> > > > +})
> > > > +
> > > > +#define xchg_relaxed(ptr, x) \
> > > > +({ \
> > > > +    __typeof__(*(ptr)) x_ = (x); \
> > > > +    (__typeof__(*(ptr)))__xchg_generic(ptr, x_,
> > > > sizeof(*(ptr)),
> > > > "", "", ""); \
> > > > +})
> > > > +
> > > > +#define xchg_acquire(ptr, x) \
> > > > +({ \
> > > > +    __typeof__(*(ptr)) x_ = (x); \
> > > > +    (__typeof__(*(ptr)))__xchg_generic(ptr, x_,
> > > > sizeof(*(ptr)), \
> > > > +                                       "", "",
> > > > RISCV_ACQUIRE_BARRIER); \
> > > > +})
> > > > +
> > > > +#define xchg_release(ptr, x) \
> > > > +({ \
> > > > +    __typeof__(*(ptr)) x_ = (x); \
> > > > +    (__typeof__(*(ptr)))__xchg_generic(ptr, x_,
> > > > sizeof(*(ptr)),\
> > > > +                                       "",
> > > > RISCV_RELEASE_BARRIER,
> > > > ""); \
> > > > +})
> > > > +
> > > > +#define xchg(ptr,x) \
> > > > +({ \
> > > > +    __typeof__(*(ptr)) ret__; \
> > > > +    ret__ = (__typeof__(*(ptr))) \
> > > > +            __xchg_generic(ptr, (unsigned long)(x),
> > > > sizeof(*(ptr)), \
> > > > +                           ".aqrl", "", ""); \
> > > 
> > > The .aqrl doesn't look to affect the (emulated) 1- and 2-byte
> > > cases.
> > > 
> > > Further, amoswap also exists in release-only and acquire-only
> > > forms.
> > > Why do you prefer explicit barrier insns over those? (Looks to
> > > similarly apply to the emulation path as well as to the cmpxchg
> > > machinery then, as both lr and sc also come in all four possible
> > > acquire/release forms. Perhaps for the emulation path using
> > > explicit barriers is better, in case the acquire/release forms of
> > > lr/sc - being used inside the loop - might perform worse.)
> > As 1- and 2-byte cases are emulated I decided that is not to
> > provide
> > sfx argument for emulation macros as it will not have to much
> > affect on
> > emulated types and just consume more performance on acquire and
> > release
> > version of sc/ld instructions.
> 
> Question is whether the common case (4- and 8-byte accesses)
> shouldn't
> be valued higher, with 1- and 2-byte emulation being there just to
> allow things to not break altogether.
If I understand you correctly, it would make sense to add the 'sfx'
argument for the 1/2-byte access case, ensuring that all options are
available for 1/2-byte access case as well.


~ Oleksii


Reply via email to