Hi Vineet,

On Mon, 2019-06-10 at 15:55 +0000, Vineet Gupta wrote:
> On 6/8/19 11:21 AM, Eugeniy Paltsev wrote:
> > Hi Cupertino,
> > 
> > I tried to use ".bundle_align_mode" directive in ARC assembly, but I got 
> > following error:
> > ----------------->8--------------
> > Assembler messages:
> > Error: unknown pseudo-op: `.bundle_align_mode'
> > ----------------->8--------------
> > 
> > Is it possible to implement it in ARC assembler?
> > There is some context about the reason we want to have it:
> > 
> > I'm trying to add support of jump labels for ARC in linux kernel. Jump 
> > labels
> > provide an interface to generate static branches using self-modifying code.
> > This allows us to implement conditional branches where changing branch
> > direction is expensive but branch selection is basically 'free'.
> > 
> > There is nuance in current implementation:
> > We need to patch code by rewriting 32-bit NOP by 32-bit BRANCH instruction 
> > (or vice versa).
> > It can be easily done with following code:
> > ----------------->8--------------
> > write_32_bit(new_instruction)
> > flush_l1_dcache_range_this_cpu
> > invalidate_l1_icache_range_all_cpu
> > ----------------->8--------------
> > 
> > I$ update will be atomic in most of cases except the patched instruction 
> > share
> > two L1 cache lines (so first 16 bits of instruction are in the one cache 
> > line and
> > last 16 bit are in another cache line).
> > In such case we can execute half-updated instruction if we are patching 
> > live code (and we are unlucky enough :)
> 
> While I understand your need for alignment, I don't see how you can possibly
> execute stray lines.
> dcache flush will be propagated by hardware (SCU) to all cores (as 
> applicable) and
> the icache cache flush xcall is synchronous and will have to finish on all 
> cores
> before we proceed to execute the cod eitself.
> 

It's easy as hell - we may execute some code on one CPU and patch it on another 
CPU at the same moment :)

And here is the example of the situation when we can get the issue:
- Let's imagine we have SMP system with CPU#0 and CPU#1.
- We have instruction which share two L1 cache lines:
~ ---------------------------------|--------------------------------- ~
~            |instruction-Z (32-bit instruction we patch)|            ~
~ ---------------------------------|--------------------------------- ~
~   cache-line-X                   | cache-line-Y                     ~
~ ---------------------------------|--------------------------------- ~

CPU#0 is trying to switch our static branch, so it will patch the code 
(instruction-Z)
CPU#1 is executing this code with our static branch, so it execute the code 
(with instruction-Z) that will be patched.

1) CPU#0: we generate new instruction (to replace 'instruction-Z')
   and write it with 32-bit store (st). It is written to CPU#0 L1 data cache.
2) CPU#0: we call our function flush_icache_range.
   It firstly will flush L1 data cache region on CPU#0.
   In our example it will flush CPU#0 L1 'cache-line-X' and 'cache-line-Y' to 
L2 cache.
3) CPU#1: we are executing code which is in 'cache-line-X'.
   'cache-line-Y' is __NOT__ in the L1 instruction cache of CPU#1.
   We need to execute 'instruction-Z', but only half of the instruction is in 
L1 instruction cache.
   CPU#1 fetch 'cache-line-Y' to L1 instruction cache.
4)  Ooops: now we have corrupted 'instruction-Z' in L1 instruction cache of 
CPU#1:
   First 16 bit of this instruction (which belongs to 'cache-line-X') are old 
(not patched).
   Last 16 bit of this instruction (which belongs to 'cache-line-Y') are new 
(patched). 

> > As of today I simply align by 4 byte instruction which can be patched with 
> > ".balign 4" directive:
> > ----------------->8--------------
> > static __always_inline bool arch_static_branch_jump(struct static_key *key,
> >     bool branch)
> > {
> > asm_volatile_goto(".balign 4\n"
> >  "1:\n"
> >  "b %l[l_yes]\n" // <- instruction which can be patched
> >  ".pushsection __jump_table, \"aw\"\n"
> >  ".word 1b, %l[l_yes], %c0\n"
> >  ".popsection\n"
> >  : : "i" (&((char *)key)[branch]) : : l_yes);
> > 
> > return false;
> > l_yes:
> > return true;
> > }
> > ----------------->8--------------
> > 
> > In that case patched instruction is aligned with one 16-bit NOP if this is 
> > required.
> > However 'align by 4' directive is much stricter than it actually required.
> 
> I don't quite understand. Can u write a couple of lines of pseudo assembly to 
> show
> what the issue is.

All instructions on ARCv2 are aligned by 2 byte (even if the instruction is 
4-byte long).

Using '.balign' directive we align instruction which can be
patched by 4 byte.
So compiler will add one 'nop_s' before our instruction if it is not 4 byte 
aligned.

So this code
------------------->8---------------
-----
 .balign 4
 1:
 b %l[l_yes]   #<- instruction which can be patched
------------------->8--------------------
may turn into this:
-----------------
-->8--------------------
bla-bla-bla
 b l_yes       #<- instruction which can be patched
bla-bla-bla
------------------->8--------------------
or that:
----
--------------->8--------------------
bla-bla-bla
 nop_s         # padding
 b l_yes       #<- instruction which can be patched
bla-bla-bla
----------------
--->8--------------------

In most of the cases this extra 'nop_s' is unnecessary as it is fine to have 
our instruction
not 4 byte aligned if it doesn't cross l1 cache line boundary.

It is't the huge problem - but we are adding one unnecessary 'nop' while we try 
to optimize hot code.

> 
> > It's enough
> > that our 32-bit instruction don't cross l1 cache line boundary.
> > That will save us from adding useless NOP padding in most of the cases.
> > It can be implemented with ".bundle_align_mode" directive which isn't 
> > supported by ARC AS unfortunately.
> 
> This seems like a reasonable request (contingent to the difficulty of
> implementation in binutils). but I can't comprehend why you would need it.

If we will have ".bundle_align_mode" directive supported we can add extra 
'nop_s' only if it's really required
(if our instruction _really_ cross L1 cache line boundary). So we won't add 
useless NOP to hot code.

-- 
 Eugeniy Paltsev

Reply via email to