On 12/23/2009 08:52 AM, Mike Frysinger wrote:
in looking at converting the current bfin read/write mmrs from inline
assembly load/store to a volatile pointer read/write, i noticed some
bad behavior when working with smaller-than-32-bit types.  i vaguely
recall talking about this kind of thing before, but i can't find it in
my mail archives, and i think it had more to do with the inline
assembly than with volatile loads.

for example, if the bfin read/write looked like (ignore the "no 8-bit
mmrs issue"):
#define bfin_read8(addr)  (*(volatile unsigned char *)(addr))
#define bfin_read16(addr) (*(volatile unsigned short *)(addr))
#define bfin_read32(addr) (*(volatile unsigned int *)(addr))
#define bfin_write8(addr, val)  (*(volatile unsigned char *)(addr) = (val))
#define bfin_write16(addr, val) (*(volatile unsigned short *)(addr) = (val))
#define bfin_write32(addr, val) (*(volatile unsigned int *)(addr) = (val))
#define bfin_read(addr) \
({ \
     sizeof(*(addr)) == 1 ? bfin_read8(addr)  : \
     sizeof(*(addr)) == 2 ? bfin_read16(addr) : \
     sizeof(*(addr)) == 4 ? bfin_read32(addr) : \
     ({BUG(); 0;}); \
})

and we do something like:
unsigned int foo16(unsigned short *v)
{
     return bfin_read(v) + bfin_read(v + 4);
}

had i used inline assembly (what we have now), we'd get:
         P2 = R0;
#APP
         R1 = w[P2] (z);
#NO_APP
         P2 += 8;
#APP
         R0 = w[P2] (z);
#NO_APP
         R0 = R0 + R1;
         rts;

but with volatiles, gcc will give us:
         P2 = R0;
         R1 = W [P2] (Z); // 1
         R0 = W [P2+8] (X); // 2
         R0 = R0.L (Z); // 3
         R1 = R1.L (Z); // 4
         R0 = R0 + R1;
         rts;

this is better and worse than using inline assembly.  the upside is
that gcc will use address offsets the majority of the time ([P2+8]),
which is exactly what we wanted.  but it seems we need to fix some
things in our compiler first before we can make the switch ...

first, the extensions used in (1) and (2) are "random".  sometimes i
get (Z) and sometimes i get (X).  for example, had i used the 8-bit
version, i see (X) and (X).  depending on how i change the code, the
extension seems to change.  but considering these are unsigned types,
i would expect these to always be (Z).  had they been signed, i would
have expected (X).

This is a side effect of the second issue below: since (3) and (4) are
generated anyway, it does not matter whether (Z) or (X) is used in the
load.

second, gcc seems to miss the fact that the hardware took care of the
extension.  (3) and (4) shouldn't exist at all, because the load already
did the right thing via (Z) and (X), setting the higher part of the
register.

This seems to be the same as an upstream bug:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36873

If you remove volatile, you will get the following expected sequence.

so ideally, i would have expected the generated code:
         P2 = R0;
         R1 = W [P2] (Z);
         R0 = W [P2+8] (Z);
         R0 = R0 + R1;
         rts;

does gcc need fixing (and i'll file a tracker item), or am i missing
something stupid here?

That bug was set to "INVALID" with the comment "We don't optimize volatile accesses."  But I don't know why volatile would prevent this kind of optimization.  I need to dig more or ask on the gcc mailing list.


Jie
_______________________________________________
Toolchain-devel mailing list
Toolchain-devel@blackfin.uclinux.org
https://blackfin.uclinux.org/mailman/listinfo/toolchain-devel