Re: [PosibleSpam] Re: z constraint in powerpc inline assembly ?
On Thu, Jan 16, 2020 at 07:57:29AM -0600, Segher Boessenkool wrote:
> On Thu, Jan 16, 2020 at 09:06:08AM +0100, Gabriel Paubert wrote:
> > On Thu, Jan 16, 2020 at 07:11:36AM +0100, Christophe Leroy wrote:
> > > Hi Segher,
> > >
> > > I'm trying to see if we could enhance TCP checksum calculations by
> > > splitting inline assembly blocks to give GCC the opportunity to mix
> > > it with other stuff, but I'm getting difficulties with the carry.
> > >
> > > As far as I can read in the documentation, the z constraint represents
> > > '‘XER[CA]’ carry bit (part of the XER register)'
> >
> > Well, the documentation is very optimistic. From the GCC source code
> > (thanks for switching to git last week-end ;-)), it is clear that the
> > carry is not, for the time being, properly modeled.
>
> What? It certainly *is*, I spent ages on that back in 2014 and before.
> See gcc.gnu.org/PR64180 etc.
>
> You can not put the carry as input or output to an asm, of course: no C
> variable can be assigned to it.
>
> We don't do the "flag outputs" thing, either, as it is largely useless
> for Power (and using it would often make *worse* code).
>
> If you want to access a carry, write C code that does that operation.
> The compiler knows how to optimise it well.
>
> > Right now, in the machine description, all setters and users of the carry
> > are in the same block of generated instructions.
>
> No, they are not. For over five years now. (Since GCC 5).
>
> > For a start, all single instructions patterns that set the carry (and
> > do not use it) as a side effect should mention that they clobber the
> > carry, otherwise inserting one between a setter and a user of the carry
> > would break.
>
> And they do.

Apologies, I don't know how I could misread the .md files this badly.
Indeed I see everything now that you mention it.

I'm still a bit surprised that I have found zero "z" constraints in the
whole gcc/config/rs6000 directory. Everything seems to be CA_REGNO.

> All asms that change the carry should mention that, too, but this is
> automatically done for all inline asms, because there was a lot of code
> in the wild that does not clobber it.

I was not aware of this. In any case, I would always give my inline
assembly code clobbers that are as correct as possible.

> > This includes all arithmetic right shifts (sra[wd]{,i}),
> > subfic, addic{,\.}, and I may have forgotten some.
>
> {add,subf}{ic,c,e,ze,me} and sra[wd][i] and their dots. Sure. And
> mcrxr and mcrxrx and mfxer and mtxer. That's about it.

Yes, but are the last ones (the moves) ever generated by the compiler?

Looking at the source (again) it seems that even lswi has disappeared.

> We don't model the second carry at all yet btw, in GCC. Not too many
> people know it exists even, so no big loss there.

Anyway, I couldn't use it. I tried to buy a Talos II at work but
management made it too complex to negotiate. The problem was not the
money, but the paperwork :-(.

Now my most powerful PPC machine is a 17" Powerbook G4.

> (One nasty was that addi. does not exist, so we used addic. where it was
> wanted before, so that had to change.)
>
> Segher

Regards,
Gabriel
RE: z constraint in powerpc inline assembly ?
> You mean the mpc8xx, but I'm also using the mpc832x which has an e300c2
> core and is capable of executing 2 insns in parallel if not in the same
> Unit.

That should let you do a memory read and an add.
(I can't remember if the ppc has 'add from memory' but that is likely to
use both units anyway.)
An infinitely unrolled loop will then be 4 bytes/clock (for 32bit).
If you get to 3 for a real loop you are doing ok.

Remember, unroll too much and you displace other code from the i-cache.
Also the i-cache loads themselves kill you.
(A hot-cache benchmark won't see this...)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Re: z constraint in powerpc inline assembly ?
On 16/01/2020 at 17:21, Segher Boessenkool wrote:
> Christophe uses a very primitive 32-bit cpu, not even superscalar. A
> loop doing adde is pretty much optimal, probably wants some unrolling
> though.

You mean the mpc8xx, but I'm also using the mpc832x which has an e300c2
core and is capable of executing 2 insns in parallel if not in the same
Unit.

Christophe
RE: z constraint in powerpc inline assembly ?
From: Segher Boessenkool
> Sent: 16 January 2020 16:22
...
> > However a loop of 'add with carry' instructions may not be the
> > fastest code by any means.
> > Because the carry flag is needed for every 'adc' you can't do more
> > than one adc per clock.
> > This limits you to 8 bytes/clock on a 64bit system - even one
> > that can schedule multiple memory reads and lots of instructions
> > every clock.
> >
> > I don't know ppc, but on x86 you don't even get 1 adc per clock
> > until very recent (Haswell I think) cpus.
> > Sandy/Ivy bridge will do so if you add to alternate registers.
>
> The carry bit is renamed just fine on all modern Power cpus. On Power9
> there is an extra carry bit, precisely so you can do two interleaved
> chains. And you can run lots of these insns at once, every cycle.

The limitation on old x86 was that each u-op could only have 2 inputs.
Since adc needs 3 it always took 2 clocks.
The first 'fix' still had an extra delay on the result register.

There is also a big problem of false dependencies against the flags.
PPC may not have this problem, but it makes it very difficult to
loop carry any of the flags.
Using 'dec' (which doesn't affect carry, but does set zero) is really slow.

Even though the latest x86 cpus have ADOX and ADCX (which use the overflow
and carry flags) and can run them in parallel, the LOOP 'dec jump non-zero'
instruction is microcoded and serialising!

I have got 12 bytes/clock without too much unrolling, but it is hard work
and probably not worth the effort.

...
> Christophe uses a very primitive 32-bit cpu, not even superscalar. A
> loop doing adde is pretty much optimal, probably wants some unrolling
> though.

Or interleaving so it does read_a, [read_b, adc_a, read_a, adc_b]* adc_a.
That might be enough to get the loop 'for free' if there are memory stalls.

> Do normal 64-bit adds, and in parallel also accumulate the values shifted
> right by 32 bits. You can add 4G of them this way, and restore the 96-bit
> actual sum from these two accumulators, so that you can fold it to a proper
> ones' complement sum after the loop.

That is probably too many instructions per word - unless you are using
simd ones.

> But you can easily beat 8B/clock using vectors, or doing multiple addition
> chains (interleaved) in parallel. Not that it helps, your limiting factor
> is the memory bandwidth anyway, if anything in the memory pipeline stalls
> all your optimisations are for nothing.

Yep, if the data isn't in the L1 cache anything complex is a waste of time.

Unrolling too much just makes the top/bottom code take too long
and then it dominates for a lot of 'real world' buffers.

	David
Re: z constraint in powerpc inline assembly ?
Hi!

On Thu, Jan 16, 2020 at 03:54:58PM +, David Laight wrote:
> If you are trying to 'loop carry' the 'carry flag' with 'add with carry'
> instructions you'll almost certainly need to write the loop in asm.
> Since the loop itself is simple, this probably doesn't matter.

Agreed.

> However a loop of 'add with carry' instructions may not be the
> fastest code by any means.
> Because the carry flag is needed for every 'adc' you can't do more
> than one adc per clock.
> This limits you to 8 bytes/clock on a 64bit system - even one
> that can schedule multiple memory reads and lots of instructions
> every clock.
>
> I don't know ppc, but on x86 you don't even get 1 adc per clock
> until very recent (Haswell I think) cpus.
> Sandy/Ivy bridge will do so if you add to alternate registers.

The carry bit is renamed just fine on all modern Power cpus. On Power9
there is an extra carry bit, precisely so you can do two interleaved
chains. And you can run lots of these insns at once, every cycle.

On older cpus there were other limitations as well, but those have
essentially been solved.

> For earlier cpu it is actually difficult to beat the 4 bytes/clock
> you get by adding 32bit values to a 64bit register in C code.

Christophe uses a very primitive 32-bit cpu, not even superscalar. A
loop doing adde is pretty much optimal, probably wants some unrolling
though.

> One possibility is to do a normal add then shift the carry
> into a separate register.
> After 64 words use 'popcnt' to sum the carry bits.
> With 2 accumulators (and carry shifts) you'd need to
> break the loop every 1024 bytes.
> This should beat 8 bytes/clock if you can execute more than
> 1 memory read, one add and one shift each clock.

Do normal 64-bit adds, and in parallel also accumulate the values shifted
right by 32 bits. You can add 4G of them this way, and restore the 96-bit
actual sum from these two accumulators, so that you can fold it to a proper
ones' complement sum after the loop.

But you can easily beat 8B/clock using vectors, or doing multiple addition
chains (interleaved) in parallel. Not that it helps: your limiting factor
is the memory bandwidth anyway, and if anything in the memory pipeline
stalls, all your optimisations are for nothing.

Segher
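The two-accumulator idea described in the message above could be sketched roughly as follows. This is only an illustration in portable C, not code from the thread: the names `fold64` and `csum_dwords` are hypothetical, the buffer is assumed to be an array of whole 64-bit words, and the result is folded to a 32-bit ones' complement sum rather than a final 16-bit checksum.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fold a 64-bit value to a 32-bit ones' complement sum. */
static uint32_t fold64(uint64_t x)
{
    x = (x & 0xffffffffu) + (x >> 32);  /* add upper half into lower half */
    x = (x & 0xffffffffu) + (x >> 32);  /* the first fold may carry once more */
    return (uint32_t)x;
}

/* Hypothetical helper: ones' complement sum of n 64-bit words, n <= 2^32. */
static uint32_t csum_dwords(const uint64_t *buf, size_t n)
{
    uint64_t lo = 0;  /* plain 64-bit sum, allowed to wrap modulo 2^64 */
    uint64_t hi = 0;  /* sum of the upper halves, cannot wrap for n <= 2^32 */

    for (size_t i = 0; i < n; i++) {
        lo += buf[i];
        hi += buf[i] >> 32;
    }

    /* Recover the sum of the lower halves: subtracting the upper-half
     * contribution modulo 2^64 is exact because that sum is below 2^64. */
    uint64_t lsum = lo - (hi << 32);

    /* True sum is hi * 2^32 + lsum, and 2^32 is 1 modulo 2^32 - 1. */
    return fold64((uint64_t)fold64(hi) + fold64(lsum));
}
```

The loop body has no carry dependency between iterations, which is the point of the technique; whether it actually beats an adde loop depends on the cpu and, as noted above, on memory bandwidth.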
RE: z constraint in powerpc inline assembly ?
From: Christophe Leroy
> Sent: 16 January 2020 06:12
>
> I'm trying to see if we could enhance TCP checksum calculations by
> splitting inline assembly blocks to give GCC the opportunity to mix it
> with other stuff, but I'm getting difficulties with the carry.

If you are trying to 'loop carry' the 'carry flag' with 'add with carry'
instructions you'll almost certainly need to write the loop in asm.
Since the loop itself is simple, this probably doesn't matter.

However a loop of 'add with carry' instructions may not be the
fastest code by any means.
Because the carry flag is needed for every 'adc' you can't do more
than one adc per clock.
This limits you to 8 bytes/clock on a 64bit system - even one
that can schedule multiple memory reads and lots of instructions
every clock.

I don't know ppc, but on x86 you don't even get 1 adc per clock
until very recent (Haswell I think) cpus.
Sandy/Ivy bridge will do so if you add to alternate registers.

For earlier cpu it is actually difficult to beat the 4 bytes/clock
you get by adding 32bit values to a 64bit register in C code.

One possibility is to do a normal add then shift the carry
into a separate register.
After 64 words use 'popcnt' to sum the carry bits.
With 2 accumulators (and carry shifts) you'd need to
break the loop every 1024 bytes.
This should beat 8 bytes/clock if you can execute more than
1 memory read, one add and one shift each clock.

I've not tried this on an old x86 cpu - which would need a software 'popcnt'.
It got close to 8 bytes/clock on Ivy bridge.
It almost certainly beats the 4 bytes/clock of the current x86-64 code
on such systems.

	David
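For reference, the "adding 32bit values to a 64bit register in C code" baseline mentioned above might look something like the sketch below. The function name `csum_words` is made up for illustration, and the buffer is assumed to hold whole 32-bit words; it is not taken from any kernel source.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: ones' complement sum of 32-bit words, using a
 * 64-bit accumulator so that no carry flag is needed inside the loop. */
static uint32_t csum_words(const uint32_t *buf, size_t nwords)
{
    uint64_t acc = 0;

    for (size_t i = 0; i < nwords; i++)
        acc += buf[i];  /* cannot overflow for fewer than 2^32 words */

    acc = (acc & 0xffffffffu) + (acc >> 32);  /* fold the carries back in */
    acc = (acc & 0xffffffffu) + (acc >> 32);  /* the fold itself may carry */
    return (uint32_t)acc;
}
```

Each iteration is one load and one add, which is where the 4 bytes/clock figure comes from on a cpu that can do both per clock.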
Re: z constraint in powerpc inline assembly ?
Hi!

On Thu, Jan 16, 2020 at 07:11:36AM +0100, Christophe Leroy wrote:
> I'm trying to see if we could enhance TCP checksum calculations by
> splitting inline assembly blocks to give GCC the opportunity to mix it
> with other stuff, but I'm getting difficulties with the carry.
>
> As far as I can read in the documentation, the z constraint represents
> '‘XER[CA]’ carry bit (part of the XER register)'
>
> I've tried the following, but I get errors. Can you help ?
>
> unsigned long cksum(unsigned long a, unsigned long b, unsigned long c)
> {
>     unsigned long sum;
>     unsigned long carry;
>
>     asm("addc %0, %2, %3" : "=r"(sum), "=z"(carry) : "r"(a), "r"(b));
>     asm("adde %0, %0, %2" : "+r"(sum), "+z"(carry) : "r"(c));
>     asm("addze %0, %0" : "+r"(sum) : "z"(carry));
>
>     return sum;
> }

The only register allowed by "z" is a fixed register. You cannot use "z"
in inline asm.

Just write this as C? It should do a reasonable job of it. If you want
*good* code, you need to write it in *actual* assembler code, anyway
(hand scheduled and everything).

Segher
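One way to read the "just write this as C" suggestion is sketched below. It is an assumption about what such C could look like, not code from the thread: the name `cksum3` is hypothetical, and fixed 32-bit operands are assumed (matching `unsigned long` on a 32-bit PowerPC target). It computes the same ones' complement sum as the addc / adde / addze sequence, with the carries collected in the upper half of a 64-bit temporary.

```c
#include <stdint.h>

/* Hypothetical helper: add three 32-bit words with end-around carry. */
static inline uint32_t cksum3(uint32_t a, uint32_t b, uint32_t c)
{
    uint64_t sum = (uint64_t)a + b + c;  /* at most 34 bits, no overflow */

    sum = (sum & 0xffffffffu) + (sum >> 32);  /* add the carries back in */
    sum = (sum & 0xffffffffu) + (sum >> 32);  /* that add may carry once */
    return (uint32_t)sum;
}
```

How well a given GCC version turns this into carry-using instructions on a 32-bit target is something to check in the generated assembly rather than assume.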
Re: z constraint in powerpc inline assembly ?
On Thu, Jan 16, 2020 at 09:06:08AM +0100, Gabriel Paubert wrote:
> On Thu, Jan 16, 2020 at 07:11:36AM +0100, Christophe Leroy wrote:
> > Hi Segher,
> >
> > I'm trying to see if we could enhance TCP checksum calculations by splitting
> > inline assembly blocks to give GCC the opportunity to mix it with other
> > stuff, but I'm getting difficulties with the carry.
> >
> > As far as I can read in the documentation, the z constraint represents
> > '‘XER[CA]’ carry bit (part of the XER register)'
>
> Well, the documentation is very optimistic. From the GCC source code
> (thanks for switching to git last week-end ;-)), it is clear that the
> carry is not, for the time being, properly modeled.

What? It certainly *is*, I spent ages on that back in 2014 and before.
See gcc.gnu.org/PR64180 etc.

You can not put the carry as input or output to an asm, of course: no C
variable can be assigned to it.

We don't do the "flag outputs" thing, either, as it is largely useless
for Power (and using it would often make *worse* code).

If you want to access a carry, write C code that does that operation.
The compiler knows how to optimise it well.

> Right now, in the machine description, all setters and users of the carry
> are in the same block of generated instructions.

No, they are not. For over five years now. (Since GCC 5).

> For a start, all single instructions patterns that set the carry (and
> do not use it) as a side effect should mention that they clobber the
> carry, otherwise inserting one between a setter and a user of the carry
> would break.

And they do.

All asms that change the carry should mention that, too, but this is
automatically done for all inline asms, because there was a lot of code
in the wild that does not clobber it.

> This includes all arithmetic right shifts (sra[wd]{,i}),
> subfic, addic{,\.}, and I may have forgotten some.

{add,subf}{ic,c,e,ze,me} and sra[wd][i] and their dots. Sure. And
mcrxr and mcrxrx and mfxer and mtxer. That's about it.

We don't model the second carry at all yet btw, in GCC. Not too many
people know it exists even, so no big loss there.

(One nasty was that addi. does not exist, so we used addic. where it was
wanted before, so that had to change.)

Segher
Re: z constraint in powerpc inline assembly ?
On Thu, Jan 16, 2020 at 07:11:36AM +0100, Christophe Leroy wrote:
> Hi Segher,
>
> I'm trying to see if we could enhance TCP checksum calculations by splitting
> inline assembly blocks to give GCC the opportunity to mix it with other
> stuff, but I'm getting difficulties with the carry.
>
> As far as I can read in the documentation, the z constraint represents
> '‘XER[CA]’ carry bit (part of the XER register)'

Well, the documentation is very optimistic. From the GCC source code
(thanks for switching to git last week-end ;-)), it is clear that the
carry is not, for the time being, properly modeled.

Right now, in the machine description, all setters and users of the carry
are in the same block of generated instructions.

For a start, all single instructions patterns that set the carry (and
do not use it) as a side effect should mention that they clobber the
carry, otherwise inserting one between a setter and a user of the carry
would break. This includes all arithmetic right shifts (sra[wd]{,i}),
subfic, addic{,\.}, and I may have forgotten some.

If you want to future proof your code just in case, you should also add
an "xer" clobber to all instruction sequences that may modify the carry
bit. But any inline assembly that touches XER might break if GCC is
upgraded to properly model the carry bit, and a lot of code might need
to be audited.

	Gabriel

>
> I've tried the following, but I get errors. Can you help ?
>
> unsigned long cksum(unsigned long a, unsigned long b, unsigned long c)
> {
>     unsigned long sum;
>     unsigned long carry;
>
>     asm("addc %0, %2, %3" : "=r"(sum), "=z"(carry) : "r"(a), "r"(b));
>     asm("adde %0, %0, %2" : "+r"(sum), "+z"(carry) : "r"(c));
>     asm("addze %0, %0" : "+r"(sum) : "z"(carry));
>
>     return sum;
> }
>
>
>
> csum.c: In function 'cksum':
> csum.c:6:2: error: inconsistent operand constraints in an 'asm'
>     asm("addc %0, %2, %3" : "=r"(sum), "=z"(carry) : "r"(a), "r"(b));
>     ^
> csum.c:7:2: error: inconsistent operand constraints in an 'asm'
>     asm("adde %0, %0, %2" : "+r"(sum), "+z"(carry) : "r"(c));
>     ^
> csum.c:8:2: error: inconsistent operand constraints in an 'asm'
>     asm("addze %0, %0" : "+r"(sum) : "z"(carry));
>     ^
>
> Thanks
> Christophe
>
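The "xer" clobber advice above can be illustrated with a small sketch. The helper name `csum_add32` is hypothetical, and the inline asm path is an assumption that only applies to 32-bit PowerPC with GCC-style asm; every other target takes the portable C fallback. The key points are that the carry setter (addc) and its user (addze) stay inside a single asm statement, and the carry is declared as clobbered instead of being passed through a "z" operand.

```c
#include <stdint.h>

/* Hypothetical helper: 32-bit add with end-around carry. */
static inline uint32_t csum_add32(uint32_t csum, uint32_t addend)
{
#if defined(__GNUC__) && defined(__powerpc__) && !defined(__powerpc64__)
    /* CA is set and consumed within this one asm statement, so the
     * compiler never has to carry it between statements; "xer" tells
     * GCC that the carry bit is modified. */
    __asm__("addc %0,%0,%1\n\t"
            "addze %0,%0"
            : "+r"(csum)
            : "r"(addend)
            : "xer");
    return csum;
#else
    /* Portable fallback: the carry ends up in bit 32 of a 64-bit sum. */
    uint64_t sum = (uint64_t)csum + addend;
    return (uint32_t)(sum + (sum >> 32));
#endif
}
```

For example, `csum_add32(0xffffffffu, 1u)` wraps around and adds the carry back in, giving 1 on either path.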
z constraint in powerpc inline assembly ?
Hi Segher,

I'm trying to see if we could enhance TCP checksum calculations by
splitting inline assembly blocks to give GCC the opportunity to mix it
with other stuff, but I'm getting difficulties with the carry.

As far as I can read in the documentation, the z constraint represents
'‘XER[CA]’ carry bit (part of the XER register)'

I've tried the following, but I get errors. Can you help ?

unsigned long cksum(unsigned long a, unsigned long b, unsigned long c)
{
    unsigned long sum;
    unsigned long carry;

    asm("addc %0, %2, %3" : "=r"(sum), "=z"(carry) : "r"(a), "r"(b));
    asm("adde %0, %0, %2" : "+r"(sum), "+z"(carry) : "r"(c));
    asm("addze %0, %0" : "+r"(sum) : "z"(carry));

    return sum;
}

csum.c: In function 'cksum':
csum.c:6:2: error: inconsistent operand constraints in an 'asm'
    asm("addc %0, %2, %3" : "=r"(sum), "=z"(carry) : "r"(a), "r"(b));
    ^
csum.c:7:2: error: inconsistent operand constraints in an 'asm'
    asm("adde %0, %0, %2" : "+r"(sum), "+z"(carry) : "r"(c));
    ^
csum.c:8:2: error: inconsistent operand constraints in an 'asm'
    asm("addze %0, %0" : "+r"(sum) : "z"(carry));
    ^

Thanks
Christophe