Re: [Qemu-devel] [PATCH 08/10] target-avr: adding instruction translation

2016-06-06 Thread Michael Rolnik
On Mon, Jun 6, 2016 at 10:17 PM, Richard Henderson  wrote:

> On 06/05/2016 11:52 PM, Michael Rolnik wrote:
>
>> The truth table shows that these computations are different.
>>
>
> You're not giving the right inputs to the truth table.
>
>> You can't look at the 4th bit because the 4th bits in the inputs were not 0s.
>>
>
> What did you think the xor's do?  They remove the non-zero input bits.
>
> #include <stdio.h>
>
> static int orig(int r, int d, int s)
> {
>   return (((d & s) | (d & ~r) | (s & ~r)) & 2) != 0;
> }
>
> static int mine(int r, int d, int s)
> {
>   return (((d ^ s) ^ r) & 4) != 0;
> }
>
> int main()
> {
>   int s, d;
>   for (s = 0; s < 8; ++s)
>     for (d = 0; d < 8; ++d)
>       {
>         int r = d + s;
>         int o = orig(r, d, s);
>         int m = mine(r, d, s);
>
>         if (o != m)
>           printf("%2d = %d + %d  (o=%d, m=%d)\n", r, d, s, o, m);
>       }
>   return 0;
> }
>
> This performs tests on 3-bit inputs, testing for carry-out on bit 1, just
> like Hf computes carry-out on bit 3.

You are right, I can look at the 4th bit; however, I already do the whole
computation of the C flag.

>
>
>
>> Then you've got the order of the stores wrong.  Your code pushes the LSB
>> before pushing the MSB, which results in the MSB at the lower address,
>> which means big-endian.
>>
>> This is right. However, as far as I understand, AVR is neither little- nor
>> big-endian, because it is an 8-bit architecture (see
>> here http://www.avrfreaks.net/forum/endian-issue). For the time being I
>> defined the platform to be little endian, with the return address as an
>> exception.
>>
>
> True, AVR is an 8-bit core, where endianness doesn't (normally) apply.
> And you are right that ADIW does treat the registers as little-endian.
>
> But the only multi-byte store to memory is in big-endian order.  So why
> wouldn't you want to take advantage of that fact?

I will.

>
>
> You have swapped the overflow conditions for INC and DEC.
>>
> ...
>
>> this is how it's defined in the document.
>>
>
> No, it isn't.  Look again, you've swapped them.

checking again.

>
>
>
> r~
>



-- 
Best Regards,
Michael Rolnik


Re: [Qemu-devel] [PATCH 08/10] target-avr: adding instruction translation

2016-06-06 Thread Richard Henderson

On 06/05/2016 11:52 PM, Michael Rolnik wrote:

The truth table shows that these computations are different.


You're not giving the right inputs to the truth table.


You can't look at the 4th bit because the 4th bits in the inputs were not 0s.


What did you think the xor's do?  They remove the non-zero input bits.

#include <stdio.h>

static int orig(int r, int d, int s)
{
  return (((d & s) | (d & ~r) | (s & ~r)) & 2) != 0;
}

static int mine(int r, int d, int s)
{
  return (((d ^ s) ^ r) & 4) != 0;
}

int main()
{
  int s, d;
  for (s = 0; s < 8; ++s)
    for (d = 0; d < 8; ++d)
      {
        int r = d + s;
        int o = orig(r, d, s);
        int m = mine(r, d, s);

        if (o != m)
          printf("%2d = %d + %d  (o=%d, m=%d)\n", r, d, s, o, m);
      }
  return 0;
}

This performs tests on 3-bit inputs, testing for carry-out on bit 1, just like 
Hf computes carry-out on bit 3.




Then you've got the order of the stores wrong.  Your code pushes the LSB
before pushing the MSB, which results in the MSB at the lower address,
which means big-endian.

This is right. However, as far as I understand, AVR is neither little- nor
big-endian, because it is an 8-bit architecture (see
here http://www.avrfreaks.net/forum/endian-issue). For the time being I defined
the platform to be little endian, with the return address as an exception.


True, AVR is an 8-bit core, where endianness doesn't (normally) apply.  And you 
are right that ADIW does treat the registers as little-endian.


But the only multi-byte store to memory is in big-endian order.  So why 
wouldn't you want to take advantage of that fact?
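
As a rough sketch of that direction (my illustration, not the patch's code;
cpu_sp, MMU_DATA_IDX and the exact SP bookkeeping are assumptions here), the
2_BYTE return-address push could become a single big-endian 16-bit store:

/* Push a 2-byte return address with one MO_BEUW store instead of two byte
   stores; the MSB lands at the lower address, matching the layout the
   byte-wise pushes produced.  */
static void gen_push_ret2(int ret)
{
    TCGv addr = tcg_temp_new();
    TCGv val  = tcg_const_tl(ret & 0xffff);

    tcg_gen_subi_tl(addr, cpu_sp, 1);                      /* word occupies [SP-1 .. SP]    */
    tcg_gen_qemu_st_tl(val, addr, MMU_DATA_IDX, MO_BEUW);  /* one big-endian 16-bit store   */
    tcg_gen_subi_tl(cpu_sp, cpu_sp, 2);                    /* SP ends below the pushed word */

    tcg_temp_free(val);
    tcg_temp_free(addr);
}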



You have swapped the overflow conditions for INC and DEC.

...

this is how it's defined in the document.


No, it isn't.  Look again, you've swapped them.


r~



Re: [Qemu-devel] [PATCH 08/10] target-avr: adding instruction translation

2016-06-06 Thread Michael Rolnik
On Mon, Jun 6, 2016 at 2:34 AM, Richard Henderson  wrote:

> On 06/05/2016 02:47 PM, Michael Rolnik wrote:
>
>> Is there a reason this code isn't going into translate.c?
>> You wouldn't need the declarations in translate-inst.h or translate.h.
>>
>> I see here two levels of logic
>> a. instruction translation
>> b. general flow of program translation.
>>
>
> FWIW, static functions can be automatically inlined by the compiler,
> whereas external function calls can't.  In theory, the compiler could
> auto-inline the entire translator into a single function.

These functions are called through a pointer, so there is no way for them to
be inlined.

>
>
> Order these functions properly and you don't need forward declarations.
>>
>> Is it a requirement? This way it looks cleaner.
>>
>
> Does it?  In my experience it just means you've got to edit two places
> when one changes things.

It is still one place, because a change belongs either to instruction
translation or to the general flow of program translation.
From my point of view, translate.c will have almost no bugs in the very near
future, whereas translate-inst.c will have bug fixes and modifications.
Having it in two files will isolate translate.c from erroneous
modifications.

>
>
>> While this is exactly the formula in the manual, it's also equal to
>>
>> ((Rd ^ Rr) ^ R) & 16
>>
>> Please explain. I don't see it.
>>
>>
>> http://www.wolframalpha.com/input/?i=A+and+B+or+A+and+not+C+or+B+and+not+C,+A+xor+B+xor+C
>>
>
> I did explain:

The truth table shows that these computations are different.

>
>
>> where we examine the difference between the non-carry addition (Rd ^ Rr)
>> and the carry addition (R) to find the carry out of bit 3.  This reduces
>> the operation count to 2 from 5.
>>
>
> It's not a manipulation of the original expression, but a different way of
> looking at the problem.
>
> You want to compute carry-out of bit 3.  Given the *result* of the
> addition, it's easier to examine bit 4, into which carry-in has happened,
> rather than examine bit 3 and re-compute the carry-out.
>
> The AVR hardware probably computes it exactly as described in the manual,
> because that can be done in parallel with the addition, and has a lower
> total gate latency.  This is fairly common in the industry, where the
> documentation follows the implementation more closely than it perhaps
> should.

You can't look at the 4th bit because the 4th bits in the inputs were not 0s.

>
>
>> Note that carry and borrow are related, and thus this is *also* computable
>> via ((Rd ^ Rr) ^ R) on bit 4.
>>
>> please explain, I don't see it
>>
>> http://www.wolframalpha.com/input/?i=not+A+and+B+or+not+A+and+C+or++C+and+B,+A+xor+B+xor+C
>>
>
> As above, given the *result* of the subtraction, examining bit 4 into
> which borrow-in has happened.
>
> Once you accept that, you'll note that the same expression can be used to
> re-create both carry-in and borrow-in.
>
>> I'll also note that the piece-wise store is big-endian, so you could
>> perform this in 1 store for 2_BYTE and 2 stores for 3_BYTE.
>>
>> I got the impression that the platform is little endian.
>>
>
> Then you've got the order of the stores wrong.  Your code pushes the LSB
> before pushing the MSB, which results in the MSB at the lower address,
> which means big-endian.

This is right. However, as far as I understand, AVR is neither little- nor
big-endian, because it is an 8-bit architecture (see
here http://www.avrfreaks.net/forum/endian-issue). For the time being I defined
the platform to be little endian, with the return address as an exception.

>
>
>> Wow.  Um... Surely it would be better to store X and Y internally as whole
>> 24-bit quantities, and Z as a 16-bit quantity (to be extended with rampz,
>> rampd, or eind as needed).
>>
>> rampX/Y/Z are represented now as 0x00ff.
>> X/Y/Z could be represented as 16-bit registers; however, I do not know if and
>> when r26-r31 are used as 8-bit registers, so if X/Y/Z were represented as 16
>> bits it would be hard to use r26-r31 in arithmetic.
>>
>
> You would use a setup like the following, and use these functions instead
> of other direct accesses to the cpu registers.  This setup requires similar
> functions in cpu.h for use by e.g. gdbstub.c.
>
I will have to think about it.

>
>
> TCGv cpu_rb[24];
> TCGv cpu_rw[4];
>
> TCGv read_byte(unsigned rb)
> {
>     TCGv byte = tcg_temp_new();
>     if (rb < 24) {
>         tcg_gen_mov_tl(byte, cpu_rb[rb]);
>     } else {
>         unsigned rw = (rb - 24) / 2;
>         if (rb & 1) {
>             tcg_gen_shri_tl(byte, cpu_rw[rw], 8);
>         } else {
>             tcg_gen_ext8u_tl(byte, cpu_rw[rw]);
>         }
>     }
>     return byte;
> }
>
> void write_byte(unsigned rb, TCGv val)
> {
>     if (rb < 24) {
>         tcg_gen_mov_tl(cpu_rb[rb], val);
>     } else {
>         unsigned rw = (rb - 24) / 2;
>         tcg_gen_deposit_tl(cpu_rw[rw], cpu_rw[rw], val, (rb & 1) * 8, 8);
>     }
> }
>
> /* Return 

Re: [Qemu-devel] [PATCH 08/10] target-avr: adding instruction translation

2016-06-05 Thread Richard Henderson

On 06/05/2016 02:47 PM, Michael Rolnik wrote:

Is there a reason this code isn't going into translate.c?
You wouldn't need the declarations in translate-inst.h or translate.h.

I see here two levels of logic
a. instruction translation
b. general flow of program translation.


FWIW, static functions can be automatically inlined by the compiler, whereas 
external function calls can't.  In theory, the compiler could auto-inline the 
entire translator into a single function.



Order these functions properly and you don't need forward declarations.

Is it a requirement? This way it looks cleaner.


Does it?  In my experience it just means you've got to edit two places when one 
changes things.



While this is exactly the formula in the manual, it's also equal to

((Rd ^ Rr) ^ R) & 16

Please explain. I don't see it.

http://www.wolframalpha.com/input/?i=A+and+B+or+A+and+not+C+or+B+and+not+C,+A+xor+B+xor+C


I did explain:


where we examine the difference between the non-carry addition (Rd ^ Rr)
and the carry addition (R) to find the carry out of bit 3.  This reduces
the operation count to 2 from 5.


It's not a manipulation of the original expression, but a different way of 
looking at the problem.


You want to compute carry-out of bit 3.  Given the *result* of the addition, 
it's easier to examine bit 4, into which carry-in has happened, rather than 
examine bit 3 and re-compute the carry-out.
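
For a concrete example (mine, not from the thread): with Rd = 0x0B and Rr =
0x07, R = 0x12; Rd ^ Rr = 0x0C, so (Rd ^ Rr) ^ R = 0x1E, whose bit 4 is set;
that is exactly the half-carry out of bit 3, since 0xB + 0x7 overflows the
low nibble.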


The AVR hardware probably computes it exactly as described in the manual, 
because that can be done in parallel with the addition, and has a lower total 
gate latency.  This is fairly common in the industry, where the documentation 
follows the implementation more closely than it perhaps should.



Note that carry and borrow are related, and thus this is *also* computable
via ((Rd ^ Rr) ^ R) on bit 4.

please explain, I don't see it
http://www.wolframalpha.com/input/?i=not+A+and+B+or+not+A+and+C+or++C+and+B,+A+xor+B+xor+C


As above, given the *result* of the subtraction, examining bit 4 into which 
borrow-in has happened.


Once you accept that, you'll note that the same expression can be used to 
re-create both carry-in and borrow-in.
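
The borrow case can be checked exhaustively in the same style as the test
program above (this one is my sketch, not part of the thread):

#include <stdio.h>

/* Borrow out of bit 1 per the manual-style formula vs. the xor form,
   over all 3-bit inputs (bit 2 here plays the role of bit 4).  */
static int orig_sub(int r, int d, int s)
{
    return (((~d & s) | (~d & r) | (s & r)) & 2) != 0;
}

static int mine_sub(int r, int d, int s)
{
    return (((d ^ s) ^ r) & 4) != 0;
}

int main()
{
    int s, d;
    for (s = 0; s < 8; ++s)
        for (d = 0; d < 8; ++d) {
            int r = (d - s) & 7;
            if (orig_sub(r, d, s) != mine_sub(r, d, s))
                printf("%d - %d (r=%d)\n", d, s, r);
        }
    return 0;   /* prints nothing: the two forms agree */
}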



I'll also note that the piece-wise store is big-endian, so you could
perform this in 1 store for 2_BYTE and 2 stores for 3_BYTE.

I got the impression that the platform is little endian.


Then you've got the order of the stores wrong.  Your code pushes the LSB before 
pushing the MSB, which results in the MSB at the lower address, which means 
big-endian.



Wow.  Um... Surely it would be better to store X and Y internally as whole
24-bit quantities, and Z as a 16-bit quantity (to be extended with rampz,
rampd, or eind as needed).

rampX/Y/Z are represented now as 0x00ff.
X/Y/Z could be represented as 16-bit registers; however, I do not know if and
when r26-r31 are used as 8-bit registers, so if X/Y/Z were represented as 16
bits it would be hard to use r26-r31 in arithmetic.


You would use a setup like the following, and use these functions instead of 
other direct accesses to the cpu registers.  This setup requires similar 
functions in cpu.h for use by e.g. gdbstub.c.



TCGv cpu_rb[24];
TCGv cpu_rw[4];

TCGv read_byte(unsigned rb)
{
    TCGv byte = tcg_temp_new();
    if (rb < 24) {
        tcg_gen_mov_tl(byte, cpu_rb[rb]);
    } else {
        unsigned rw = (rb - 24) / 2;
        if (rb & 1) {
            tcg_gen_shri_tl(byte, cpu_rw[rw], 8);
        } else {
            tcg_gen_ext8u_tl(byte, cpu_rw[rw]);
        }
    }
    return byte;
}

void write_byte(unsigned rb, TCGv val)
{
    if (rb < 24) {
        tcg_gen_mov_tl(cpu_rb[rb], val);
    } else {
        unsigned rw = (rb - 24) / 2;
        tcg_gen_deposit_tl(cpu_rw[rw], cpu_rw[rw], val, (rb & 1) * 8, 8);
    }
}

/* Return RB+1:RB.  */
TCGv read_word(unsigned rb)
{
    TCGv word = tcg_temp_new();
    if (rb < 24) {
        tcg_gen_deposit_tl(word, cpu_rb[rb], cpu_rb[rb + 1], 8, 8);
    } else {
        unsigned rw = (rb - 24) / 2;
        tcg_gen_mov_tl(word, cpu_rw[rw]);
    }
    return word;
}

void write_word(unsigned rb, TCGv val)
{
    if (rb < 24) {
        tcg_gen_ext8u_tl(cpu_rb[rb], val);
        tcg_gen_shri_tl(cpu_rb[rb + 1], val, 8);
    } else {
        unsigned rw = (rb - 24) / 2;
        tcg_gen_mov_tl(cpu_rw[rw], val);
    }
}
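
For what it's worth, an instruction such as ADIW could then sit directly on
the word accessors (an illustrative sketch of mine, not the patch's code;
flag handling is omitted):

static void gen_adiw_example(unsigned rd, unsigned imm)
{
    TCGv w = read_word(rd);

    tcg_gen_addi_tl(w, w, imm);
    tcg_gen_andi_tl(w, w, 0xffff);   /* keep the register pair 16 bits wide */
    write_word(rd, w);
    tcg_temp_free(w);
}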


+int avr_translate_DEC(CPUAVRState *env, DisasContext *ctx, uint32_t

...

+    tcg_gen_setcondi_tl(TCG_COND_EQ, cpu_Vf, Rd, 0x7f);  /* cpu_Vf = Rd == 0x7f  */


This is INC overflow.

please explain, I don't see a problem here


You have swapped the overflow conditions for INC and DEC.

127 + 1 -> -128
-128 - 1 -> 127


r~



Re: [Qemu-devel] [PATCH 08/10] target-avr: adding instruction translation

2016-06-05 Thread Michael Rolnik
Please see my answers inline.

On Sun, Jun 5, 2016 at 6:27 AM, Richard Henderson  wrote:

> On 06/02/2016 01:07 PM, Michael Rolnik wrote:
>
>> Signed-off-by: Michael Rolnik 
>> ---
>>  target-avr/translate-inst.c | 2443
>> +++
>>
>
> Is there a reason this code isn't going into translate.c?
> You wouldn't need the declarations in translate-inst.h or translate.h.

I see here two levels of logic
a. instruction translation
b. general flow of program translation.


>
>
>> +/*
>> +NOTE:   all registers are assumed to hold 8 bit values.
>> +so all operations done on registers should preserve this property
>> +*/
>> +
>> +/*
>> +NOTE:   the flags C,H,V,N,V have either 0 or 1 values
>> +NOTE:   the flag Z has inverse logic, when the value of Zf is 0 the flag is assumed to be set, non zero - not set
>> +*/
>>
>
> This documentation belongs with the declarations in cpu.h.

moved.

>
>
>> +void gen_add_CHf(TCGv R, TCGv Rd, TCGv Rr);
>> +void gen_add_Vf(TCGv R, TCGv Rd, TCGv Rr);
>> +void gen_sub_CHf(TCGv R, TCGv Rd, TCGv Rr);
>> +void gen_sub_Vf(TCGv R, TCGv Rd, TCGv Rr);
>> +void gen_ZNSf(TCGv R);
>> +void gen_push_ret(CPUAVRState *env, int ret);
>> +void gen_pop_ret(CPUAVRState *env, TCGv ret);
>> +void gen_jmp_ez(void);
>> +void gen_jmp_z(void);
>> +
>> +void gen_set_addr(TCGv addr, TCGv H, TCGv M, TCGv l); /*  H:M:L = addr  */
>> +void gen_set_xaddr(TCGv addr);
>> +void gen_set_yaddr(TCGv addr);
>> +void gen_set_zaddr(TCGv addr);
>> +
>> +TCGv gen_get_addr(TCGv H, TCGv M, TCGv L);            /*  addr = H:M:L  */
>> +TCGv gen_get_xaddr(void);
>> +TCGv gen_get_yaddr(void);
>> +TCGv gen_get_zaddr(void);
>> +int  sex(int Imm, unsigned bits);
>>
>
> Order these functions properly and you don't need forward declarations.

Is it a requirement? This way it looks cleaner.

>
>
> +
>> +void gen_add_CHf(TCGv R, TCGv Rd, TCGv Rr)
>> +{
>> +    TCGv t1 = tcg_temp_new_i32();
>> +    TCGv t2 = tcg_temp_new_i32();
>> +    TCGv t3 = tcg_temp_new_i32();
>> +
>> +    tcg_gen_and_tl(t1, Rd, Rr);    /*  t1 = Rd & Rr  */
>> +    tcg_gen_not_tl(t2, R);         /*  t2 = Rd & ~R  */
>> +    tcg_gen_and_tl(t2, Rd, t2);
>>
>
> tcg_gen_andc_tl.

thanks

>
>
>> +    tcg_gen_not_tl(t3, R);         /*  t3 = Rr & ~R  */
>> +    tcg_gen_and_tl(t3, Rr, t3);
>>
>
> Likewise.

thanks

>
>
>> +    tcg_gen_or_tl(t1, t1, t2);     /*  t1 = t1 | t2 | t3  */
>> +    tcg_gen_or_tl(t1, t1, t3);
>>
>
> While this is exactly the formula in the manual, it's also equal to
>
> ((Rd ^ Rr) ^ R) & 16
>
Please explain. I don't see it.

http://www.wolframalpha.com/input/?i=A+and+B+or+A+and+not+C+or+B+and+not+C,+A+xor+B+xor+C


>
>
> where we examine the difference between the non-carry addition (Rd ^ Rr)
> and the carry addition (R) to find the carry out of bit 3.  This reduces
> the operation count to 2 from 5.
>
>> +    tcg_gen_shri_tl(cpu_Cf, t1, 7);    /*  Cf = t1(7)  */
>>
>
> Of course, Cf is also bit 8 of R, at least before you masked that off.
>
>> +    tcg_gen_shri_tl(cpu_Hf, t1, 3);    /*  Hf = t1(3)  */
>> +    tcg_gen_andi_tl(cpu_Hf, cpu_Hf, 1);
>>
>
> Consider leaving Hf in bit 3 (or 4, as above) instead of shifting and
> masking to bit 0.  This makes it (slightly) more complicated to read the
> full SREG value, but it saves two instructions per arithmetic instruction.
> This will add up quickly.
>
Yes, you are right. Next version.

>
> One can make the same argument for most of the other flags.
>
> Compare how this is handled in target-arm for cpu_[NZCV]F.
>
>> +void gen_add_Vf(TCGv R, TCGv Rd, TCGv Rr)
>> +{
>> +    TCGv t1 = tcg_temp_new_i32();
>> +    TCGv t2 = tcg_temp_new_i32();
>> +
>> +    tcg_gen_not_tl(t1, Rd);        /*  t1 = ~Rd & ~Rr & R  */
>> +    tcg_gen_not_tl(t2, Rr);
>> +    tcg_gen_and_tl(t1, t1, t2);
>> +    tcg_gen_and_tl(t1, t1, R);
>> +
>> +    tcg_gen_not_tl(t2, R);         /*  t2 = Rd & Rr & ~R  */
>> +    tcg_gen_and_tl(t2, t2, Rd);
>> +    tcg_gen_and_tl(t2, t2, Rr);
>> +
>> +    tcg_gen_or_tl(t1, t1, t2);     /*  t1 = Rd & Rr & ~R | ~Rd & ~Rr & R  */
>>
>
> While this is indeed the formula in the manual, a better expression is
>
>   (R ^ Rd) & ~(Rd ^ Rr)
>
right, thanks.

>
> which is 3 operations instead of 5.  As alluded above, it might be best to
> keep Vf in bit 7 instead of bit 0.
>
> Also note that if you use bit 4 to compute Hf, as above, then one can
> share a sub-expression with the computation of Vf.
>
>> +void gen_sub_CHf(TCGv R, TCGv Rd, TCGv Rr)
>> +{
>> +    TCGv t1 = tcg_temp_new_i32();
>> +    TCGv t2 = tcg_temp_new_i32();
>> +    TCGv t3 = tcg_temp_new_i32();
>> +
>> +    /*  Cf & Hf  */
>> +    tcg_gen_not_tl(t1, Rd);        /*  t1 = ~Rd  */
>> +    tcg_gen_and_tl(t2, t1, Rr);    /*  t2 = ~Rd & Rr  */
>> +

Re: [Qemu-devel] [PATCH 08/10] target-avr: adding instruction translation

2016-06-04 Thread Richard Henderson

On 06/02/2016 01:07 PM, Michael Rolnik wrote:

Signed-off-by: Michael Rolnik 
---
 target-avr/translate-inst.c | 2443 +++


Is there a reason this code isn't going into translate.c?
You wouldn't need the declarations in translate-inst.h or translate.h.


+/*
+NOTE:   all registers are assumed to hold 8 bit values.
+so all operations done on registers should preserve this property
+*/
+
+/*
+NOTE:   the flags C,H,V,N,V have either 0 or 1 values
+NOTE:   the flag Z has inverse logic, when the value of Zf is 0 the flag is assumed to be set, non zero - not set
+*/


This documentation belongs with the declarations in cpu.h.


+void gen_add_CHf(TCGv R, TCGv Rd, TCGv Rr);
+void gen_add_Vf(TCGv R, TCGv Rd, TCGv Rr);
+void gen_sub_CHf(TCGv R, TCGv Rd, TCGv Rr);
+void gen_sub_Vf(TCGv R, TCGv Rd, TCGv Rr);
+void gen_ZNSf(TCGv R);
+void gen_push_ret(CPUAVRState *env, int ret);
+void gen_pop_ret(CPUAVRState *env, TCGv ret);
+void gen_jmp_ez(void);
+void gen_jmp_z(void);
+
+void gen_set_addr(TCGv addr, TCGv H, TCGv M, TCGv l); /*  H:M:L = addr  */
+void gen_set_xaddr(TCGv addr);
+void gen_set_yaddr(TCGv addr);
+void gen_set_zaddr(TCGv addr);
+
+TCGv gen_get_addr(TCGv H, TCGv M, TCGv L);            /*  addr = H:M:L  */
+TCGv gen_get_xaddr(void);
+TCGv gen_get_yaddr(void);
+TCGv gen_get_zaddr(void);
+int  sex(int Imm, unsigned bits);


Order these functions properly and you don't need forward declarations.


+
+void gen_add_CHf(TCGv R, TCGv Rd, TCGv Rr)
+{
+    TCGv t1 = tcg_temp_new_i32();
+    TCGv t2 = tcg_temp_new_i32();
+    TCGv t3 = tcg_temp_new_i32();
+
+    tcg_gen_and_tl(t1, Rd, Rr);    /*  t1 = Rd & Rr  */
+    tcg_gen_not_tl(t2, R);         /*  t2 = Rd & ~R  */
+    tcg_gen_and_tl(t2, Rd, t2);


tcg_gen_andc_tl.


+    tcg_gen_not_tl(t3, R);         /*  t3 = Rr & ~R  */
+    tcg_gen_and_tl(t3, Rr, t3);


Likewise.


+    tcg_gen_or_tl(t1, t1, t2);     /*  t1 = t1 | t2 | t3  */
+    tcg_gen_or_tl(t1, t1, t3);


While this is exactly the formula in the manual, it's also equal to

((Rd ^ Rr) ^ R) & 16

where we examine the difference between the non-carry addition (Rd ^ Rr) and 
the carry addition (R) to find the carry out of bit 3.  This reduces the 
operation count to 2 from 5.
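
A TCG sketch of that form (my illustration, not the patch's code; it leaves Hf
in bit 4, in line with the suggestion further below):

tcg_gen_xor_tl(t1, Rd, Rr);           /* t1 = Rd ^ Rr : sum without carries     */
tcg_gen_xor_tl(t1, t1, R);            /* t1 = (Rd ^ Rr) ^ R : the carry-in bits */
tcg_gen_andi_tl(cpu_Hf, t1, 1 << 4);  /* bit 4 = carry out of bit 3 (Hf)        */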



+    tcg_gen_shri_tl(cpu_Cf, t1, 7);    /*  Cf = t1(7)  */


Of course, Cf is also bit 8 of R, at least before you masked that off.
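
In TCG that is a single operation (illustrative, assuming R has not yet been
masked back to 8 bits):

tcg_gen_shri_tl(cpu_Cf, R, 8);   /* Cf = carry out of bit 7 */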


+    tcg_gen_shri_tl(cpu_Hf, t1, 3);    /*  Hf = t1(3)  */
+    tcg_gen_andi_tl(cpu_Hf, cpu_Hf, 1);


Consider leaving Hf in bit 3 (or 4, as above) instead of shifting and masking 
to bit 0.  This makes it (slightly) more complicated to read the full SREG 
value, but it saves two instructions per arithmetic instruction.  This will add 
up quickly.


One can make the same argument for most of the other flags.

Compare how this is handled in target-arm for cpu_[NZCV]F.
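
A minimal sketch of the read side under that scheme (my illustration; the
helper name and the choice of keeping Hf in bit 3 are assumptions):

/* Fold a lazily-kept Hf (left in bit 3 by the flag generators) into a full
   SREG value; the shift/mask cost is paid only here, not on every
   arithmetic instruction.  */
static void gen_fold_Hf(TCGv sreg)
{
    TCGv t = tcg_temp_new();

    tcg_gen_andi_tl(t, cpu_Hf, 1 << 3);  /* isolate Hf where it was left */
    tcg_gen_shli_tl(t, t, 2);            /* SREG.H lives in bit 5        */
    tcg_gen_or_tl(sreg, sreg, t);
    tcg_temp_free(t);
}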


+void gen_add_Vf(TCGv R, TCGv Rd, TCGv Rr)
+{
+    TCGv t1 = tcg_temp_new_i32();
+    TCGv t2 = tcg_temp_new_i32();
+
+    tcg_gen_not_tl(t1, Rd);        /*  t1 = ~Rd & ~Rr & R  */
+    tcg_gen_not_tl(t2, Rr);
+    tcg_gen_and_tl(t1, t1, t2);
+    tcg_gen_and_tl(t1, t1, R);
+
+    tcg_gen_not_tl(t2, R);         /*  t2 = Rd & Rr & ~R  */
+    tcg_gen_and_tl(t2, t2, Rd);
+    tcg_gen_and_tl(t2, t2, Rr);
+
+    tcg_gen_or_tl(t1, t1, t2);     /*  t1 = Rd & Rr & ~R | ~Rd & ~Rr & R  */


While this is indeed the formula in the manual, a better expression is

  (R ^ Rd) & ~(Rd ^ Rr)

which is 3 operations instead of 5.  As alluded above, it might be best to keep 
Vf in bit 7 instead of bit 0.
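
The equivalence is easy to confirm exhaustively in the style of the earlier
test program (again a sketch of mine, not part of the review):

#include <stdio.h>

/* Signed overflow of an 8-bit add: the manual's formula vs.
   (R ^ Rd) & ~(Rd ^ Rr), both sampled at bit 7.  */
static int orig_v(int r, int d, int s)
{
    return (((d & s & ~r) | (~d & ~s & r)) & 0x80) != 0;
}

static int mine_v(int r, int d, int s)
{
    return (((r ^ d) & ~(d ^ s)) & 0x80) != 0;
}

int main()
{
    int s, d;
    for (s = 0; s < 256; ++s)
        for (d = 0; d < 256; ++d) {
            int r = (d + s) & 0xff;
            if (orig_v(r, d, s) != mine_v(r, d, s))
                printf("%d + %d (r=%d)\n", d, s, r);
        }
    return 0;   /* silent: the two expressions agree */
}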


Also note that if you use bit 4 to compute Hf, as above, then one can share a 
sub-expression with the computation of Vf.



+void gen_sub_CHf(TCGv R, TCGv Rd, TCGv Rr)
+{
+    TCGv t1 = tcg_temp_new_i32();
+    TCGv t2 = tcg_temp_new_i32();
+    TCGv t3 = tcg_temp_new_i32();
+
+    /*  Cf & Hf  */
+    tcg_gen_not_tl(t1, Rd);            /*  t1 = ~Rd  */
+    tcg_gen_and_tl(t2, t1, Rr);        /*  t2 = ~Rd & Rr  */
+    tcg_gen_or_tl(t3, t1, Rr);         /*  t3 = (~Rd | Rr) & R  */
+    tcg_gen_and_tl(t3, t3, R);
+    tcg_gen_or_tl(t2, t2, t3);         /*  t2 = ~Rd & Rr | ~Rd & R | R & Rr  */
+    tcg_gen_shri_tl(cpu_Cf, t2, 7);    /*  Cf = t2(7)  */
+    tcg_gen_shri_tl(cpu_Hf, t2, 3);    /*  Hf = t2(3)  */
+    tcg_gen_andi_tl(cpu_Hf, cpu_Hf, 1);


Note that carry and borrow are related, and thus this is *also* computable via 
((Rd ^ Rr) ^ R) on bit 4.
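
In TCG terms the subtraction case reduces the same way (again my sketch, not
the patch's code):

tcg_gen_xor_tl(t1, Rd, Rr);           /* t1 = Rd ^ Rr : difference without borrows */
tcg_gen_xor_tl(t1, t1, R);            /* t1 = (Rd ^ Rr) ^ R : the borrow-in bits   */
tcg_gen_andi_tl(cpu_Hf, t1, 1 << 4);  /* bit 4 = borrow out of bit 3 (Hf)          */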



+void gen_sub_Vf(TCGv R, TCGv Rd, TCGv Rr)
+{
+    TCGv t1 = tcg_temp_new_i32();
+    TCGv t2 = tcg_temp_new_i32();
+
+    /*  Vf  */
+    tcg_gen_and_tl(t1, Rr, R);     /*  t1 = Rd & ~Rr & ~R  */
+    tcg_gen_not_tl(t1, t1);
+    tcg_gen_and_tl(t1, t1, Rd);
+    tcg_gen_not_tl(t2, Rd);        /*  t2 = ~Rd &