R: R: R: About hardfloat in ppc

2020-04-30 Thread Dino Papararo
Maybe the fastest way to implement hardfloat for ppc would be to run with it by 
default until some FPU instruction requests the FPSCR register.
At that point we probably want to check for some exception, so QEMU could go 
back to the last FPU instruction executed and re-execute it in softfloat, this 
time taking care of the FPSCR flags, then continue in hardfloat until another 
instruction asks for the FPSCR register, and so on.

Dino

-Original Message-
From: BALATON Zoltan
Sent: Thursday, 30 April 2020 17:36
To: 罗勇刚(Yonggang Luo)
Cc: Richard Henderson; Dino Papararo; qemu-devel@nongnu.org; Programmingkid; 
qemu-...@nongnu.org; Howard Spoelstra; Alex Bennée
Subject: Re: R: R: About hardfloat in ppc

On Thu, 30 Apr 2020, 罗勇刚(Yonggang Luo) wrote:
> I propose a new way of computing the float flags. We preserve a float
> computing cache:
>
> typedef struct FpRecord {
>     uint8_t op;
>     float32 A;
>     float32 B;
> } FpRecord;
> FpRecord fp_cache[1024];
> int fp_cache_length;
> uint32_t fp_exceptions;
>
> 1. For each new fp operation we push it to the fp_cache.
> 2. Once we read fp_exceptions, we re-compute fp_exceptions by
>    re-running the FpRecord sequence and clear fp_cache_length.
> 3. If we clear fp_exceptions, then we set fp_cache_length to 0
>    and clear fp_exceptions.
> 4. If the fp_cache is full, then we re-compute fp_exceptions by
>    re-running the FpRecord sequence.
>
> Would this be a general method to use hard-float?
> The consumed time should be 2 * hard_float.
> Considering that reads of fp_exceptions are rare, the amortized time
> complexity would be 1 * hard_float.
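
One possible reading of the proposal above, as a minimal C sketch. The names fp_exec, fp_replay, fp_read_exceptions, fp_clear_exceptions and fp_slow_flags, the two opcodes and the single FP_FLAG_INEXACT bit are assumptions made for illustration; none of them are QEMU APIs. fp_slow_flags only stands in for the softfloat re-run: it detects inexact results for finite inputs and nothing else. Treating the flags as sticky bits that are OR-ed together is also an assumption.

```c
#include <assert.h>
#include <stdint.h>

enum { FP_OP_ADD, FP_OP_MUL };          /* hypothetical opcodes */
enum { FP_FLAG_INEXACT = 1u << 0 };     /* hypothetical flag bit */

typedef struct FpRecord {
    uint8_t op;
    float a;
    float b;
} FpRecord;

#define FP_CACHE_SIZE 1024
FpRecord fp_cache[FP_CACHE_SIZE];
int fp_cache_length;
uint32_t fp_exceptions;

/* Stand-in for the softfloat re-run. It only detects inexact, and only
 * for finite inputs whose result does not overflow. */
uint32_t fp_slow_flags(const FpRecord *r)
{
    if (r->op == FP_OP_MUL) {
        /* A 24x24-bit significand product is exact in a double. */
        double exact = (double)r->a * (double)r->b;
        return (double)(float)exact != exact ? FP_FLAG_INEXACT : 0;
    }
    /* FP_OP_ADD: Knuth's TwoSum recovers the rounding error of a + b. */
    volatile float s = r->a + r->b;
    volatile float bv = s - r->a;
    volatile float av = s - bv;
    volatile float err = (r->a - av) + (r->b - bv);
    return err != 0.0f ? FP_FLAG_INEXACT : 0;
}

/* Steps 2 and 4: replay the log to bring fp_exceptions up to date. */
void fp_replay(void)
{
    for (int i = 0; i < fp_cache_length; i++) {
        fp_exceptions |= fp_slow_flags(&fp_cache[i]);
    }
    fp_cache_length = 0;
}

/* Step 1: log the operands, then compute on the host FPU. */
float fp_exec(uint8_t op, float a, float b)
{
    if (fp_cache_length == FP_CACHE_SIZE) {
        fp_replay();                    /* step 4: cache is full */
    }
    assert(fp_cache_length < FP_CACHE_SIZE);
    fp_cache[fp_cache_length++] = (FpRecord){ op, a, b };
    return op == FP_OP_MUL ? a * b : a + b;
}

/* Step 2: a flag read forces the replay. */
uint32_t fp_read_exceptions(void)
{
    fp_replay();
    return fp_exceptions;
}

/* Step 3: clearing the flags also discards the pending log. */
void fp_clear_exceptions(void)
{
    fp_cache_length = 0;
    fp_exceptions = 0;
}
```

In this sketch the fast path pays for one store into the log per operation, and the replay cost is paid only when the guest reads the flags or the log fills up.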

It's hard to guess what the hit rate of such a cache would be, and if it's low 
then managing the cache is probably more expensive than running with softfloat. 
So to evaluate any proposed patch we also need some benchmarks that we can 
experiment with to tell whether the results are good; otherwise we're just 
guessing. Are there existing tests and benchmarks that we can use? Alex 
mentioned fp-bench, I think, and for evaluating the correctness of the FP 
implementation I've seen this other conversation:

https://lists.nongnu.org/archive/html/qemu-devel/2020-04/msg05107.html
https://lists.nongnu.org/archive/html/qemu-devel/2020-04/msg05126.html

Is that something we can use for PPC as well to check the correctness?

So I think before implementing any potential solution that came up in this 
brainstorming, the first step would be to get and compile (or write, if not 
available) some tests and benchmarks:

1. Tests of host behaviour for inexact, compared across different archs.
2. Some FP tests that can be used to compare results between QEMU and a real 
   CPU to check correctness of emulation (if these check for inexact 
   differences then they could be used instead of 1.).
3. Some benchmarks to evaluate QEMU performance (these could be the same as 
   the FP tests or some real-world FP-heavy applications).

Then we can see if the proposed solution is faster and still correct.

Regards,
BALATON Zoltan


R: R: About hardfloat in ppc

2020-04-29 Thread Dino Papararo
Hi Alex,
maybe some pseudocode can better show what I mean:

if (ppc_fpu_instruction == USE_FPSCR)
    /* the instruction has a dot '.', so FPSCR will be updated and we need
       to take care of it */
    soft_decode(ppc_fpu_instruction)
else
    /* the instruction has no dot '.', so FPSCR will never be updated and
       we don't need to take care of it -> max speed */
    hard_decode(ppc_fpu_instruction)
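
As a rough illustration of the dispatch described above: on PowerPC the '.' suffix corresponds to the Rc (record) bit, the least significant bit of the 32-bit instruction word. The sketch below tests that bit. FpPath and choose_fp_path are hypothetical names, and checking only primary opcodes 59 and 63 is a simplification, since those groups also contain non-arithmetic instructions. Note that for floating point instructions Rc=1 copies FPSCR bits 0-3 into CR1, while the FPSCR status bits are updated by both forms, so this shows only the proposed policy and is not a complete correctness argument.

```c
#include <stdbool.h>
#include <stdint.h>

/* Primary opcode: the top 6 bits of the 32-bit instruction word. */
static inline uint32_t ppc_primary_opcode(uint32_t insn)
{
    return insn >> 26;
}

/* Rc bit: the least significant bit, set by the '.' suffix. */
static inline bool ppc_rc_bit(uint32_t insn)
{
    return (insn & 1u) != 0;
}

typedef enum { FP_PATH_HARD, FP_PATH_SOFT } FpPath;   /* hypothetical */

/* Proposed policy: softfloat only for record-form FP instructions.
 * Simplification: any instruction in opcode groups 59/63 counts as FP. */
FpPath choose_fp_path(uint32_t insn)
{
    uint32_t opc = ppc_primary_opcode(insn);
    bool is_fp = (opc == 59 || opc == 63);
    return (is_fp && ppc_rc_bit(insn)) ? FP_PATH_SOFT : FP_PATH_HARD;
}
```

With these definitions, 0xFC01102A (fadd f0,f1,f2) takes the hard path and 0xFC01102B (fadd. f0,f1,f2) takes the soft path.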

In ppc assembly, all instructions that need to take care of the inexact flag 
and/or exception flags are executed before the test instructions. Look at the 
following exception handling example:

   fadd. f0,f1,f2 # f0 = f1 + f2; CR1 contains the exception summary
   bta   4,error  # if bit 0 of CR1 is set, go to error
                  # (bit 0 is set if any exception occurs)
   .              # if clear, continue operation
   .
   .
error:
   mcrfs 2,1   # copy FPSCR bits 4-7 to CR field 2
   # now CR1 and CR2 (bits 6 through 10)
   # contain all exception bits from FPSCR
   bta   6,invalid   # CR bit 6 signals invalid
   bta   7,overflow  # CR bit 7 signals overflow
   bta   8,underflow # CR bit 8 signals underflow
   bta   9,divbyzero # CR bit 9 signals divide-by-zero
   bta   10,inexact  # CR bit 10 signals inexact

invalid:
   mcrfs 2,2   # copy FPSCR bits 8-11 to CR field 2
   mcrfs 3,3   # copy FPSCR bits 12-15 to CR field 3
   mcrfs 4,5   # copy FPSCR bits 20-23 to CR field 4
   # invalid bits are now CR bits 11-16 and bit 23

   # now do exception handling based on which invalid bit
   # is set

overflow:
   # do exception handling for overflow exception

underflow:
   # do exception handling for underflow exception

divbyzero:
   # do exception handling for the divide-by-zero exception

inexact:
   # do exception handling for the inexact exception

In this way you can know as soon as possible whether you can go with hardfloat 
or not.
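
The bit shuffling in the assembly example can be modelled in a few lines of C. This is only a sketch: the helper names (ibm_bit, reg_get_field, reg_set_field, mcrfs_copy, cr_test_bit) are made up for illustration, and mcrfs_copy models only the copy into the CR field, ignoring any side effect the real instruction has on FPSCR. Bit numbering follows the IBM convention used in the comments above, where bit 0 is the most significant bit.

```c
#include <assert.h>
#include <stdint.h>

/* FPSCR bit positions in IBM numbering (bit 0 = most significant). */
enum {
    FPSCR_FX = 0, FPSCR_FEX = 1, FPSCR_VX = 2, FPSCR_OX = 3,
    FPSCR_UX = 4, FPSCR_ZX = 5, FPSCR_XX = 6, FPSCR_VXSNAN = 7
};

static inline uint32_t ibm_bit(int n)
{
    return 1u << (31 - n);
}

/* Extract 4-bit field n (0..7) of a 32-bit register: bits 4n .. 4n+3. */
uint32_t reg_get_field(uint32_t reg, int n)
{
    assert(n >= 0 && n < 8);
    return (reg >> (28 - 4 * n)) & 0xFu;
}

/* Return reg with 4-bit field n replaced by val. */
uint32_t reg_set_field(uint32_t reg, int n, uint32_t val)
{
    assert(n >= 0 && n < 8);
    int shift = 28 - 4 * n;
    return (reg & ~(0xFu << shift)) | ((val & 0xFu) << shift);
}

/* Copy-only model of "mcrfs crfD,crfS". */
uint32_t mcrfs_copy(uint32_t cr, uint32_t fpscr, int crfD, int crfS)
{
    return reg_set_field(cr, crfD, reg_get_field(fpscr, crfS));
}

int cr_test_bit(uint32_t cr, int n)
{
    return (cr & ibm_bit(n)) != 0;
}
```

For an FPSCR with FX and XX set, copying field 0 into CR1 sets CR bit 4 (the bit tested by "bta 4,error"), and mcrfs_copy(cr, fpscr, 2, 1) then sets CR bit 10, the inexact bit tested by "bta 10,inexact".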

I leave it to you TCG experts how it works and how to implement it; I'm only 
trying to explain a possible fast way to go (if at all possible). 
The large majority of software doesn't check for exceptions at all, and if I 
really want to pursue maximum precision I'll go for a software multiprecision 
library like GMP or MPFR.
So hardfloat 'should' be the first choice, and only if an instruction 
requires a precision/error check should it be processed in softfloat.

I hope I have added some new ideas to the discussion. Thanks a lot, Alex!

Dino

-Original Message-
From: Alex Bennée
Sent: Wednesday, 29 April 2020 13:57
To: Dino Papararo
Cc: luoyongg...@gmail.com; BALATON Zoltan; Mark Cave-Ayland; Programmingkid; 
Howard Spoelstra; qemu-...@nongnu.org; qemu-devel@nongnu.org
Subject: Re: R: About hardfloat in ppc


Dino Papararo  writes:

> Hello,
> regarding the handling of PPC FPU exceptions and hardfloat support, we could 
> consider a different approach for different instructions.
> I.e. not all FPU instructions care about the inexact or exception bits: if 
> I take a simple fadd f0,f1,f2 I'll copy value derived from adding f1+f2 into 
> f1 register and no one will check the inexact or exception bits raised in 
> the FPSCR register.
> If instead I take fadd. f0,f1,f2, the dot following the add instruction 
> means I want to take the inexact or exception bits into account.
> So I could use hardfloat for the first case and softfloat for the second.
> Could this be a fast way to start implementing hardfloat for PPC?

While it may be true that normal software practice is not to read the exception 
registers for every operation we can't base our emulation on that. We must 
always be able to re-create the state of the exception registers whenever they 
may be read by the program. There are 3 cases this may happen:

  - a direct read of the inexact register
  - checking the sigcontext of a synchronous exception (e.g. fault)
  - checking the sigcontext of an asynchronous exception (e.g. timer/IPI)

Given the way the translator works we can simplify the asynchronous case 
because we know they are only ever delivered at the start of translated blocks. 
We must have a fully rectified system state at the end of every block. So let's 
consider some cases:

  fpOpA
  clear flags
  fpOpB
  clear flags
  fpOpC
  read flags

Assuming we know the fpOps can't generate exceptions, we know that only 
fpOpC will ever generate user-visible floating point flags, so we can indeed 
use hardfloat for fpOpA and fpOpB. However, if we see the pattern:

  fpOpA
  ld/st
  clear flags
  fpOpB
  read flags

we must have the fully rectified version of the flags because the ld/st may 
fault. However it's not guaranteed it will fault so we could defer the flag 
calculation for fpOpA until such time as we need it. The easiest way would be 
to save the values going into the operation and then re-run it in softfloat 
when required (hopefully never ;-).
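
The two patterns above amount to a backward liveness pass over the ops of one translated block. The following is a minimal sketch of that idea, not how TCG does it: OpKind and mark_flag_users are hypothetical names, flags are assumed to be live at the end of the block (the "fully rectified state" requirement), and FP ops are assumed not to trap themselves.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {           /* hypothetical op kinds within one block */
    OP_FP,               /* floating point arithmetic */
    OP_CLEAR_FLAGS,      /* guest clears the FP exception flags */
    OP_READ_FLAGS,       /* guest reads the FP exception flags */
    OP_LDST              /* load/store: may fault and expose the flags */
} OpKind;

/* Walk the block backwards. needs_flags[i] is set for every OP_FP whose
 * flags might be observed before they are cleared. */
void mark_flag_users(const OpKind *ops, int n, bool *needs_flags)
{
    bool live = true;    /* state must be complete at the end of the block */

    assert(n >= 0);
    for (int i = n - 1; i >= 0; i--) {
        needs_flags[i] = false;
        switch (ops[i]) {
        case OP_READ_FLAGS:
        case OP_LDST:
            live = true;         /* earlier flags may become visible here */
            break;
        case OP_CLEAR_FLAGS:
            live = false;        /* earlier flags are discarded here */
            break;
        case OP_FP:
            needs_flags[i] = live;
            break;
        }
    }
}
```

For the first pattern only fpOpC is marked. For the second, fpOpA is marked as well because of the ld/st, although its flag computation could still be deferred until the fault actually happens.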

A lot will depend on the behaviour of the architecture. For example:

  fpOpA
  fpOpB
  read flag

R: About hardfloat in ppc

2020-04-29 Thread Dino Papararo
Typo correction  

" if I take a simple fadd f0,f1,f2 I'll copy value derived from adding f1+f2 
into f0 register"

-Original Message-
From: Qemu-ppc, on behalf of Dino Papararo
Sent: Wednesday, 29 April 2020 12:18
To: Alex Bennée; luoyongg...@gmail.com; BALATON Zoltan; Mark Cave-Ayland; 
Programmingkid; Howard Spoelstra
Cc: qemu-...@nongnu.org; qemu-devel@nongnu.org
Subject: R: About hardfloat in ppc


R: About hardfloat in ppc

2020-04-29 Thread Dino Papararo
Hello,
regarding the handling of PPC FPU exceptions and hardfloat support, we could 
consider a different approach for different instructions.
I.e. not all FPU instructions care about the inexact or exception bits: if I 
take a simple fadd f0,f1,f2 I'll copy value derived from adding f1+f2 into f1 
register and no one will check the inexact or exception bits raised in the 
FPSCR register.
If instead I take fadd. f0,f1,f2, the dot following the add instruction 
means I want to take the inexact or exception bits into account.
So I could use hardfloat for the first case and softfloat for the second.
Could this be a fast way to start implementing hardfloat for PPC?

A little documentation here: 
http://mirror.informatimago.com/next/developer.apple.com/documentation/mac/PPCNumerics/PPCNumerics-154.html

Regards,
Dino Papararo

-Original Message-
From: Qemu-devel, on behalf of Alex Bennée
Sent: Tuesday, 28 April 2020 10:37
To: luoyongg...@gmail.com
Cc: qemu-...@nongnu.org; qemu-devel@nongnu.org
Subject: Re: About hardfloat in ppc


罗勇刚(Yonggang Luo)  writes:

> I am confused about why we can use hard-float only when inexact is set.

The inexact behaviour of the host hardware may be different from the guest 
architecture we are trying to emulate and the host hardware may not be 
configurable to emulate the guest mode.

Have a look in softfloat.c and see all the places where float_flag_inexact is 
set. Can you convince yourself that the host hardware will do the same?

> And PPC always clears the inexact flag before calling the soft-float 
> functions, so we cannot optimize it with hard-float.
> I need some resources about the inexact flag and why it is always cleared 
> in the PPC FP simulation.

Because that is the behaviour of the PPC floating point unit. The inexact flag 
will represent the last operation done.

> I am looking at two possible solutions:
> 1. do not clear the inexact flag in the PPC simulation
> 2. even if inexact is cleared, still use an alternative hard-float.
>
> But for now I am a beginner and have no clue about all these things.

Well you'll need to learn about floating point because these are rather 
fundamental aspects of its behaviour. In the old days QEMU used to use the 
host floating point processor with its template-based translation.
However this led to lots of weird bugs because the floating point answers under 
QEMU were different from the target it was trying to emulate. It was for this 
reason softfloat was introduced. The hardfloat optimisation can only be done 
when we are confident that we will get the exact same answer as the target we 
are trying to emulate - a "faster but incorrect" mode is just going to cause 
confusion as discussed in the previous thread. Have you read that yet?

>
> On Mon, Apr 27, 2020 at 7:10 PM Alex Bennée  wrote:
>
>>
>> BALATON Zoltan  writes:
>>
>> > On Mon, 27 Apr 2020, Alex Bennée wrote:
>> >> 罗勇刚(Yonggang Luo)  writes:
>> >>> Because the ppc fpu-helpers always clear float_flag_inexact, is it 
>> >>> possible to optimize the performance when float_flag_inexact is 
>> >>> cleared?
>> >>
>> >> There was some discussion about this in the last thread about 
>> >> enabling hardfloat for PPC. See the thread:
>> >>
>> >>  Subject: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC
>> >>  Date: Tue, 18 Feb 2020 18:10:16 +0100
>> >>  Message-Id: <20200218171702.979f0746...@zero.eik.bme.hu>
>> >
>> > I've answered this already with link to that thread here:
>> >
>> > On Fri, 10 Apr 2020, BALATON Zoltan wrote:
>> > : Date: Fri, 10 Apr 2020 20:04:53 +0200 (CEST)
>> > : From: BALATON Zoltan 
>> > : To: "罗勇刚(Yonggang Luo)" 
>> > : Cc: qemu-devel@nongnu.org, Mark Cave-Ayland, John Arbuckle,
>> > :   qemu-...@nongnu.org, Paul Clarke, Howard Spoelstra, David Gibson
>> > : Subject: Re: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC
>> > :
>> > : On Fri, 10 Apr 2020, 罗勇刚(Yonggang Luo) wrote:
>> > :> Is this stable now? I'd like to see hard float landed :)
>> > :
>> > : If you want to see hardfloat for PPC then you should read the 
>> > : replies to this patch which can be found here:
>> > :
>> > : http://patchwork.ozlabs.org/patch/1240235/
>> > :
>> > : to understand what's needed then try to implement the solution 
>> > : with FP exceptions cached in a global that maybe could work. I 
>> > : won't be able to do that as said here:
>> > :
>> > : https://lists.nongnu.org/archive/html/qemu-ppc/2020-03/msg6.html
>> > :
>> > : because I don't have time t

R: R: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC

2020-02-26 Thread Dino Papararo
I think we all agree the best solution is to resolve the powerpc issues with 
the current hardfloat implementation.
I also think powerpc is an important branch of QEMU, for historical, present 
and (why not?) future reasons, and it must NOT be left behind.
So I would invite the QEMU community's most skilled programmers to work on 
this and solve the issue, maybe in a few days.
The same group who worked on the recent AltiVec optimizations is able to make 
a good patch for this too.

As a fallback, I'd still like to implement hardfloat support for powerpc 
anyway, warning users about the inaccuracy of results/flags and letting them 
choose.
Of course I understand, and in part agree with, all your objections. 
I simply prefer to always have a choice.

Best Regards,
Dino Papararo

-Original Message-
From: Aleksandar Markovic
Sent: Wednesday, 26 February 2020 18:27
To: G 3
Cc: Alex Bennée; Dino Papararo; QEMU Developers; qemu-...@nongnu.org; 
Howard Spoelstra; luigi burdo; David Gibson
Subject: Re: R: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC

On Wed, Feb 26, 2020 at 6:04 PM G 3  wrote:
>
> Accuracy is an important part of the IEEE 754 floating point standard. The 
> whole purpose of this standard is to ensure floating point calculations are 
> consistent across multiple CPUs. I believe referring to this patch as 
> inaccurate is itself inaccurate. That gives the impression that this patch 
> produces calculations that are not in line with established standards. This is 
> not true. The only part of this patch that will produce incorrect values are 
> the flags. There *may* be a program or two out there that depend on these 
> flags, but for the majority of programs that only care about basic floating 
> point arithmetic this patch will produce correct values. Currently the 
> emulated PowerPC's FPU already produces wrong values for the flags. This 
> patch does set the Inexact flag (which I don't like), but since I have never 
> encountered any source code that cares for this flag, I can let it go. I 
> think giving the user the ability to decide which option to use is the best 
> thing to do.
>

From the experiments described above, the patch in question changes the 
behavior of applications (for example, sound is different with and without the 
patch), which is in contradiction with your claim that you "never encountered 
any source code that cares for this flag" and that "the only part of this patch 
that will produce incorrect values are the flags".

In other words, and playing further with them:

The claim that "referring to this patch as inaccurate is itself inaccurate" is 
itself inaccurate.

Best regards,
Aleksandar


> On Wed, Feb 26, 2020 at 10:51 AM Aleksandar Markovic 
>  wrote:
>>
>>
>>
>> On Wed, Feb 26, 2020 at 3:29 PM Alex Bennée  wrote:
>> >
>> >
>> > Dino Papararo  writes:
>> >
>> > > Please let's go with hardfloat ppc support, it's really a good feature 
>> > > to implement.
>> > > Even if at first it could lead to inaccurate results, 
>> > > later this could be solved with other patches.
>> >
>> > That's the wrong way around. We have regression tests for a reason.
>>
>> I tend to agree with Alex here, and additionally want to expand more 
>> on this topic.
>>
>> In my view: (that I think is at least very close to the community 
>> consensus)
>>
>> This is *not* a ppc-specific issue. There exists a principle across 
>> all targets that QEMU FPU calculation must be accurate - exactly as 
>> specified in any applicable particular ISA document. Any discrepancy is an 
>> outright bug.
>>
>> We even recently had several patches for FPU in ppc target that 
>> handled some fairly obscure cases of inaccuracies, I believe they 
>> were authored by Paul Clarke, so there are people in ppc community 
>> that care about FPU accuracy (as I guess is the case for any target).
>>
>> There shouldn't be a target that decides by itself and within itself 
>> "ok, we don't need accuracy, let's trade it for speed". This violates 
>> the architecture of QEMU. Please allow that for any given software 
>> project, there is an architecture that should be respected.
>>
>> This doesn't mean that anybody's experimentation is discouraged. 
>> No-one can stop anybody from forking from QEMU upstream tree and do 
>> whatever is wanted.
>>
>> But, this doesn't mean such experimentation will be upstreamed. QEMU 
>> upstream should be a collecting place for the best ideas and 
>> implementations, not for arbitrary experimentation.
>>
>> Best regards,
>> Aleksandar
>>
>>
>> > I'll happily acc

R: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC

2020-02-26 Thread Dino Papararo
Please let's go with hardfloat ppc support, it's really a good feature to 
implement.
Even if at first it could lead to inaccurate results, later this could be 
solved with other patches.

I think it's important for QEMU to be as global as possible and not target 
only recent hardware.

Regards,
Dino Papararo

From: Qemu-ppc, on behalf of luigi burdo
Sent: Wednesday, 26 February 2020 14:01
To: BALATON Zoltan; Programmingkid
Cc: David Gibson; qemu-...@nongnu.org; qemu-devel; Howard Spoelstra
Subject: R: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC

Hi Zoltan,
I can say that Mac OS X Leopard uses multiple cores on the PowerMac G5 Quad. 
Most apps made for Panther/Tiger/Leopard certainly use 2 cores in SMP; only 
apps made for Tiger/Leopard use more than 2 cores.
Ciao and thanks
 Luigi



From: Qemu-ppc <qemu-ppc-bounces+intermediadc=hotmail@nongnu.org>
 on behalf of BALATON Zoltan <bala...@eik.bme.hu>
Sent: Wednesday, 26 February 2020 12:28
To: Programmingkid <programmingk...@gmail.com>
Cc: Howard Spoelstra <hsp.c...@gmail.com>; qemu-...@nongnu.org; 
qemu-devel <qemu-devel@nongnu.org>; David Gibson <da...@gibson.dropbear.id.au>
Subject: Re: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC

On Wed, 26 Feb 2020, Programmingkid wrote:
> I think a timeout takes place and that is why audio stops playing. It is
> probably a USB OHCI issue. The other USB controller seems to work
> better.

Which other USB controller? Maybe you could try enabling some usb_ohci*
traces and see if they reveal anything.

>> The Amiga like OSes I'm interested in don't use multiple cores so I'm
>> mainly interested in improving single core performance. Also I'm not
>> sure if (part of) your problem is slow FPU preventing fast enough audio
>> decoding then having multiple CPUs with slow FPU would help as this may
>> use a single thread anyway.
>
> Good point. MTTCG might be the option that really helps with speed 
> improvements.

Only if you have a multithreaded workload in the guest, because AFAIK MTTCG
only runs different vcpus in parallel; it won't make a single emulated CPU
faster in any way. OSX can probably benefit from having multiple cores
emulated but I don't think MacOS would use them, apart from some apps maybe.

Regards,
BALATON Zoltan


[Qemu-devel] R: [PATCH v5 8/8] target/ppc: remove various HOST_WORDS_BIGENDIAN hacks in int_helper.c

2019-02-03 Thread Dino Papararo
Hello Mark,
I have a question about improving speed by manually unrolling loops like this.

Assuming ARRAY_SIZE(r->u8) is always a multiple of 4, you can manually improve 
the loop in this way; on modern CPUs, independent instructions can be computed 
nearly for free:

> {
>     int i, j = (sh & 0xf);
>
> -    VECTOR_FOR_INORDER_I(i, u8) {
> -        r->u8[i] = j++;
> +    for (i = 0; i < ARRAY_SIZE(r->u8); i += 4, j += 4) {
> +        r->VsrB(i) = j;
> +        r->VsrB(i + 1) = j + 1;
> +        r->VsrB(i + 2) = j + 2;
> +        r->VsrB(i + 3) = j + 3;
>      }
> }

In this patch there are a lot of functions that could benefit from loop 
unrolling, potentially with a large speed improvement.
Maybe the compiler could do it by itself, but aren't humans still better? 
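
To make the comparison concrete, here is a small self-contained sketch of both loop shapes. A flat uint8_t array stands in for r->VsrB(i), and fill_simple, fill_unrolled and VEC_BYTES are names invented for this example. It only shows that the two forms produce the same bytes; whether the unrolled form is actually faster is an open question, since optimizing compilers commonly unroll short fixed-count loops on their own, so it would need to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES 16   /* stands in for ARRAY_SIZE(r->u8) */

/* Plain loop, in the shape used by the patch. */
void fill_simple(uint8_t out[VEC_BYTES], unsigned sh)
{
    int j = sh & 0xf;
    for (size_t i = 0; i < VEC_BYTES; i++) {
        out[i] = (uint8_t)j++;
    }
}

/* Manually unrolled by four, in the shape proposed above. */
void fill_unrolled(uint8_t out[VEC_BYTES], unsigned sh)
{
    int j = sh & 0xf;
    assert(VEC_BYTES % 4 == 0);   /* the unrolling relies on this */
    for (size_t i = 0; i < VEC_BYTES; i += 4, j += 4) {
        out[i]     = (uint8_t)j;
        out[i + 1] = (uint8_t)(j + 1);
        out[i + 2] = (uint8_t)(j + 2);
        out[i + 3] = (uint8_t)(j + 3);
    }
}
```

Comparing the generated assembly of both functions at the optimization level QEMU is built with is a quick way to see whether the manual version changes anything.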

Best Regards,
Dino Papararo

-Original Message-
From: Qemu-devel, on behalf of Mark Cave-Ayland
Sent: Wednesday, 30 January 2019 21:37
To: qemu-devel@nongnu.org; qemu-...@nongnu.org; richard.hender...@linaro.org; 
da...@gibson.dropbear.id.au
Subject: [Qemu-devel] [PATCH v5 8/8] target/ppc: remove various 
HOST_WORDS_BIGENDIAN hacks in int_helper.c

Following on from the previous work, there are numerous endian-related hacks in 
int_helper.c that can now be replaced with Vsr* macros.

There are also a few places where the VECTOR_FOR_INORDER_I macro can be 
replaced with a normal iterator since the processing order is irrelevant.

Signed-off-by: Mark Cave-Ayland 
Reviewed-by: Richard Henderson 
---
 target/ppc/int_helper.c | 155 ++--
 1 file changed, 45 insertions(+), 110 deletions(-)

diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c
index 916d10c25b..8efc283388 100644
--- a/target/ppc/int_helper.c
+++ b/target/ppc/int_helper.c
@@ -443,8 +443,8 @@ void helper_lvsl(ppc_avr_t *r, target_ulong sh)
 {
     int i, j = (sh & 0xf);
 
-    VECTOR_FOR_INORDER_I(i, u8) {
-        r->u8[i] = j++;
+    for (i = 0; i < ARRAY_SIZE(r->u8); i++) {
+        r->VsrB(i) = j++;
     }
 }
 
@@ -452,18 +452,14 @@ void helper_lvsr(ppc_avr_t *r, target_ulong sh)
 {
     int i, j = 0x10 - (sh & 0xf);
 
-    VECTOR_FOR_INORDER_I(i, u8) {
-        r->u8[i] = j++;
+    for (i = 0; i < ARRAY_SIZE(r->u8); i++) {
+        r->VsrB(i) = j++;
     }
 }
 
 void helper_mtvscr(CPUPPCState *env, ppc_avr_t *r)
 {
-#if defined(HOST_WORDS_BIGENDIAN)
-    env->vscr = r->u32[3];
-#else
-    env->vscr = r->u32[0];
-#endif
+    env->vscr = r->VsrW(3);
     set_flush_to_zero(vscr_nj, &env->vec_status);
 }
 
@@ -870,8 +866,8 @@ target_ulong helper_vclzlsbb(ppc_avr_t *r)
 {
     target_ulong count = 0;
     int i;
-    VECTOR_FOR_INORDER_I(i, u8) {
-        if (r->u8[i] & 0x01) {
+    for (i = 0; i < ARRAY_SIZE(r->u8); i++) {
+        if (r->VsrB(i) & 0x01) {
             break;
         }
         count++;
@@ -883,12 +879,8 @@ target_ulong helper_vctzlsbb(ppc_avr_t *r)
 {
     target_ulong count = 0;
     int i;
-#if defined(HOST_WORDS_BIGENDIAN)
     for (i = ARRAY_SIZE(r->u8) - 1; i >= 0; i--) {
-#else
-    for (i = 0; i < ARRAY_SIZE(r->u8); i++) {
-#endif
-        if (r->u8[i] & 0x01) {
+        if (r->VsrB(i) & 0x01) {
             break;
         }
         count++;
@@ -1137,18 +1129,14 @@ void helper_vperm(CPUPPCState *env, ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b,
     ppc_avr_t result;
     int i;
 
-    VECTOR_FOR_INORDER_I(i, u8) {
-        int s = c->u8[i] & 0x1f;
-#if defined(HOST_WORDS_BIGENDIAN)
+    for (i = 0; i < ARRAY_SIZE(r->u8); i++) {
+        int s = c->VsrB(i) & 0x1f;
         int index = s & 0xf;
-#else
-        int index = 15 - (s & 0xf);
-#endif
 
         if (s & 0x10) {
-            result.u8[i] = b->u8[index];
+            result.VsrB(i) = b->VsrB(index);
         } else {
-            result.u8[i] = a->u8[index];
+            result.VsrB(i) = a->VsrB(index);
         }
     }
     *r = result;
@@ -1160,18 +1148,14 @@ void helper_vpermr(CPUPPCState *env, ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b,
     ppc_avr_t result;
     int i;
 
-    VECTOR_FOR_INORDER_I(i, u8) {
-        int s = c->u8[i] & 0x1f;
-#if defined(HOST_WORDS_BIGENDIAN)
+    for (i = 0; i < ARRAY_SIZE(r->u8); i++) {
+        int s = c->VsrB(i) & 0x1f;
         int index = 15 - (s & 0xf);
-#else
-        int index = s & 0xf;
-#endif
 
         if (s & 0x10) {
-            result.u8[i] = a->u8[index];
+            result.VsrB(i) = a->VsrB(index);
         } else {
-            result.u8[i] = b->u8[index];
+            result.VsrB(i) = b->VsrB(index);
         }
     }
     *r = result;
@@ -1868,25 +1852,14 @@ void helper_vsldoi(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b, uint32_t shift)
     int i;
     ppc_avr_t result;
 
-#if defined(HOST_WORDS_BIGENDIAN)
     for (i = 0; i < ARRAY_SIZE(r->u8); i++) {
         int index = sh + i;