RE: spec2k comparison of gcc 4.1 and 4.2 on AMD K8

2007-02-27 Thread Menezes, Evandro
Richard, > Well, both britten and haydn are single core, two processor > systems. For > SPEC2k6 runs the problem is that the 2GB of RAM of the machine are > distributed over both NUMA nodes, so with the memory requirements of > SPEC2k6 we always get inter-node memory traffic. Vangelis is a single

RE: spec2k comparison of gcc 4.1 and 4.2 on AMD K8

2007-02-27 Thread Menezes, Evandro
Nick, > I thought that L2 caches on the Opteron communicated by I > assume by your > response the Opteron memory controller doesn't allow cache > propagation, > instead invalidates the cache entries read (assuming again the write > entries are handled differently). You're half right. The

RE: spec2k comparison of gcc 4.1 and 4.2 on AMD K8

2007-02-27 Thread Menezes, Evandro
Honza, > Well, rather than unstable, they seem to be more memory layout > sensitive I would say. (the differences are more or less reproducible, > not completely random, but independent of the binary itself. I can't > think of much else than memory layout to cause it). I always wondered > if th

RE: I need some advice for x86_64-pc-mingw32 va_list calling convention (in i386.c)

2007-02-23 Thread Menezes, Evandro
See http://msdn2.microsoft.com/en-us/library/ms235286(VS.80).aspx. HTH -- ___ Evandro Menezes AMD Austin, TX > -Original Message- > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On > Behalf Of Kai Tietz

RE: Serious SPEC CPU 2006 FP performance regressions on IA32

2006-12-13 Thread Menezes, Evandro
> Meissner, Michael wrote: > >>> 437.leslie3d -26% > > it was felt that the PPRE patches that were added on > November 13th were > > the cause of the slowdown: > > http://gcc.gnu.org/ml/gcc/2006-12/msg00023.html > > > > Has anybody tried doing a run with just ppre disabled? > > Right. PPRE

RE: Serious SPEC CPU 2006 FP performance regressions on IA32

2006-12-11 Thread Menezes, Evandro
HJ, > > Gcc 4.3 revision 119497 has very poor SPEC CPU 2006 FP performance > > regressions on P4, Pentium M and Core Duo, comparing against > > gcc 4.2 20060910. With -O2, the typical regressions look like > > > > Gcc 4.2 Gcc 4.3 > > 410.bwaves 9.89

RE: Serious SPEC CPU 2006 FP performance regressions on IA32

2006-12-08 Thread Menezes, Evandro
[EMAIL PROTECTED] > Sent: Friday, December 08, 2006 12:06 > To: gcc@gcc.gnu.org > Cc: Menezes, Evandro; [EMAIL PROTECTED] > Subject: Serious SPEC CPU 2006 FP performance regressions on IA32 > > Gcc 4.3 revision 119497 has very poor SPEC CPU 2006 FP performance > regressions on

RE: Re: Visibility=hidden for x86_64 Xen builds -- problems?

2006-09-29 Thread Menezes, Evandro
Jan, > Xen gets compiled with -fPIC, and we recently added a global > visibility > pragma to avoid the cost of going through the GOT for all access to > global data objects (PIC isn't really needed here, all we need is > sufficient compiler support to get the final image located outside the > +/
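
Roughly the kind of pragma being described (the global and function names below are made up for illustration; the pragma itself is GCC's):

  /* sketch: mark globals as hidden so that -fPIC code reaches them
     directly instead of indirecting through the GOT */
  #pragma GCC visibility push(hidden)
  int frame_count;                     /* hypothetical global */
  #pragma GCC visibility pop

  int bump_frame(void)
  {
      /* with hidden visibility, PIC code emits a PC-relative access
         here rather than loading the address from the GOT */
      return ++frame_count;
  }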

RE: Google group for generic System V Application Binary Interface

2006-09-28 Thread Menezes, Evandro
HJ, I think it would be great to put all the de facto changes adopted for i386 in an extension or appendix to its psABI. However, I lean towards an open discussion list. If necessary, I'd be glad to investigate hosting this list at http://www.x86-64.org, even though this discussion

RE: x86 Linux stack alignment requirement

2006-06-08 Thread Menezes, Evandro
> > I see. Provided a local is passed in a register to a > non-vararg function, it is still OK to align the stack. > > Given that we don't support 4 byte aligned stack at all with XMM > registers, I would prefer to increase Linux/x86 stack alignment to > 16 byte. People can use 4 byte alignment

RE: x86 Linux stack alignment requirement

2006-06-07 Thread Menezes, Evandro
> > > > > We have several choices for stack alignment requirement > > > > > > > > > > 1. Leave it unchanged. Gcc can do > > > > > a. Nothing. Let the program crash. > > > > > b. Align stack to 16byte if XMM registers are > used locally and > > > > >aren't passed down as fu

RE: x86 Linux stack alignment requirement

2006-06-07 Thread Menezes, Evandro
> > > We have several choices for stack alignment requirement > > > > > > 1. Leave it unchanged. Gcc can do > > > a. Nothing. Let the program crash. > > > b. Align stack to 16byte if XMM registers are used locally and > > >aren't passed down as function arguments. > > > > Why not

RE: x86 Linux stack alignment requirement

2006-06-07 Thread Menezes, Evandro
HJ, > We have several choices for stack alignment requirement > > 1. Leave it unchanged. Gcc can do > a. Nothing. Let the program crash. > b. Align stack to 16byte if XMM registers are used locally and >aren't passed down as function arguments. Why not so if the XMM regi
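
A minimal sketch of the crash that option 1a above accepts, i.e. an XMM local meeting a 4-byte-aligned stack (the function name and exact flags are my own choice):

  /* gcc -m32 -O2 -msse file.c
     gcc lays out 'out' at a 16-byte offset within the frame, assuming
     the incoming stack is already 16-byte aligned; if a caller only
     keeps the ABI's 4-byte alignment, the aligned store faults */
  #include <xmmintrin.h>

  float sum4(const float *p)
  {
      float out[4] __attribute__((aligned(16)));
      __m128 v = _mm_loadu_ps(p);      /* unaligned load of the input */
      v = _mm_add_ps(v, v);
      _mm_store_ps(out, v);            /* requires a 16-byte-aligned slot */
      return out[0] + out[1] + out[2] + out[3];
  }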

RE: RFC: x86 Linux stack alignment requirement

2006-06-07 Thread Menezes, Evandro
> There are 2 different, but related questions: > > 1. Should Linux require gcc generates 16byte aligned stack? > 2. How should Linux support 4byte alignment code? Independently of Linux, GCC could align the stack at 16 bytes and still be compliant with the psABI. It could also wrap memalign as
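
One possible reading of the memalign idea in the truncated sentence above (my guess; the wrapper name is made up, posix_memalign is the standard interface):

  /* an allocator wrapper that always returns 16-byte-aligned blocks,
     so heap data handed to SSE code is safe regardless of what malloc
     itself guarantees */
  #include <stdlib.h>

  void *xmalloc16(size_t n)
  {
      void *p = NULL;
      if (posix_memalign(&p, 16, n) != 0)   /* memalign(16, n), portably */
          return NULL;
      return p;
  }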

RE: GCC 4.1.0 Released

2006-03-07 Thread Menezes, Evandro
Florian, > * H. J. Lu: > > > Here are diffs of SPEC CPU 2K between before and after with gcc 4.1 > > using "-O2 -ffast-math" on Nocona: > > And what about Opterons? IOW, how "generic" is the optimization? The generic code generation should cost a small compromise in performance relative to

RE: pushl vs movl + movl on x86

2005-08-23 Thread Menezes, Evandro
Dan, > Is there a performance difference between the movl + movl and > pushl code sequences? Not in this example, but movl is faster in some circumstances than pushl. A sequence of pushl has an implicit dependency chain on %esp, as it changes after each pushl, whereas a sequence of movl cou
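
A small way to see the two sequences side by side (the flags are the stock i386 options; the functions are only illustrative):

  /* gcc passes outgoing arguments with pushl by default (-mpush-args)
     or with movl stores into a preallocated area under
     -maccumulate-outgoing-args; compare the two with
        gcc -m32 -O2 -S call.c
        gcc -m32 -O2 -maccumulate-outgoing-args -S call.c
     the movl form has no back-to-back dependency on %esp */
  extern int f(int, int, int);

  int g(int x)
  {
      return f(x, x + 1, x + 2);
  }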

RE: Porposal: Floating-Point Options

2005-06-16 Thread Menezes, Evandro
> If this option makes it into GCC, maybe it could be named > -O3_unsafe. How about the popular -fast? -- ___ Evandro Menezes AMD Austin, TX

RE: Big differences on SpecFP results for gcc and icc

2005-06-13 Thread Menezes, Evandro
Robert, > > I know that these graphs don't show the results of most aggressive > > optimization options for gcc, but that is also the case > with icc (only > > -O2). However, it looks that gcc and icc are not even in the same > > class regarding FP performance. Perhaps there is some critical

RE: Big differences on SpecFP results for gcc and icc

2005-06-13 Thread Menezes, Evandro
Steven, > > An interesting examples are: > > -177.mesa (this is a c test), where icc is almost 40% faster > > It would be interesting to look into this one. A combination of SSE instead of x87, vectorization, vectorized math library, and very good whole-program IPA. -- _
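
A toy loop of the sort that combination targets, just to make the list concrete (the file name and flags are my choice, not from the benchmark):

  /* e.g. gcc -O2 -msse2 -mfpmath=sse -ftree-vectorize saxpy.c
     keeps the arithmetic in XMM registers and lets the vectorizer
     process four floats per iteration */
  void saxpy(int n, float a, const float *x, float *y)
  {
      int i;
      for (i = 0; i < n; i++)
          y[i] = a * x[i] + y[i];
  }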

RE: Sine and Cosine Accuracy

2005-05-27 Thread Menezes, Evandro
Scott, > I still maintain that hardware fsin and fcos are valid and > valuable for certain classes of applications, I agree. I've just been trying to demonstrate that your test doesn't check sin and cos accuracies, but that sin^2 + cos^2 == 1. If I had a sin that always returned 1.0 and a c
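
To make the point concrete, a deliberately broken pair that still passes the identity test (entirely contrived, of course):

  /* a sin/cos pair that is always wrong yet satisfies
     sin^2 + cos^2 == 1 exactly: the identity checks consistency,
     not accuracy */
  #include <stdio.h>

  static double bogus_sin(double x) { (void)x; return 1.0; }
  static double bogus_cos(double x) { (void)x; return 0.0; }

  int main(void)
  {
      double x, s, c;
      for (x = 0.0; x < 6.3; x += 0.1) {
          s = bogus_sin(x);
          c = bogus_cos(x);
          printf("%4.1f: s^2 + c^2 = %g\n", x, s * s + c * c);  /* always 1 */
      }
      return 0;
  }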

RE: Sine and Cosine Accuracy

2005-05-27 Thread Menezes, Evandro
Scott, > Actually, it tested every 1.8°, but who wants to be picky. > I've rerun the test overnight at greater resolution, testing every > 0.0018 degrees, and saw no change in the result. That's because the error is the same but symmetrical for sin and cos, so that, when you calculate the

RE: Sine and Cosine Accuracy

2005-05-26 Thread Menezes, Evandro
Scott, > > This is not true. Compare results on an x86 system with > those on an > > x86_64 or ppc. As I said before, shortcuts were taken in x87 that > > sacrificed accuracy for the sake of speed initially and later of > > compatibility. > > It *is* true for the case where the argument is

RE: Sine and Cosine Accuracy

2005-05-26 Thread Menezes, Evandro
Scott, > For a wide variety of applications, the hardware intrinsics > provide both faster and more accurate results, when compared > to the library functions. This is not true. Compare results on an x86 system with those on an x86_64 or ppc. As I said before, shortcuts were taken in x87 t
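
A minimal harness for the comparison being suggested, assuming a 32-bit build where -funsafe-math-optimizations lets gcc expand sin() to fsin (the flags and test value are my choice):

  /* build it twice and diff the output across machines/flags:
       gcc -m32 -O2 -mfpmath=387 -funsafe-math-optimizations sin.c -lm
       gcc -O2 sin.c -lm
     large arguments expose the x87 range-reduction shortcut */
  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
      volatile double x = 1.0e10;   /* volatile: keep gcc from folding sin() at compile time */
      printf("sin(%g) = %.17g\n", (double)x, sin(x));
      return 0;
  }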

RE: Sine and Cosine Accuracy

2005-05-26 Thread Menezes, Evandro
Uros, > However, the argument to fsin can be reduced to an > acceptable range by using the fmod builtin. Internally, this > builtin is implemented as a very tight loop that checks for > insufficient reduction, and could reduce whatever finite > value one wishes. Keep in mind that x87 transcende
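
The reduction being described, roughly (the constant and helper name are mine; the result is only as accurate as the two-pi constant used here):

  /* fold the argument into [0, 2*pi) with fmod before calling sin */
  #include <math.h>

  double sin_reduced(double x)
  {
      const double twopi = 6.28318530717958647692;
      double r = fmod(x, twopi);
      if (r < 0.0)
          r += twopi;        /* fmod keeps the sign of x */
      return sin(r);
  }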

RE: GCC and Floating-Point

2005-05-25 Thread Menezes, Evandro
Uros, > > Actually, in many cases, SSE did help x86 performance as > well. That > > happens in FP-intensive applications which spend a lot of time in > > loops when the XMM register set can be used more > efficiently than the x87 stack. > > This code could be a perfect example of how XMM reg

RE: GCC and Floating-Point

2005-05-25 Thread Menezes, Evandro
Hi, Uros. > Due to the outdated i386 ABI, where all FP parameters are > passed on the stack, SSE code does not show all its power when > used. When a math library function is called, SSE regs are > pushed on the stack and the called math library function (that is > currently implemented again with i387 insns) p
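
A small example of the round trip being described, under the stock i386 ABI (the function is only for illustration):

  /* with gcc -m32 -O2 -msse2 -mfpmath=sse:
     - 'a' and 'b' arrive on the stack (the i386 ABI passes FP arguments
       in memory, not registers),
     - exp() from libm returns its result in st(0) per the same ABI,
     - so the value is stored and reloaded into an XMM register before
       the SSE multiply: the shuffling this message describes */
  #include <math.h>

  double scaled_exp(double a, double b)
  {
      return exp(a) * b;
  }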