Re: gomp slowness

2007-11-03 Thread Jakub Jelinek
On Fri, Nov 02, 2007 at 11:09:33PM -0700, Ian Lance Taylor wrote: skaller writes: As I said before, the register is only stolen for code which actually uses TLS. So scanning that document, for x86_64, fs is used in startup code, presumably if, and only if, there

Re: gomp slowness

2007-11-03 Thread Sylvain Pion
skaller wrote: I can tell you I definitely considered using FS for the Felix thread frame pointer to save passing that pointer between every function.. But then, won't you end up with an implementation very similar to __thread?? -- Sylvain Pion INRIA Sophia-Antipolis Geometrica Project-Team

Re: gomp slowness

2007-11-03 Thread skaller
On Sat, 2007-11-03 at 10:35 +0100, Sylvain Pion wrote: skaller wrote: I can tell you I definitely considered using FS for the Felix thread frame pointer to save passing that pointer between every function.. But then, won't you end up with an implementation very similar to __thread??

Re: gomp slowness

2007-11-02 Thread Daniel Jacobowitz
On Fri, Nov 02, 2007 at 07:39:33AM -0700, Ian Lance Taylor wrote: The only way I can interpret your comments is that you are assuming that all TLS is Global Dynamic (e.g., accessed from a dlopen'ed shared library). But stack based thread local storage won't work for dlopen'ed shared libraries

Re: gomp slowness

2007-11-02 Thread Ian Lance Taylor
skaller writes: A really cool (non-Posix) implementation would put TLS globals on the stack base .. but this does require at least one extra machine register in languages like C which don't provide a static display (pointer to parent function). For languages that do, such

Re: gomp slowness

2007-11-02 Thread skaller
On Fri, 2007-11-02 at 07:39 -0700, Ian Lance Taylor wrote: skaller writes: In a C executable, TLS requires one extra machine register. You mean gcc? TLS variables are accessed via offsets from that register. So what's the significant difference between that and your

Re: gomp slowness

2007-11-02 Thread skaller
On Fri, 2007-11-02 at 10:46 -0400, Daniel Jacobowitz wrote: On Fri, Nov 02, 2007 at 07:39:33AM -0700, Ian Lance Taylor wrote: The only way I can interpret your comments is that you are assuming that all TLS is Global Dynamic (e.g., accessed from a dlopen'ed shared library). But stack

Re: gomp slowness

2007-11-02 Thread skaller
On Thu, 2007-11-01 at 21:02 -0700, Gary Funck wrote: On Thu, Oct 18, 2007 at 11:42:52AM +1000, skaller wrote: Do you know how thread local variables are handled? [Not using Posix TLS I hope .. that would be a disaster] Would you please elaborate? Sure .. What's wrong with the

Re: gomp slowness

2007-11-02 Thread Ian Lance Taylor
skaller writes: On Fri, 2007-11-02 at 07:39 -0700, Ian Lance Taylor wrote: skaller writes: In a C executable, TLS requires one extra machine register. You mean gcc? I don't understand the question. I mean in a C/C++ executable which uses TLS. By

Re: gomp slowness

2007-11-02 Thread Olivier Galibert
On Sat, Nov 03, 2007 at 03:31:14AM +1100, skaller wrote: On Fri, 2007-11-02 at 07:39 -0700, Ian Lance Taylor wrote: I think you need to look at the TLS access code before deciding that it has bad performance. You already said it costs a register? That's a REALLY high cost to pay to

Re: gomp slowness

2007-11-02 Thread Olivier Galibert
On Sat, Nov 03, 2007 at 03:38:51AM +1100, skaller wrote: My argument is basically: there is no need for any such feature in a well written program. Each thread already has its own local stack. Global variables should not be used in the first place (except for signals etc where there is no

Re: gomp slowness

2007-11-02 Thread Robert Dewar
Olivier Galibert wrote: On Sat, Nov 03, 2007 at 03:38:51AM +1100, skaller wrote: My argument is basically: there is no need for any such feature in a well written program. Each thread already has its own local stack. Global variables should not be used in the first place (except for signals etc

Re: gomp slowness

2007-11-02 Thread Joel Dice
On Sat, 3 Nov 2007, skaller wrote: On Fri, 2007-11-02 at 10:46 -0400, Daniel Jacobowitz wrote: On Fri, Nov 02, 2007 at 07:39:33AM -0700, Ian Lance Taylor wrote: The only way I can interpret your comments is that you are assuming that all TLS is Global Dynamic (e.g., accessed from a dlopen'ed

Re: gomp slowness

2007-11-02 Thread skaller
On Fri, 2007-11-02 at 10:29 -0700, Ian Lance Taylor wrote: skaller writes: On Fri, 2007-11-02 at 07:39 -0700, Ian Lance Taylor wrote: skaller writes: In a C executable, TLS requires one extra machine register. You mean gcc? I don't

Re: gomp slowness

2007-11-02 Thread skaller
On Fri, 2007-11-02 at 19:56 +0100, Olivier Galibert wrote: On Sat, Nov 03, 2007 at 03:31:14AM +1100, skaller wrote: On Fri, 2007-11-02 at 07:39 -0700, Ian Lance Taylor wrote: I think you need to look at the TLS access code before deciding that it has bad performance. You already

Re: gomp slowness

2007-11-02 Thread skaller
On Fri, 2007-11-02 at 20:00 +0100, Olivier Galibert wrote: On Sat, Nov 03, 2007 at 03:38:51AM +1100, skaller wrote: My argument is basically: there is no need for any such feature in a well written program. Each thread already has its own local stack. Global variables should not be used

Re: gomp slowness

2007-11-02 Thread Andrew Pinski
This is not true. If you use a register for any purpose like this, it can't be used for anything else and that has a cost. This is a segment register. Please go and read about what segment registers are. They are not real registers and cannot be used for anything except memory accesses. They

Re: gomp slowness

2007-11-02 Thread skaller
On Fri, 2007-11-02 at 15:31 -0400, Robert Dewar wrote: Olivier Galibert wrote: There are lots of cases where global thread specific variables are useful in practice, ask anyone who has programmed real world large scale real time embedded programs. No. And I have done just that myself. There

Re: gomp slowness

2007-11-02 Thread Ian Lance Taylor
skaller writes: On Fri, 2007-11-02 at 10:29 -0700, Ian Lance Taylor wrote: skaller writes: On Fri, 2007-11-02 at 07:39 -0700, Ian Lance Taylor wrote: skaller writes: In a C executable, TLS requires one extra machine

Re: gomp slowness

2007-11-02 Thread skaller
On Sat, 2007-11-03 at 12:27 +1100, skaller wrote: On Fri, 2007-11-02 at 10:29 -0700, Ian Lance Taylor wrote: Of course there is. It's called design by contract. I do it all the time. I am appalled at code bases like GTK and interfaces like OpenMP which get such really basic things wrong.

Re: gomp slowness

2007-11-02 Thread skaller
On Fri, 2007-11-02 at 18:45 -0700, Andrew Pinski wrote: This is not true. If you use a register for any purpose like this, it can't be used for anything else and that has a cost. This is a segment register. Please go and read about what segment registers are. I know how the x86 works quite

Re: gomp slowness

2007-11-02 Thread Robert Dewar
skaller wrote: This is not true. If you use a register for any purpose like this, it can't be used for anything else and that has a cost. On x86_64 which I use, every register is valuable. Don't you dare take one away, it would have a serious performance impact AND it would stop ME using that

Re: gomp slowness

2007-11-02 Thread Ian Lance Taylor
skaller writes: Neko, for example, uses a register. AFAIK MLton does the same kind of thing. If gcc team thinks ANY register is free to steal they'd be wrong -- that doesn't mean it shouldn't be used, just that it definitely is NOT free. To be clear, it is not the gcc team

Re: gomp slowness

2007-11-02 Thread skaller
On Fri, 2007-11-02 at 23:56 -0400, Robert Dewar wrote: skaller wrote: You really can't be serious in your comment about fs, if you understand the architecture ... You're just not thinking the same way I am. A CPU has state, the compiler and application program manage that state. If the

Re: gomp slowness

2007-11-02 Thread skaller
On Fri, 2007-11-02 at 22:35 -0700, Ian Lance Taylor wrote: skaller writes: Neko, for example, uses a register. AFAIK MLton does the same kind of thing. If gcc team thinks ANY register is free to steal they'd be wrong -- that doesn't mean it shouldn't be used, just

Re: gomp slowness

2007-11-02 Thread Ian Lance Taylor
skaller writes: As I said before, the register is only stolen for code which actually uses TLS. So scanning that document, for x86_64, fs is used in startup code, presumably if, and only if, there is a linker section containing __thread variables? Yes. Ian
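
For readers following the __thread discussion, here is a minimal sketch of what a GCC thread-local variable looks like in practice. It is not code from the thread: the names tls_counter, bump, run_in_new_thread and main_thread_counter are hypothetical, and it assumes a POSIX threads platform. On x86_64 GNU/Linux, GCC typically compiles accesses like these in an executable into %fs-relative memory operands, which is the register use debated above.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical example variable: every thread gets its own copy,
 * initialised to 0 when the thread starts. */
static __thread int tls_counter = 0;

/* Thread body: increment this thread's copy `times` times, then report
 * the final value back through the argument pointer. */
static void *bump(void *arg)
{
    int *slot = arg;
    int times = *slot;
    int i;

    for (i = 0; i < times; i++)
        tls_counter++;

    *slot = tls_counter;
    return NULL;
}

/* Run bump() in a fresh thread and return that thread's final counter. */
int run_in_new_thread(int times)
{
    pthread_t thread;
    int value = times;
    int rc;

    rc = pthread_create(&thread, NULL, bump, &value);
    assert(rc == 0);
    rc = pthread_join(thread, NULL);
    assert(rc == 0);
    return value;
}

/* Read the calling thread's copy of the counter. */
int main_thread_counter(void)
{
    return tls_counter;
}
```

Calling run_in_new_thread(5) and then run_in_new_thread(3) from the main thread should yield 5 and 3, because each new thread starts from its own zeroed copy, while main_thread_counter() stays at 0. Building with gcc -S and inspecting the assembly is one way to check how the accesses are encoded on a given target.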

Re: gomp slowness

2007-11-01 Thread Gary Funck
On Thu, Oct 18, 2007 at 11:42:52AM +1000, skaller wrote: Do you know how thread local variables are handled? [Not using Posix TLS I hope .. that would be a disaster] Would you please elaborate? What's wrong with the POSIX TLS implementation? Do you know of any studies? I ask, because we

Re: gomp slowness

2007-10-20 Thread Tomash Brechko
I'm not sure what OpenMP spec says about default data scope (too lazy to read through), but it seems that examples from http://kallipolis.com/openmp/2.html assume default(private), while GCC GOMP defaults to shared. In your case, #pragma omp parallel for shared(A, row, col) for (i = k+1;
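
The data-scope point above can be illustrated with a small sketch. This is not the LU code posted in the thread (that code is not reproduced in this listing); lu_eliminate_step and lu_factor are hypothetical names, and the matrix layout (row-major, no pivoting) is an assumption. In OpenMP C, a variable declared outside a parallel region is shared unless a clause says otherwise, so an inner loop index declared at function scope has to be listed as private.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: one elimination step of an in-place LU
 * factorisation (no pivoting) on a row-major n x n matrix A. */
void lu_eliminate_step(double *A, int n, int k)
{
    int i, j;   /* declared outside the parallel region on purpose */

    assert(A != NULL && n > 0 && k >= 0 && k < n);

    /* default(none) makes the compiler reject any variable whose data
     * scope is not spelled out. The parallel loop index i is private
     * automatically; j must be listed as private, otherwise it would be
     * shared and the threads would race on it. */
    #pragma omp parallel for default(none) shared(A, n, k) private(j)
    for (i = k + 1; i < n; i++) {
        double factor = A[i * n + k] / A[k * n + k];

        A[i * n + k] = factor;              /* store the L multiplier */
        for (j = k + 1; j < n; j++)
            A[i * n + j] -= factor * A[k * n + j];
    }
}

/* Factor the whole matrix by applying the step for each pivot column. */
void lu_factor(double *A, int n)
{
    int k;

    for (k = 0; k < n - 1; k++)
        lu_eliminate_step(A, n, k);
}
```

Compiled with gcc -fopenmp the rows below the pivot are distributed across threads; without -fopenmp the pragma is ignored and the same code runs serially, which makes it easy to compare timings under different OMP_NUM_THREADS settings.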

Re: gomp slowness

2007-10-20 Thread skaller
On Sat, 2007-10-20 at 22:32 +0400, Tomash Brechko wrote: I'm not sure what OpenMP spec says about default data scope (too lazy to read through), but it seems that examples from http://kallipolis.com/openmp/2.html assume default(private), while GCC GOMP defaults to shared. In your case,

Re: gomp slowness

2007-10-18 Thread Biplab Kumar Modak
skaller wrote: OK, attached. Hi skaller, I think I've wasted my money. They do not ship OpenMP headers and libs with Standard Edition. :( Best Regards, Biplab

Re: gomp slowness

2007-10-18 Thread Biplab Kumar Modak
Hi All, I did some tests with GCC-4.2.2 (MinGW build) and the source code provided by skaller. The compilation log is as follows. -- Build: Release in Test --- [ 50.0%] mingw32-gcc.exe -Wall -fexceptions -fopenmp -O2 -IC:\MinGW\include -c

Re: gomp slowness

2007-10-18 Thread Jakub Jelinek
On Thu, Oct 18, 2007 at 02:47:44PM +1000, skaller wrote: On Thu, 2007-10-18 at 12:02 +0800, Biplab Kumar Modak wrote: skaller wrote: On Wed, 2007-10-17 at 18:14 +0100, Biagio Lucini wrote: skaller wrote: It would be interesting to try with another compiler. Do you have access

Re: gomp slowness

2007-10-18 Thread Tim Prince
skaller wrote: On Thu, 2007-10-18 at 12:02 +0800, Biplab Kumar Modak wrote: skaller wrote: On Wed, 2007-10-17 at 18:14 +0100, Biagio Lucini wrote: skaller wrote: It would be interesting to try with another compiler. Do you have access to another OpenMP-enabled

Re: gomp slowness

2007-10-18 Thread skaller
On Thu, 2007-10-18 at 06:00 -0700, Tim Prince wrote: skaller wrote: I don't know of any OpenMP compiler which would correct the nesting of parallel loops in your LU. I have assumed that OpenMP doesn't allow such optimization; you have to get it right yourself. Can you explain? This code

Re: gomp slowness

2007-10-18 Thread skaller
On Thu, 2007-10-18 at 13:04 +0200, Jakub Jelinek wrote: On Thu, Oct 18, 2007 at 02:47:44PM +1000, skaller wrote: On LU_mp.c according to oprofile more than 95% of time is spent in the inner loop, rather than any kind of waiting. On quad core with OMP_NUM_THREADS=4 all 4 threads eat 99.9% of

Re: gomp slowness

2007-10-18 Thread tim prince
skaller wrote: On Thu, 2007-10-18 at 06:00 -0700, Tim Prince wrote: skaller wrote: I don't know of any OpenMP compiler which would correct the nesting of parallel loops in your LU. I have assumed that OpenMP doesn't allow such optimization; you have to get it right yourself.

RE: gomp slowness

2007-10-18 Thread Dave Korn
On 19 October 2007 02:45, tim prince wrote: skaller wrote: On Thu, 2007-10-18 at 06:00 -0700, Tim Prince wrote: skaller wrote: I don't know of any OpenMP compiler which would correct the nesting of parallel loops in your LU. I have assumed that OpenMP doesn't allow such

gomp slowness

2007-10-17 Thread skaller
Hi, I have just run and timed a couple of tutorial examples for openMP using gcc (GCC) 4.2.1 (Ubuntu 4.2.1-5ubuntu4) on a dual core Athlon amd64, with OMP_NUM_THREADS set to 1 and 2, and occasionally 8. I found that 1 thread outperforms 2 by almost 2:1 on all the examples, and 8 is only

Re: gomp slowness

2007-10-17 Thread Joe Buck
On Thu, Oct 18, 2007 at 03:00:02AM +1000, skaller wrote: Hi, I have just run and timed a couple of tutorial examples for openMP using gcc (GCC) 4.2.1 (Ubuntu 4.2.1-5ubuntu4) on a dual core Athlon amd64, with OMP_NUM_THREADS set to 1 and 2, and occasionally 8. I found that 1 thread outperforms

Re: gomp slowness

2007-10-17 Thread Biagio Lucini
skaller wrote: Hi, I have just run and timed a couple of tutorial examples for openMP using gcc (GCC) 4.2.1 (Ubuntu 4.2.1-5ubuntu4) on a dual core Athlon amd64, with OMP_NUM_THREADS set to 1 and 2, and occasionally 8. I found that 1 thread outperforms 2 by almost 2:1 on all the examples, and 8

Re: gomp slowness

2007-10-17 Thread skaller
On Wed, 2007-10-17 at 18:14 +0100, Biagio Lucini wrote: skaller wrote: It would be interesting to try with another compiler. Do you have access to another OpenMP-enabled compiler? Unfortunately no, unless MSVC++ in VS2005 has openMP. I have an Intel licence but they're too tied up with

Re: gomp slowness

2007-10-17 Thread skaller
On Wed, 2007-10-17 at 10:09 -0700, Joe Buck wrote: On Thu, Oct 18, 2007 at 03:00:02AM +1000, skaller wrote: Hi, I have just run and timed a couple of tutorial examples for openMP using gcc (GCC) 4.2.1 (Ubuntu 4.2.1-5ubuntu4) on a dual core Athlon amd64, with OMP_NUM_THREADS set to 1 and 2,

Re: gomp slowness

2007-10-17 Thread Ross Ridge
skaller writes: Unfortunately no, unless MSVC++ in VS2005 has openMP. I don't know if Visual C++ 2005 Express supports OpenMP, but the Professional edition should. Alternatively, the free, as in beer, Microsoft compiler included in the Windows SDK supports OpenMP.

Re: gomp slowness

2007-10-17 Thread skaller
On Wed, 2007-10-17 at 10:09 -0700, Joe Buck wrote: On Thu, Oct 18, 2007 at 03:00:02AM +1000, skaller wrote: Hi, I have just run and timed a couple of tutorial examples for openMP using gcc (GCC) 4.2.1 (Ubuntu 4.2.1-5ubuntu4) on a dual core Athlon amd64, with OMP_NUM_THREADS set to 1 and 2,

Re: gomp slowness

2007-10-17 Thread Biplab Kumar Modak
Ross Ridge wrote: skaller writes: Unfortunately no, unless MSVC++ in VS2005 has openMP. I don't know if Visual C++ 2005 Express supports OpenMP, but the Professional edition should. Alternatively, the free, as in beer, Microsoft compiler included in the Windows SDK supports OpenMP. Visual

Re: gomp slowness

2007-10-17 Thread Biplab Kumar Modak
skaller wrote: On Wed, 2007-10-17 at 18:14 +0100, Biagio Lucini wrote: skaller wrote: It would be interesting to try with another compiler. Do you have access to another OpenMP-enabled compiler? Unfortunately no, unless MSVC++ in VS2005 has openMP. I have an Intel licence but they're too

Re: gomp slowness

2007-10-17 Thread skaller
On Thu, 2007-10-18 at 12:02 +0800, Biplab Kumar Modak wrote: skaller wrote: On Wed, 2007-10-17 at 18:14 +0100, Biagio Lucini wrote: skaller wrote: It would be interesting to try with another compiler. Do you have access to another OpenMP-enabled compiler? Unfortunately no,