Re: tooling quality and some random rant
On 14/02/2011 12:37, Jacob Carlborg wrote: On 2011-02-13 16:07, Gary Whatmore wrote: Paulo Pinto Wrote: "Nick Sabalausky" wrote in message news:ij7v76$1q4t$1...@digitalmars.com... ... (snip) ... That's not the compiler, that's the linker. I don't know what linker DMD uses on OSX, but on Windows it uses OPTLINK which is written in hand-optimized Asm so it's really hard to change. But Walter's been converting it to C (and maybe then to D once that's done) bit-by-bit (so to speak), so linker improvements are at least on the horizon. ... Why C and not directly D? It is really bad advertising for D to know that when its creator came around to rewriting the linker, Walter decided to use C instead of D. I'm guessing that Walter feels more familiar and comfortable developing C/C++ instead of D. He's the creator of D, but has written very small amounts of D and probably cannot write idiomatic D very fluently. Another issue is the immature toolchain. This might sound like blasphemy, but I believe the skills and knowledge for developing large scale applications in language XYZ cannot be extrapolated from small code snippets or from experience with projects in other languages. You just need to eat your own dogfood and get your feet wet by doing. People like Tango's 'kris' and this 'h3r3tic' are the real world D experts. Sadly they've all left D. We need a new generation of experts, because these old guys ranting about every issue are more harmful than good to the community. Kris is still around. Out of curiosity, what do you mean by "still around"? Still working with D? -- Bruno Medeiros - Software Engineer
Re: tooling quality and some random rant
On 13/02/2011 23:28, retard wrote: Sun, 13 Feb 2011 15:06:46 -0800, Brad Roberts wrote: On 2/13/2011 3:01 PM, Walter Bright wrote: Michel Fortin wrote: But note I was replying to your reply to Denis who asked specifically for demangled names for missing symbols. This by itself would be a useful improvement. I agree with that, but there's a caveat. I did such a thing years ago for C++ and Optlink. Nobody cared, including the people who asked for that feature. It's a bit demotivating to bother doing that again. No offense, but this argument gets kinda old and it's incredibly weak. Today's tooling expectations are higher. The audience isn't the same. And clearly people are asking for it. Even with the past version of it, I highly doubt no one cared; you just didn't hear from those that liked it. After all, few people go out of their way to talk about what they like, just what they don't. Half of the readers have already added me to their killfile, but here goes some on-topic humor: http://www.winandmac.com/wp-content/uploads/2010/03/ipad-hp-fail.jpg The only fail here is that comparison. -- Bruno Medeiros - Software Engineer
Re: tooling quality and some random rant
nedbrek wrote: Hope that helps, Thanks, this is great info!
Re: tooling quality and some random rant
"distcc" wrote in message news:ijp9ji$1hvd$1...@digitalmars.com... > nedbrek Wrote: >> "Walter Bright" wrote in message >> news:ijnt3o$22dm$1...@digitalmars.com... >>> nedbrek wrote: Also, "macro op fusion" allows you can get a branch along with the last instruction in decode, potentially giving you 5 macroinstructions per cycle from decode. Make sure it is the flags producing instruction (cmp-br). >>> >>> I can't find any Intel documentation on this. Can you point me to some? >> >> The best available source is the optimization reference manual >> (http://www.intel.com/products/processor/manuals/). The latest version >> is >> 248966.pdf, which mentions "Decodes up to four instructions, or up to >> five >> with macro-fusion" (page 33). Also, page 36: "Macro-fusion merges two >> instructions into a single ?op. Intel Core microarchitecture is capable >> of >> one macro-fusion per cycle in 32-bit operation". It's unclear if macro >> fusion is off entirely in 64 bit mode, and whether this has changed in >> more >> recent processors... > > I remember reading that macro fusion is entirely off in 64 bit mode in > Nehalem > and earlier generations, and supported in Sandy Bridge. > > When generating code for loops, the compiler could also make use of Loop > Stream > Coder to avoid i-cache misses. Serves me right, it is a little further in, page 52: "In Intel microarchitecture (Nehalem) , macro-fusion is supported in 64-bit mode, and the following instruction sequences are supported: (big list)". That would leave it off of 65nm (Merom) and 45nm (Penryn) parts. These are identifiable through CPUID. The guide is broken up into sections based on the particular chip, so you end up having to read them all to get a general feel for things... Ned
Re: tooling quality and some random rant
nedbrek Wrote: > Hello, > > "Walter Bright" wrote in message > news:ijnt3o$22dm$1...@digitalmars.com... > > nedbrek wrote: > >> Reordering happens in the scheduler. A simple model is "Fetch", > >> "Schedule", "Retire". Fetch and retire are done in program order. For > >> code that is hitting well in the cache, the biggest bottleneck is that > >> "4" decoder (the complex instruction decoder). Reducing the number of > >> complex instructions will be a big win here (and settling them into the > >> 4-1-1(-1) pattern). > >> > >> Of course, on anything after Core 2, the "1" decoders can handle pushes, > >> pops, and load-ops (r+=m) (although not load-op-store (m+=r)). > >> > >> Also, "macro op fusion" allows you to get a branch along with the last > >> instruction in decode, potentially giving you 5 macroinstructions per > >> cycle from decode. Make sure it is the flags producing instruction > >> (cmp-br). > >> > > > > I can't find any Intel documentation on this. Can you point me to some? > > The best available source is the optimization reference manual > (http://www.intel.com/products/processor/manuals/). The latest version is > 248966.pdf, which mentions "Decodes up to four instructions, or up to five > with macro-fusion" (page 33). Also, page 36: "Macro-fusion merges two > instructions into a single µop. Intel Core microarchitecture is capable of > one macro-fusion per cycle in 32-bit operation". It's unclear if macro > fusion is off entirely in 64 bit mode, and whether this has changed in more > recent processors... I remember reading that macro fusion is entirely off in 64 bit mode in Nehalem and earlier generations, and supported in Sandy Bridge. When generating code for loops, the compiler could also make use of the Loop Stream Detector to avoid i-cache misses.
Re: tooling quality and some random rant
Hello, "Walter Bright" wrote in message news:ijnt3o$22dm$1...@digitalmars.com... > nedbrek wrote: >> Reordering happens in the scheduler. A simple model is "Fetch", >> "Schedule", "Retire". Fetch and retire are done in program order. For >> code that is hitting well in the cache, the biggest bottleneck is that >> "4" decoder (the complex instruction decoder). Reducing the number of >> complex instructions will be a big win here (and settling them into the >> 4-1-1(-1) pattern). >> >> Of course, on anything after Core 2, the "1" decoders can handle pushes, >> pops, and load-ops (r+=m) (although not load-op-store (m+=r)). >> >> Also, "macro op fusion" allows you to get a branch along with the last >> instruction in decode, potentially giving you 5 macroinstructions per >> cycle from decode. Make sure it is the flags producing instruction >> (cmp-br). >> > > I can't find any Intel documentation on this. Can you point me to some? The best available source is the optimization reference manual (http://www.intel.com/products/processor/manuals/). The latest version is 248966.pdf, which mentions "Decodes up to four instructions, or up to five with macro-fusion" (page 33). Also, page 36: "Macro-fusion merges two instructions into a single µop. Intel Core microarchitecture is capable of one macro-fusion per cycle in 32-bit operation". It's unclear if macro fusion is off entirely in 64 bit mode, and whether this has changed in more recent processors... They recommend against aligning code in general to 4-1-1-1 (also page 36), but I'd assume this is for a very targeted application. As always, it is best to run things both ways and measure. The next section (2.1.2.5) talks about stack pointer tracking - which allows macro operations which used to be 2 uops (pop r -> load r = [esp]; inc esp) to become one (just the load). Pushes, which used to be 3 uops (store_address esp, store_data r, dec esp) should also be one fused uop (via sta/std fusion and stack pointer tracking). Another good resource is "Real World Tech", particularly: http://www.realworldtech.com/page.cfm?ArticleID=RWT030906143144 Page 4 covers the front end: "Macro-op fusion lets the decoders combine two macro instructions into a single uop. Specifically, x86 compare or test instructions are fused with x86 jumps to produce a single uop and any decoder can perform this optimization." Finally, the Intel Technology Journal has some really good details (when you can find them! :) For example: http://download.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/vol7iss2_art03.pdf details the original processor to use micro-op fusion (Pentium M or Banias - which was the base design for Dothan and Yonah). See page 26 (epage 7/18) - which starts the section "MICRO-OPS FUSION". It gives a lot of detail of the store address / store data fusion. Hope that helps, Ned
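To make the cmp/branch pairing concrete, here is a minimal, hypothetical sketch in dmd's x86 inline assembler (the function and the loop are invented purely for illustration): the flag-producing cmp is placed immediately before the conditional jump that consumes it, which is the sequence the decoders can macro-fuse.

    int countTo(int limit)
    {
        int n;
        asm
        {
            mov EAX, 0;
            mov ECX, limit;
        L1:
            inc EAX;
            cmp EAX, ECX;   // flag-producing instruction...
            jl  L1;         // ...immediately followed by the branch: a macro-fusion candidate
            mov n, EAX;
        }
        return n;
    }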
Re: tooling quality and some random rant
nedbrek wrote: Reordering happens in the scheduler. A simple model is "Fetch", "Schedule", "Retire". Fetch and retire are done in program order. For code that is hitting well in the cache, the biggest bottleneck is that "4" decoder (the complex instruction decoder). Reducing the number of complex instructions will be a big win here (and settling them into the 4-1-1(-1) pattern). Of course, on anything after Core 2, the "1" decoders can handle pushes, pops, and load-ops (r+=m) (although not load-op-store (m+=r)). Also, "macro op fusion" allows you to get a branch along with the last instruction in decode, potentially giving you 5 macroinstructions per cycle from decode. Make sure it is the flags producing instruction (cmp-br). (I used to work for Intel :) I can't find any Intel documentation on this. Can you point me to some?
Re: tooling quality and some random rant
Hello all, "Walter Bright" wrote in message news:ijeih9$2aso$2...@digitalmars.com... > Don wrote: >> That would really be fun. >> BTW, the current Intel processors are basically the same as Pentium Pro, >> with a few improvements. The strange thing is, because of all of the >> reordering that happens, swapping the order of two (non-dependent) >> instructions makes no difference at all. So you always need to look at >> every instruction in the loop, before you can do any scheduling. > > I was looking at Agner's document, and it looks like ordering the > instructions in the 4-1-1 or 4-1-1-1 for optimal decoding could work. This > would fit right in with the way the scheduler works. > > I had thought that with the CPU automatically reordering instructions, > that scheduling them was obsolete. Reordering happens in the scheduler. A simple model is "Fetch", "Schedule", "Retire". Fetch and retire are done in program order. For code that is hitting well in the cache, the biggest bottleneck is that "4" decoder (the complex instruction decoder). Reducing the number of complex instructions will be a big win here (and settling them into the 4-1-1(-1) pattern). Of course, on anything after Core 2, the "1" decoders can handle pushes, pops, and load-ops (r+=m) (although not load-op-store (m+=r)). Also, "macro op fusion" allows you to get a branch along with the last instruction in decode, potentially giving you 5 macroinstructions per cycle from decode. Make sure it is the flags producing instruction (cmp-br). (I used to work for Intel :) Ned
Re: tooling quality and some random rant
Don wrote: Walter Bright wrote: Don wrote: In hand-coded asm, instruction scheduling still gives more than half of the same benefit that it used to do. But, it's become ten times more difficult. You have to use Agner Fog's manuals, not Intel/AMD. For example: (1) a common bottleneck on all Intel processors, is that you can only read from three registers per cycle, but you can also read from any register which has been modified in the last three cycles. (2) it's important to break dependency chains. On the BigInt code, instruction scheduling gave a speedup of ~40%. Wow. I didn't know that. Do any compilers currently schedule this stuff? Intel probably does. I don't think any others do a very good job. Agner told me that he had had no success in getting compiler vendors to be interested in his work. Well, this one is. In fact, could we get Agner to actively help us out with this? Any chance you want to take a look at cgsched.c? I had great success using the same algorithm for the quite different Pentium and P6 scheduling minutia. That would really be fun. BTW, the current Intel processors are basically the same as Pentium Pro, with a few improvements. The strange thing is, because of all of the reordering that happens, swapping the order of two (non-dependent) instructions makes no difference at all. So you always need to look at every instruction in the loop, before you can do any scheduling. I was looking at Agner's document, and it looks like ordering the instructions in the 4-1-1 or 4-1-1-1 for optimal decoding could work. This would fit right in with the way the scheduler works. I had thought that with the CPU automatically reordering instructions, that scheduling them was obsolete.
Re: tooling quality and some random rant
retard wrote: > Mon, 14 Feb 2011 20:10:47 +0100, Lutger Blijdestijn wrote: > >> retard wrote: >> >>> Mon, 14 Feb 2011 04:44:43 +0200, so wrote: >>> > Unfortunately DMC is always out of the question because the > performance is 10-20 (years) behind competition, fast compilation > won't help it. Can you please give a few links on this? >>> >>> What kind of proof you need then? Just take some existing piece of code >>> with high performance requirements and compile it with dmc. You lose. >>> >>> http://biolpc22.york.ac.uk/wx/wxhatch/wxMSW_Compiler_choice.html >>> http://permalink.gmane.org/gmane.comp.lang.c++.perfometer/37 >>> http://lists.boost.org/boost-testing/2005/06/1520.php >>> http://www.digitalmars.com/d/archives/c++/chat/66.html >>> http://www.drdobbs.com/cpp/184405450 >>> >>> >> That is ridiculous, have you even bothered to read your own links? In >> some of them dmc wins, others the differences are minimal and for all of >> them dmc is king in compilation times. > > DMC doesn't clearly win in any of the tests and these are merely some > naive examples I found by doing 5 minutes of googling. Seriously, take a > closer look - the gcc version is over 5 years old. Nobody even bothers > doing dmc benchmarks anymore, dmc is so out of the league. I repeat, this > was about performance of the generated binaries, not compile times. > > Like I said: take some existing piece of code with high performance > requirements and compile it with dmc. You lose. I honestly don't get what > I need to prove here. Since you have no clue, presumably you aren't even > using dmc and won't be considering it. You go on ranting about dmc as if it is dwarfed by other compilers (which it might very well be), then provide 'proof' that doesn't prove this at all and now I must be convinced that it's because the other compilers are so old? You lose. You don't have to prove anything, but when you do, don't do it with dubious and inconclusive benchmarks. That's all.
Re: tooling quality and some random rant
bearophile wrote: Walter: Huh, I simply could never find a document about how to use those which gave me any comfortable sense that the author knew what he was talking about.< http://www.agner.org/optimize/ -- Don: A problem with that, is that the prefetching instructions are vendor-specific.< Right. Then I suggest some higher-level annotations (pragmas?) that the programmer uses to better state the temporal semantics of memory accesses in a performance-critical part of D code. Also, it's quite difficult to use them correctly. If you put them in the wrong place, or use them too much, they slow your code down.< CPU caches have a simple purpose. Light speed is finite (how much distance does light travel in vacuum/doped silicon during a clock cycle of a 5 GHz POWER6 CPU? http://en.wikipedia.org/wiki/POWER6 ), and finding one thing among many things is slower than finding it among a few. So you speed up your memory accesses if you read information from a smaller group of data located closer to you. Most CPUs don't have a small, faster memory that you manage yourself (http://en.wikipedia.org/wiki/Scratchpad_RAM ), the CPUs copy data from/to cache levels by themselves, so on such CPUs the illusion of a flat memory is at the hardware level, not just at the C language level. Caches manage their memory in a few different ways; often bigger CPUs offer ways to alter such behaviour a little, using special instructions. The main difference is how they keep coherence across different core caches and in what situations they store back data from the cache to RAM. I think you may be confusing prefetch instructions with non-temporal stores. The problem with prefetch instructions, is that they interfere with the hardware prefetch mechanism. The hardware prefetch is actually very good, and it's only under specific circumstances that a manual prefetch can beat it. I think it's unlikely that you can use prefetching beneficially, unless you've looked at the generated asm code. In some cases in your program you want to read from an array, and store data into it and into another one too, but you never want to store far-away data in the first one. There are a few other common patterns of memory usage. In theory a normal language like Fortran is enough to specify what memory you want to read or write and when you want to do it. In practice today compilers are not so good at inferring such semantics, so some high-level annotations probably help. In the future maybe compilers will get better, so they will ignore those annotations, just like they often ignore "register" annotations. Since system-level programming languages are practical things, adding annotations is not bad, even if 5-10 years later those annotations become less useful. Here you're definitely talking about non-temporal stores. Yes, there is some chance that an annotation for non-temporal stores could be beneficial.
Re: tooling quality and some random rant
Walter Bright wrote: Don wrote: In hand-coded asm, instruction scheduling still gives more than half of the same benefit that it used to do. But, it's become ten times more difficult. You have to use Agner Fog's manuals, not Intel/AMD. For example: (1) a common bottleneck on all Intel processors, is that you can only read from three registers per cycle, but you can also read from any register which has been modified in the last three cycles. (2) it's important to break dependency chains. On the BigInt code, instruction scheduling gave a speedup of ~40%. Wow. I didn't know that. Do any compilers currently schedule this stuff? Intel probably does. I don't think any others do a very good job. Agner told me that he had had no success in getting compiler vendors to be interested in his work. Any chance you want to take a look at cgsched.c? I had great success using the same algorithm for the quite different Pentium and P6 scheduling minutia. That would really be fun. BTW, the current Intel processors are basically the same as Pentium Pro, with a few improvements. The strange thing is, because of all of the reordering that happens, swapping the order of two (non-dependent) instructions makes no difference at all. So you always need to look at every instruction in the loop, before you can do any scheduling.
Re: tooling quality and some random rant
Walter: >Huh, I simply could never find a document about how to use those which gave me >any comfortable sense that the author knew what he was talking about.< http://www.agner.org/optimize/ -- Don: >A problem with that, is that the prefetching instructions are vendor-specific.< Right. Then I suggest some higher-level annotations (pragmas?) that the programmer uses to better state the temporal semantics of memory accesses in a performance-critical part of D code. >Also, it's quite difficult to use them correctly. If you put them in the wrong >place, or use them too much, they slow your code down.< CPU caches have a simple purpose. Light speed is finite (how much distance does light travel in vacuum/doped silicon during a clock cycle of a 5 GHz POWER6 CPU? http://en.wikipedia.org/wiki/POWER6 ), and finding one thing among many things is slower than finding it among a few. So you speed up your memory accesses if you read information from a smaller group of data located closer to you. Most CPUs don't have a small, faster memory that you manage yourself (http://en.wikipedia.org/wiki/Scratchpad_RAM ), the CPUs copy data from/to cache levels by themselves, so on such CPUs the illusion of a flat memory is at the hardware level, not just at the C language level. Caches manage their memory in a few different ways; often bigger CPUs offer ways to alter such behaviour a little, using special instructions. The main difference is how they keep coherence across different core caches and in what situations they store back data from the cache to RAM. In some cases in your program you want to read from an array, and store data into it and into another one too, but you never want to store far-away data in the first one. There are a few other common patterns of memory usage. In theory a normal language like Fortran is enough to specify what memory you want to read or write and when you want to do it. In practice today compilers are not so good at inferring such semantics, so some high-level annotations probably help. In the future maybe compilers will get better, so they will ignore those annotations, just like they often ignore "register" annotations. Since system-level programming languages are practical things, adding annotations is not bad, even if 5-10 years later those annotations become less useful. Bye, bearophile
Re: tooling quality and some random rant
On 02/15/2011 03:47 AM, bearophile wrote: Don: But still, cache effects are more important than instruction scheduling in 99% of cases. I agree. CPUs have prefetching instructions, but D doesn't expose them as intrinsics. A bit more higher level visibility for those instructions may be positive today. Being D a system language, another possible idea is to partially unveil what's under the "array as a random access memory" illusion. By the way, what does D rewrite: foreach (e ; array) { f(e); } to? I would guess something along the lines of: auto p = array.ptr; while (notAtEnd) { f(*p); ++p; } ? Denis -- _ vita es estrany spir.wikidot.com
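For reference, array foreach is commonly described as lowering to an index-based loop along these lines; this is only a sketch consistent with the language definition, not a claim about the exact code dmd emits (the optimizer may well turn it into the pointer-bumping form guessed at above):

    import std.stdio;

    void f(int e) { writeln(e); }

    void main()
    {
        int[] array = [1, 2, 3];

        // roughly what foreach (e; array) f(e); expands to
        for (size_t i = 0; i < array.length; ++i)
        {
            auto e = array[i];
            f(e);
        }
    }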
Re: tooling quality and some random rant
On Mon, 14 Feb 2011 15:03:01 -0500, Steven Schveighoffer wrote: > I think linker errors in general are one of those things that few people > understand, and most cope with just pattern recognition "Oh, I see > _deh_start, probably forgot main()" with no regards to logic. :) Please get out of my head. :) -Lars
Re: tooling quality and some random rant
bearophile wrote: I agree. CPUs have prefetching instructions, but D doesn't expose them as intrinsics. A bit more higher level visibility for those instructions may be positive today. Huh, I simply could never find a document about how to use those which gave me any comfortable sense that the author knew what he was talking about. The same goes for the memory fence instructions. Talk to 3 experts about them, and you get 3 wildly different answers. The Intel docs are zero help.
Re: tooling quality and some random rant
Don wrote: In hand-coded asm, instruction scheduling still gives more than half of the same benefit that it used to do. But, it's become ten times more difficult. You have to use Agner Fog's manuals, not Intel/AMD. For example: (1) a common bottleneck on all Intel processors, is that you can only read from three registers per cycle, but you can also read from any register which has been modified in the last three cycles. (2) it's important to break dependency chains. On the BigInt code, instruction scheduling gave a speedup of ~40%. Wow. I didn't know that. Do any compilers currently schedule this stuff? Any chance you want to take a look at cgsched.c? I had great success using the same algorithm for the quite different Pentium and P6 scheduling minutia.
Re: tooling quality and some random rant
bearophile wrote: Don: But still, cache effects are more important than instruction scheduling in 99% of cases. I agree. CPUs have prefetching instructions, but D doesn't expose them as intrinsics. A bit more higher level visibility for those instructions may be positive today. A problem with that, is that the prefetching instructions are vendor-specific. Also, it's quite difficult to use them correctly. If you put them in the wrong place, or use them too much, they slow your code down. Being D a system language, another possible idea is to partially unveil what's under the "array as a random access memory" illusion. Memory hierarchy makes array access times quite variable according to what level of the memory pyramid your data is stored into (http://dotnetperls.com/memory-hierarchy ). This is why numeric algorithms that work on large arrays enjoy tiling a lot now. The Chapel language has language-level support for a high level specification of tilings, while Fortran compilers perform some limited forms of tiling by themselves. I think it is impossible to be a modern systems language without some support for memory hierarchy. I think we'll be able to take advantage of D's awesome metaprogramming, to support cache-aware algorithms. As a first step, I added cache size determination to core.cpuid some time ago. We have a long way to go, still.
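To make the tiling idea concrete, here is a minimal loop-tiling sketch in D (the function is hypothetical, and blockSize is a placeholder that would ideally be derived from the cache sizes core.cpuid reports, not a tuned value):

    enum blockSize = 64;    // placeholder tile edge, in elements

    // Transpose an n-by-n matrix one cache-friendly tile at a time.
    void transposeTiled(double[] dst, const(double)[] src, size_t n)
    {
        for (size_t ii = 0; ii < n; ii += blockSize)
            for (size_t jj = 0; jj < n; jj += blockSize)
                for (size_t i = ii; i < ii + blockSize && i < n; ++i)
                    for (size_t j = jj; j < jj + blockSize && j < n; ++j)
                        dst[j * n + i] = src[i * n + j];
    }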
Re: tooling quality and some random rant (PathScale)
> Mon, 14 Feb 2011 13:00:00 -0800, Walter Bright wrote: > > > How about [2]: > > "LTO is quite promising. Actually it is in line or even better with > improvement got from other compilers (pathscale is the most convenient > compiler to check lto separately: lto gave there upto 5% improvement > on SPECFP2000 and 3.5% for SPECInt2000 making compiler about 50% > slower and generated code size upto 30% bigger). LTO in GCC actually > results in significant code reduction which is quite different from > pathscale. That is one of rare cases on my mind when a specific > optimization works actually better in gcc than in other optimizing > compilers." > > [2] http://gcc.gnu.org/ml/gcc/2009-10/msg00155.html PathScale is in the process of making significant improvements to our IPA optimization and welcomes feedback and more testers in March. Please email me directly, whether or not you're a current customer. Thanks! Christopher
Re: tooling quality and some random rant
Don: > But still, cache effects are more important than instruction scheduling > in 99% of cases. I agree. CPUs have prefetching instructions, but D doesn't expose them as intrinsics. A bit more higher level visibility for those instructions may be positive today. Being D a system language, another possible idea is to partially unveil what's under the "array as a random access memory" illusion. Memory hierarchy makes array access times quite variable according to what level of the memory pyramid your data is stored into (http://dotnetperls.com/memory-hierarchy ). This is why numeric algorithms that work on large arrays enjoy tiling a lot now. The Chapel language has language-level support for a high level specification of tilings, while Fortran compilers perform some limited forms of tiling by themselves. Bye, bearophile
Re: tooling quality and some random rant
Walter Bright wrote: retard wrote: > There are no arch specific optimizations for PIII, Pentium 4, Pentium D, Core, Core 2, Core i7, Core i7 2600K, and similar kinds of products from AMD. The optimal instruction sequences varied dramatically on those earlier processors, but not so much at all on the later ones. Reading the latest Intel/AMD instruction set references doesn't even provide that information anymore. In particular, instruction scheduling no longer seems to matter, except for the Intel Atom, which benefits very much from Pentium style instruction scheduling. Ironically, dmc++ is the only available current compiler which supports that. In hand-coded asm, instruction scheduling still gives more than half of the same benefit that it used to do. But, it's become ten times more difficult. You have to use Agner Fog's manuals, not Intel/AMD. For example: (1) a common bottleneck on all Intel processors, is that you can only read from three registers per cycle, but you can also read from any register which has been modified in the last three cycles. (2) it's important to break dependency chains. On the BigInt code, instruction scheduling gave a speedup of ~40%. But still, cache effects are more important than instruction scheduling in 99% of cases. No mention of auto-vectorization dmc doesn't do auto-vectorization. I agree that's an issue. > or whole program I looked into that, there's not a lot of oil in that well. > and instruction level optimizations the very latest GCC and LLVM are now slowly adopting. Huh? Every compiler in existence has done, and always has done, instruction level optimizations. Note: a lot of modern compilers expend tremendous effort optimizing access to global variables (often screwing up multithreaded code in the process). I've always viewed this as a crock, since modern programming style eschews globals as much as possible.
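A small D sketch of the dependency-chain point (the functions and data are hypothetical): the single-accumulator loop forms one serial chain of additions, while splitting the sum across two independent accumulators gives the out-of-order core work it can overlap.

    // One long dependency chain: every add waits on the previous one.
    ulong sumChained(const(ulong)[] a)
    {
        ulong s = 0;
        foreach (x; a)
            s += x;
        return s;
    }

    // Two independent chains that can execute in parallel.
    ulong sumSplit(const(ulong)[] a)
    {
        ulong s0 = 0, s1 = 0;
        size_t i = 0;
        for (; i + 1 < a.length; i += 2)
        {
            s0 += a[i];
            s1 += a[i + 1];
        }
        if (i < a.length)
            s0 += a[i];
        return s0 + s1;
    }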
Re: tooling quality and some random rant
On 2/14/11 3:22 PM, retard wrote: Your obsession with fast compile times is incomprehensible. It doesn't have any relevance in the projects I'm talking about. On multicore 'make - jN', distcc & low cost clusters, and incremental compilation already mitigate most of the issues. LLVM is also supposed to compile large projects faster than the 'legacy' gcc. There are also faster linkers than GNU ld. If you're really obsessed with compile times, there are far better languages such as D. The extensive optimizations and fast compile times have an inverse correlation. Of course your compiler compiles faster if it optimizes less. What's the point here? All your examples and stories are from 1980's and 1990's. Any idea how well dmc fares against latest Intel / Microsoft / GNU compilers? I work on a >1M LOC C++ project, using distcc with 4 nodes and ccache. Unfortunately, it is not enough. Yes, there are various cases where runtime performance matters a lot. But compile time performance of C++ is a huge problem. I am glad that Walter cares about this. The point about optimizations vs compile time seems to be a valid one. However, even without optimizations turned on gcc sucks big time w.r.t. compilation time. And most of the time is being spent in parsing a gazillion headers. I did not have a chance to work with Intel's and MS's compilers.
Re: tooling quality and some random rant
retard wrote: Mon, 14 Feb 2011 13:00:00 -0800, Walter Bright wrote: In particular, instruction scheduling no longer seems to matter, except for the Intel Atom, which benefits very much from Pentium style instruction scheduling. Ironically, dmc++ is the only available current compiler which supports that. I can't see how dmc++ is the only available current compiler which supports that. For example this article (April 15, 2010) [1] tells: "The GCC 4.5 announcement was made at GNU.org. Changes from GCC 4.4, which was released almost one year ago, include the * use of the MPC library to evaluate complex arithmetic at compile time * C++0x improvements * automatic parallelization as part of Graphite * support for new ARM processors * Intel Atom optimizations and tuning support, and * AMD Orochi optimizations too" GCC has supported i586 scheduling as long as I can remember. "Optimizations and tuning support" is not necessarily scheduling. dmc specifically does scheduling for the U and V pipes on the Pentium, and does a near perfect job of it (better than any other compiler of the time that I checked, most of which didn't even attempt it). The only way to tell if a compiler does it is by trying it and examining the emitted instructions. Reading the marketing literature isn't good enough. [1] http://www.phoronix.com/scan.php?page=news_item&px=ODE1Ng > or whole program I looked into that, there's not a lot of oil in that well. How about [2]: "LTO is quite promising. Actually it is in line or even better with improvement got from other compilers (pathscale is the most convenient compiler to check lto separately: lto gave there upto 5% improvement on SPECFP2000 and 3.5% for SPECInt2000 making compiler about 50% slower and generated code size upto 30% bigger). LTO in GCC actually results in significant code reduction which is quite different from pathscale. That is one of rare cases on my mind when a specific optimization works actually better in gcc than in other optimizing compilers." [2] http://gcc.gnu.org/ml/gcc/2009-10/msg00155.html LTO is different from whole program analysis. BTW, you can sometimes get dramatic speedups by running the dmc profiler, and then feeding the .def file it generates back into the linker. This will reorder the code for optimum speed. That is LTO, but is not whole program optimization. C++'s compilation model thwarts true whole program analysis at every step. D, on the other hand, is designed to support it. dmd has some initial support for that, as it will inline code from across any modules you hand it the source for. In my opinion the up to 5% improvement is pretty good compared to advances in typical minor compiler version upgrades. For example [3]: "The Fortran-written NAS Parallel Benchmarks from NASA with the LU.A test is running significantly faster with GCC 4.5. This new compiler is causing NAS LU.A to run 15% better than the other tested GCC releases." Yes, 5% is a decent improvement. You'd have to look closer to see where the improvement is coming from, though, to draw any useful conclusions. It could be (and this happens) one single tweak of one expression node that was crappily written in the first place. [3] http://www.phoronix.com/scan.php?page=article&item=gcc_45_benchmarks&num=6 > and instruction level optimizations the very latest GCC and LLVM are > now slowly adopting. Huh? Every compiler in existence has done, and always has done, instruction level optimizations.
I don't know this area well enough, but here is a list of optimizations it does http://llvm.org/docs/Passes.html - from what I've read, GNU GCC doesn't implement all of these. Every compiler implements a list of those, and those lists vary a lot from compiler to compiler. dmc probably has a thousand of those patterns embedded in it that it specifically recognizes. Note: a lot of modern compilers expend tremendous effort optimizing access to global variables (often screwing up multithreaded code in the process). I've always viewed this as a crock, since modern programming style eschews globals as much as possible. I only know that modern C/C++ compilers are doing more and more things automatically. And that might soon include automatic vectorization + multithreading of some computationally intensive code via OpenMP. D is actually far friendlier to vectorization than C/C++ are.
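A small illustration of that last point (a hypothetical function, not code from the thread): a D array operation states the elementwise intent directly and leaves no pointer aliasing for the backend to disprove before it vectorizes.

    // y[i] += x[i] * a for every element; slice lengths must match at runtime.
    void saxpy(float[] y, const(float)[] x, float a)
    {
        y[] += x[] * a;
    }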
Re: tooling quality and some random rant
Mon, 14 Feb 2011 13:00:00 -0800, Walter Bright wrote: > In particular, instruction scheduling no longer seems to matter, except > for the Intel Atom, which benefits very much from Pentium style > instruction scheduling. Ironically, dmc++ is the only available current > compiler which supports that. I can't see how dmc++ is the only available current compiler which supports that. For example this article (April 15, 2010) [1] tells: "The GCC 4.5 announcement was made at GNU.org. Changes from GCC 4.4, which was released almost one year ago, include the * use of the MPC library to evaluate complex arithmetic at compile time * C++0x improvements * automatic parallelization as part of Graphite * support for new ARM processors * Intel Atom optimizations and tuning support, and * AMD Orochi optimizations too" GCC has supported i586 scheduling as long as I can remember. [1] http://www.phoronix.com/scan.php?page=news_item&px=ODE1Ng > > or whole program > > I looked into that, there's not a lot of oil in that well. How about [2]: "LTO is quite promising. Actually it is in line or even better with improvement got from other compilers (pathscale is the most convenient compiler to check lto separately: lto gave there upto 5% improvement on SPECFP2000 and 3.5% for SPECInt2000 making compiler about 50% slower and generated code size upto 30% bigger). LTO in GCC actually results in significant code reduction which is quite different from pathscale. That is one of rare cases on my mind when a specific optimization works actually better in gcc than in other optimizing compilers." [2] http://gcc.gnu.org/ml/gcc/2009-10/msg00155.html In my opinion the up to 5% improvement is pretty good compared to advances in typical minor compiler version upgrades. For example [3]: "The Fortran-written NAS Parallel Benchmarks from NASA with the LU.A test is running significantly faster with GCC 4.5. This new compiler is causing NAS LU.A to run 15% better than the other tested GCC releases." [3] http://www.phoronix.com/scan.php?page=article&item=gcc_45_benchmarks&num=6 > > and instruction level optimizations the very latest GCC and LLVM are > > now > slowly adopting. > > Huh? Every compiler in existence has done, and always has done, > instruction level optimizations. I don't know this area well enough, but here is a list of optimizations it does http://llvm.org/docs/Passes.html - from what I've read, GNU GCC doesn't implement all of these. > Note: a lot of modern compilers expend tremendous effort optimizing > access to global variables (often screwing up multithreaded code in the > process). I've always viewed this as a crock, since modern programming > style eschews globals as much as possible. I only know that modern C/C++ compilers are doing more and more things automatically. And that might soon include automatic vectorization + multithreading of some computationally intensive code via OpenMP.
Re: tooling quality and some random rant
Nick Sabalausky wrote: If it isn't already, maybe all this should be mentioned on the D site. Maybe you're right.
Re: tooling quality and some random rant
On 2011-02-14 21:43, Nick Sabalausky wrote: "Jacob Carlborg" wrote in message news:ijbtpv$61a$1...@digitalmars.com... On 2011-02-13 23:38, spir wrote: On 02/13/2011 10:35 PM, Nick Sabalausky wrote: "spir" wrote in message news:mailman.1602.1297626622.4748.digitalmar...@puremagic.com... Also, I really miss a D for D lexical- syntactic- semantic- analyser that would produce D data structures. This would open the door to hordes of projects, including tool chain elements, meta-studies on D, improvements of these basic tools (efficiency, semantic analysis), development of back-ends (including studies on compiler optimisation specific to D's semantics), etc. Even more important, the whole community, which is imo rather high-level, would be able to take part in such challenges, in their favorite language. Isn't it ironic D depends so much on C++, while many programmers come to D fed up with this language, precisely? DDMD: http://www.dsource.org/projects/ddmd Definitely a good thing, and more! :-) Thank you for the pointer, Nick. I will skim across the project as soon as I have some hours free. And see if --with my very limited competence in the domain-- I can contribute in any way. I have an idea for a side-feature if I can understand the produced AST: generate Types as D data structures on request (--meta), write them into a plain D module to be imported on need. A major aspect, I guess, of the 'meta' namespace discussed on this list. Denis Currently it doesn't compile on Posix, and never has as far as I know. That's one thing you can help with if you want to. Don't know the status on Windows It compiles fine on Windows. Some of the last few commits were related to compiling on Linux and OSX. Does the latest version still not work? No, if I was the last one who did those commits. Since then a few necessary bugs have been fixed in DMD. -- /Jacob Carlborg
Re: tooling quality and some random rant
"Walter Bright" wrote in message news:ijc4fk$iv3$1...@digitalmars.com... > > I hear stuff about how dmc should catch up with LLVM and do modern things > like data flow analysis, yet dmc has done data flow analysis since 1985. I > also hear that dmc should do named return value optimization, not > realizing that dmc *invented* named return value optimization and has done > it since 1991. These claims are clearly made simply based on assumptions > and reading the marketing literature of other compilers. > If it isn't already, maybe all this should be mentioned on the D site.
Re: tooling quality and some random rant
retard wrote: > There are no arch specific optimizations for PIII, Pentium 4, Pentium D, Core, Core 2, Core i7, Core i7 2600K, and similar kinds of products from AMD. The optimal instruction sequences varied dramatically on those earlier processors, but not so much at all on the later ones. Reading the latest Intel/AMD instruction set references doesn't even provide that information anymore. In particular, instruction scheduling no longer seems to matter, except for the Intel Atom, which benefits very much from Pentium style instruction scheduling. Ironically, dmc++ is the only available current compiler which supports that. No mention of auto-vectorization dmc doesn't do auto-vectorization. I agree that's an issue. > or whole program I looked into that, there's not a lot of oil in that well. > and instruction level optimizations the very latest GCC and LLVM are now slowly adopting. Huh? Every compiler in existence has done, and always has done, instruction level optimizations. Note: a lot of modern compilers expend tremendous effort optimizing access to global variables (often screwing up multithreaded code in the process). I've always viewed this as a crock, since modern programming style eschews globals as much as possible.
Re: tooling quality and some random rant
retard wrote: Your obsession with fast compile times is incomprehensible. Yet people complain about excessive compile times with C++ all the time, such as overnight builds. Quite a few dmc++ customers stick with it because of compile times. It doesn't have any relevance in the projects I'm talking about. It's relevant when you make claims you cannot create fast code with dmc, since dmc is itself built with dmc. The extensive optimizations and fast compile times have an inverse correlation. Of course your compiler compiles faster if it optimizes less. What's the point here? It compiles far faster for debug builds, too. That is directly relevant to productivity in the edit/compile/debug loop. It also makes a big difference to me that I can run the test suite in half an hour rather than an hour. It means I'll be less tempted to skip running the suite. All your examples and stories are from 1980's and 1990's. Any idea how well dmc fares against latest Intel / Microsoft / GNU compilers? Bearophile posted a benchmark last year where he concluded that modern compilers like LLVM beat the pants off of primitive, obsolete compilers like dmc for integer arithmetic. A little investigation showed it had nothing whatsoever to do with the compiler - it was the runtime library implementation of long divide that was the culprit. I corrected that, and the runtimes became indistinguishable. I hear stuff about how dmc should catch up with LLVM and do modern things like data flow analysis, yet dmc has done data flow analysis since 1985. I also hear that dmc should do named return value optimization, not realizing that dmc *invented* named return value optimization and has done it since 1991. These claims are clearly made simply based on assumptions and reading the marketing literature of other compilers. The point is, compiler optimizers hit a wall around 15 years ago. Only tiny improvements have happened since then. (Not considering vectorization, which is a big improvement.) Where dmc needs improvement is in floating point code, particularly in using XMM registers and doing vectorization. dmc does an excellent and competitive job with optimization rewrites, register assignment, scheduling and detail code generation. There's only so much juice you can get out of those grapes.
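For readers unfamiliar with the term, here is a minimal sketch of the pattern named return value optimization targets (the struct and function are invented for illustration):

    struct Matrix
    {
        double[16] m;
    }

    Matrix identity()
    {
        Matrix r;       // the named return value
        r.m[] = 0.0;
        foreach (i; 0 .. 4)
            r.m[i * 4 + i] = 1.0;
        return r;       // with NRVO, r is built directly in the caller's slot, eliding the copy
    }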
Re: tooling quality and some random rant
"Jacob Carlborg" wrote in message news:ijbtpv$61a$1...@digitalmars.com... > On 2011-02-13 23:38, spir wrote: >> On 02/13/2011 10:35 PM, Nick Sabalausky wrote: >>> "spir" wrote in message >>> news:mailman.1602.1297626622.4748.digitalmar...@puremagic.com... Also, I really miss a D for D lexical- syntactic- semantic- analyser that would produce D data structures. This would open the door hoards of projects, including tool chain elements, meta-studies on D, improvements of these basic tools (efficiency, semantis analysis), decelopment of back-ends (including studies on compiler optimisation specific to D's semantics), etc. Even more important, the whole cummunity, which is imo rather high-level, would be able to take part to such challenges, in their favorite language. Isn't is ironic D depends so much on C++, while many programmers come to D fed up with this language, presicely? >>> >>> DDMD: http://www.dsource.org/projects/ddmd >> >> Definitely a good thing, and more! :-) Thank your for the pointer, Nick. >> I will skim across the project as soon as I have some hours free. And >> see if --with my very limited competence in the domain-- I can >> contribute in any way. >> I have an idea for a side-feature if I can understand the produced AST: >> generate Types as D data structures on request (--meta), write them into >> a plain D module to be imported on need. A major aspect, I guess, of the >> 'meta' namespace discussed on this list. >> >> Denis > > Currently it doesn't compile on Posix, and never has as far as I know. > That's one thing you can help with if you want to. Don't know the status > on Windows > It compiles fine on Windows. Some of the last few commits were related to compling on Linux and OSX. Does the latest version still not work?
Re: tooling quality and some random rant
Jacob Carlborg wrote: Done: http://d.puremagic.com/issues/show_bug.cgi?id=5577 Thank you.
Re: tooling quality and some random rant
Mon, 14 Feb 2011 11:38:50 -0800, Walter Bright wrote: > Lutger Blijdestijn wrote: >> retard wrote: >> >>> Mon, 14 Feb 2011 04:44:43 +0200, so wrote: >>> > Unfortunately DMC is always out of the question because the > performance is 10-20 (years) behind competition, fast compilation > won't help it. Can you please give a few links on this? >>> What kind of proof you need then? Just take some existing piece of >>> code with high performance requirements and compile it with dmc. You >>> lose. >>> >>> http://biolpc22.york.ac.uk/wx/wxhatch/wxMSW_Compiler_choice.html >>> http://permalink.gmane.org/gmane.comp.lang.c++.perfometer/37 >>> http://lists.boost.org/boost-testing/2005/06/1520.php >>> http://www.digitalmars.com/d/archives/c++/chat/66.html >>> http://www.drdobbs.com/cpp/184405450 >>> >>> >> That is ridiculous, have you even bothered to read your own links? In >> some of them dmc wins, others the differences are minimal and for all >> of them dmc is king in compilation times. > > > People tend to see what they want to see. There was a computer magazine > roundup in the late 1980's where they benchmarked a dozen or so > compilers. The text enthusiastically declared Borland to be the fastest > compiler, while their own benchmark tables clearly showed Zortech as > winning across the board. > > The ironic thing about retard not recommending dmc for fast code is dmc > is built using dmc, and dmc is *far* faster at compiling than any of the > others. Your obsession with fast compile times is incomprehensible. It doesn't have any relevance in the projects I'm talking about. On multicore 'make - jN', distcc & low cost clusters, and incremental compilation already mitigate most of the issues. LLVM is also supposed to compile large projects faster than the 'legacy' gcc. There are also faster linkers than GNU ld. If you're really obsessed with compile times, there are far better languages such as D. The extensive optimizations and fast compile times have an inverse correlation. Of course your compiler compiles faster if it optimizes less. What's the point here? All your examples and stories are from 1980's and 1990's. Any idea how well dmc fares against latest Intel / Microsoft / GNU compilers?
Re: tooling quality and some random rant
Mon, 14 Feb 2011 20:10:47 +0100, Lutger Blijdestijn wrote: > retard wrote: > >> Mon, 14 Feb 2011 04:44:43 +0200, so wrote: >> Unfortunately DMC is always out of the question because the performance is 10-20 (years) behind competition, fast compilation won't help it. >>> >>> Can you please give a few links on this? >> >> What kind of proof you need then? Just take some existing piece of code >> with high performance requirements and compile it with dmc. You lose. >> >> http://biolpc22.york.ac.uk/wx/wxhatch/wxMSW_Compiler_choice.html >> http://permalink.gmane.org/gmane.comp.lang.c++.perfometer/37 >> http://lists.boost.org/boost-testing/2005/06/1520.php >> http://www.digitalmars.com/d/archives/c++/chat/66.html >> http://www.drdobbs.com/cpp/184405450 >> >> > That is ridiculous, have you even bothered to read your own links? In > some of them dmc wins, others the differences are minimal and for all of > them dmc is king in compilation times. DMC doesn't clearly win in any of the tests and these are merely some naive examples I found by doing 5 minutes of googling. Seriously, take a closer look - the gcc version is over 5 years old. Nobody even bothers doing dmc benchmarks anymore, dmc is so out of the league. I repeat, this was about performance of the generated binaries, not compile times. Like I said: take some existing piece of code with high performance requirements and compile it with dmc. You lose. I honestly don't get what I need to prove here. Since you have no clue, presumably you aren't even using dmc and won't be considering it. Just take a look at the command line parameters: -[0|2|3|4|5|6] 8088/286/386/486/Pentium/P6 code There are no arch specific optimizations for PIII, Pentium 4, Pentium D, Core, Core 2, Core i7, Core i7 2600K, and similar kinds of products from AMD. No mention of auto-vectorization or whole program and instruction level optimizations the very latest GCC and LLVM are now slowly adopting.
Re: tooling quality and some random rant
On Mon, 14 Feb 2011 14:24:05 -0500, Andrej Mitrovic wrote: I think this void main() issue is blown out of proportion. They'll see the error message once, and they won't know what it means. Ok. But the second time, they'll know. No start address == no main. Maybe the linker should just add another line saying that you might be missing main, and that's it. You guys want to rewrite the compiler for this one silly issue, come on! No, not at all (at least for me). I'm just pointing out that the error that occurs when main is missing (probably one of the more common linker errors) is far more confusing in D than it is in C++. That doesn't mean D is unusable, or Walter should drop everything and fix this problem, or that C++ is better. It's just an observation. I think linker errors in general are one of those things that few people understand, and most cope with just pattern recognition "Oh, I see _deh_start, probably forgot main()" with no regards to logic. :) "Fixing" the linker so it suggests the right thing is likely impossible because the linker doesn't know where everything is or what one must include in order to satisfy it. That being said, fixing the linker so it demangles symbols would make the errors 10x easier to understand. -Steve
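Until the linker does that, a mangled name can be demangled by hand with druntime's core.demangle; a minimal sketch, using one of the symbols from the linker output quoted elsewhere in the thread:

    import core.demangle : demangle;
    import std.stdio;

    void main()
    {
        auto sym = "_D2rt6dmain24mainUiPPaZi7runMainMFZv";
        writeln(demangle(sym));   // should print the human-readable D signature
    }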
Re: tooling quality and some random rant
retard wrote: Mon, 14 Feb 2011 10:01:53 -0800, Walter Bright wrote: retard wrote: Mon, 14 Feb 2011 04:44:43 +0200, so wrote: Unfortunately DMC is always out of the question because the performance is 10-20 (years) behind competition, fast compilation won't help it. Can you please give a few links on this? What kind of proof you need then? Just take some existing piece of code with high performance requirements and compile it with dmc. You lose. http://biolpc22.york.ac.uk/wx/wxhatch/wxMSW_Compiler_choice.html http://permalink.gmane.org/gmane.comp.lang.c++.perfometer/37 That link shows dmc winning. No, it doesn't. In the Fib-5 test where the optimizations bring largest improvements in wall clock time, g++ 3.3.1, vc++7, bc++ 5.5.1, and icc are all faster with optimized settings. And dmc is faster with Fib-25000. This test is a joke anyway. You picked these benchmarks, not me.
Re: tooling quality and some random rant
On 2011-02-14 19:07, Walter Bright wrote: Jacob Carlborg wrote: I agree with you here except for the last sentence. Please stop saying it's ok just because it's ok in C/C++. I bring that up because the thread started with the implication that D was worse than C/C++ in this regard. Fair enough. -- /Jacob Carlborg
Re: tooling quality and some random rant
On 2011-02-14 18:55, Walter Bright wrote: Jacob Carlborg wrote: On 2011-02-13 18:36, Andrej Mitrovic wrote: Could you elaborate on that? Aren't .di files supposed to be auto-generated by the compiler, and not hand-written? Yes, but they don't always work. Where they don't work, please file bug reports to bugzilla. Done: http://d.puremagic.com/issues/show_bug.cgi?id=5577 -- /Jacob Carlborg
Re: tooling quality and some random rant
Lutger Blijdestijn wrote: retard wrote: Mon, 14 Feb 2011 04:44:43 +0200, so wrote: Unfortunately DMC is always out of the question because the performance is 10-20 (years) behind competition, fast compilation won't help it. Can you please give a few links on this? What kind of proof you need then? Just take some existing piece of code with high performance requirements and compile it with dmc. You lose. http://biolpc22.york.ac.uk/wx/wxhatch/wxMSW_Compiler_choice.html http://permalink.gmane.org/gmane.comp.lang.c++.perfometer/37 http://lists.boost.org/boost-testing/2005/06/1520.php http://www.digitalmars.com/d/archives/c++/chat/66.html http://www.drdobbs.com/cpp/184405450 That is ridiculous, have you even bothered to read your own links? In some of them dmc wins, others the differences are minimal and for all of them dmc is king in compilation times. People tend to see what they want to see. There was a computer magazine roundup in the late 1980's where they benchmarked a dozen or so compilers. The text enthusiastically declared Borland to be the fastest compiler, while their own benchmark tables clearly showed Zortech as winning across the board. The ironic thing about retard not recommending dmc for fast code is dmc is built using dmc, and dmc is *far* faster at compiling than any of the others.
Re: tooling quality and some random rant
Mon, 14 Feb 2011 10:01:53 -0800, Walter Bright wrote: > retard wrote: >> Mon, 14 Feb 2011 04:44:43 +0200, so wrote: >> Unfortunately DMC is always out of the question because the performance is 10-20 (years) behind competition, fast compilation won't help it. >>> Can you please give a few links on this? >> >> What kind of proof you need then? Just take some existing piece of code >> with high performance requirements and compile it with dmc. You lose. >> >> http://biolpc22.york.ac.uk/wx/wxhatch/wxMSW_Compiler_choice.html >> http://permalink.gmane.org/gmane.comp.lang.c++.perfometer/37 > > That link shows dmc winning. No, it doesn't. In the Fib-5 test where the optimizations bring largest improvements in wall clock time, g++ 3.3.1, vc++7, bc++ 5.5.1, and icc are all faster with optimized settings. This test is a joke anyway. I wouldn't pick a compiler for video transcoding based on some Fib-1 results, seriously.
Re: tooling quality and some random rant
I think this void main() issue is blown out of proportion. They'll see the error message once, and they won't know what it means. Ok. But the second time, they'll know. No start address == no main. Maybe the linker should just add another line saying that you might be missing main, and that's it. You guys want to rewrite the compiler for this one silly issue, come on!
Re: tooling quality and some random rant
retard wrote: > Mon, 14 Feb 2011 04:44:43 +0200, so wrote: > >>> Unfortunately DMC is always out of the question because the performance >>> is 10-20 (years) behind competition, fast compilation won't help it. >> >> Can you please give a few links on this? > > What kind of proof you need then? Just take some existing piece of code > with high performance requirements and compile it with dmc. You lose. > > http://biolpc22.york.ac.uk/wx/wxhatch/wxMSW_Compiler_choice.html > http://permalink.gmane.org/gmane.comp.lang.c++.perfometer/37 > http://lists.boost.org/boost-testing/2005/06/1520.php > http://www.digitalmars.com/d/archives/c++/chat/66.html > http://www.drdobbs.com/cpp/184405450 > That is ridiculous, have you even bothered to read your own links? In some of them dmc wins, others the differences are minimal and for all of them dmc is king in compilation times.
Re: tooling quality and some random rant
On 02/14/2011 06:54 PM, Steven Schveighoffer wrote: On Sun, 13 Feb 2011 14:12:02 -0500, Walter Bright wrote: Vladimir Panteleev wrote: On Sun, 13 Feb 2011 20:26:50 +0200, Walter Bright wrote: golgeliyele wrote: I don't think C++ and gcc set a good bar here. Short of writing our own linker, we're a bit stuck with what ld does. That's not true. The compiler has knowledge of what symbols will be passed to the linker, and can display its own, much nicer error messages. I've mentioned this in our previous discussion on this topic. Not without reading the .o files passed to the linker, and the libraries, and figuring out what would be pulled in from those libraries. In essence, the compiler would have to become a linker. It's not impossible, but is a tremendous amount of work in order to improve one error message, and one error message that generations of C and C++ programmers are comfortable dealing with. I'm not saying that this should be done and is worth the tremendous effort. However, when linking a c++ app without a main, here is what I get: /usr/lib/gcc/i686-linux-gnu/4.4.5/../../../../lib/crt1.o: In function `_start': (.text+0x18): undefined reference to `main' When linking a d app without a main, we get: /home/steves/dmd-2.051/linux/bin/../lib/libphobos2.a(dmain2_517_1a5.o): In function `_D2rt6dmain24mainUiPPaZi7runMainMFZv': src/rt/dmain2.d:(.text._D2rt6dmain24mainUiPPaZi7runMainMFZv+0x16): undefined reference to `_Dmain' /home/steves/dmd-2.051/linux/bin/../lib/libphobos2.a(deh2_4e7_525.o): In function `_D2rt4deh213__eh_finddataFPvZPS2rt4deh213DHandlerTable': src/rt/deh2.d:(.text._D2rt4deh213__eh_finddataFPvZPS2rt4deh213DHandlerTable+0x4): undefined reference to `_deh_beg' src/rt/deh2.d:(.text._D2rt4deh213__eh_finddataFPvZPS2rt4deh213DHandlerTable+0xc): undefined reference to `_deh_beg' src/rt/deh2.d:(.text._D2rt4deh213__eh_finddataFPvZPS2rt4deh213DHandlerTable+0x13): undefined reference to `_deh_end' src/rt/deh2.d:(.text._D2rt4deh213__eh_finddataFPvZPS2rt4deh213DHandlerTable+0x37): undefined reference to `_deh_end' /home/steves/dmd-2.051/linux/bin/../lib/libphobos2.a(thread_eb_258.o): In function `_D4core6thread6Thread6__ctorMFZC4core6thread6Thread': src/core/thread.d:(.text._D4core6thread6Thread6__ctorMFZC4core6thread6Thread+0x1d): undefined reference to `_tlsend' src/core/thread.d:(.text._D4core6thread6Thread6__ctorMFZC4core6thread6Thread+0x24): undefined reference to `_tlsstart' /home/steves/dmd-2.051/linux/bin/../lib/libphobos2.a(thread_ee_6e4.o): In function `thread_attachThis': src/core/thread.d:(.text.thread_attachThis+0x53): undefined reference to `_tlsstart' src/core/thread.d:(.text.thread_attachThis+0x5c): undefined reference to `_tlsend' /home/steves/dmd-2.051/linux/bin/../lib/libphobos2.a(thread_e8_713.o): In function `thread_entryPoint': src/core/thread.d:(.text.thread_entryPoint+0x29): undefined reference to `_tlsend' src/core/thread.d:(.text.thread_entryPoint+0x2f): undefined reference to `_tlsstart' collect2: ld returned 1 exit status --- errorlevel 1 Let's not pretend that generations of c/C++ coders are going to attribute this slew of errors to a missing main function. The first time I see this, I'm going to think I missed something else. I understand that to fix this, we need the linker to be more helpful, or we need to make dmd more helpful. I don't know how much effort it is, or how much it's worth it, I just wanted to point out that your statement about equivalence to C++ is stretching it. I personally think we need to get the linker to demangle symbols better. 
That would go a long way... The "public" problem is not with the (admittedly very bad) error message in itself. The problem imo is that newcomers have a high chance of stumbling on error messages like this (or points of similar friendliness) at the very start of their adventures with D, and thus think D tools just treat programmers that way, and the D community finds this just normal. Oops! I would be happy for dmd to assume the main function is supposed to be located in the very first module passed on the command line, if this can help. What do you think? "Error: cannot find main() function in module 'app.d'." (But this would not solve the case of /multiple/ mains, which happens to me several times a day, namely each time I run an imported module's test suite separately ;-) Denis -- _ vita es estrany spir.wikidot.com
Re: tooling quality and some random rant
On 2011-02-14 00:01, Walter Bright wrote: Michel Fortin wrote: But note I was replying to your reply to Denis who asked specifically for demangled names for missing symbols. This by itself would be a useful improvement. I agree with that, but there's a caveat. I did such a thing years ago for C++ and Optlink. Nobody cared, including the people who asked for that feature. It's a bit demotivating to bother doing that again. Maybe you can give it another try, there's a completely new community here now (I assume). On the other hand, that's unfortunately how people behave. They loudly complain when there's something they don't like and they sit silently when they're happy. -- /Jacob Carlborg
Re: tooling quality and some random rant
On 2011-02-14 00:28, retard wrote: Sun, 13 Feb 2011 15:06:46 -0800, Brad Roberts wrote: On 2/13/2011 3:01 PM, Walter Bright wrote: Michel Fortin wrote: But note I was replying to your reply to Denis who asked specifically for demangled names for missing symbols. This by itself would be a useful improvement. I agree with that, but there's a caveat. I did such a thing years ago for C++ and Optlink. Nobody cared, including the people who asked for that feature. It's a bit demotivating to bother doing that again. No offense, but this argument gets kinda old and it's incredibly weak. Today's tooling expectations are higher. The audience isn't the same. And clearly people are asking for it. Even the past version of it I highly doubt no one cared, you just didn't hear from those that liked it. After all, few people go out of their way to talk about what they like, just what they don't. Half of the readers have already added me to their killfile, but here goes some on-topic humor: http://www.winandmac.com/wp-content/uploads/2010/03/ipad-hp-fail.jpg I had something similar with an attachable keyboard. Sometimes people don't yet know what they want. For example the reason we write portable C++ in some projects is that it's easier to switch between VC++, ICC, GCC, and LLVM. Whichever produces best performing code. Unfortunately DMC is always out of the question because the performance is 10-20 behind competition, fast compilation won't help it. -- /Jacob Carlborg
Re: tooling quality and some random rant
On 2011-02-13 23:38, spir wrote: On 02/13/2011 10:35 PM, Nick Sabalausky wrote: "spir" wrote in message news:mailman.1602.1297626622.4748.digitalmar...@puremagic.com... Also, I really miss a D-for-D lexical, syntactic, and semantic analyser that would produce D data structures. This would open the door to hordes of projects, including tool chain elements, meta-studies on D, improvements of these basic tools (efficiency, semantic analysis), development of back-ends (including studies on compiler optimisation specific to D's semantics), etc. Even more important, the whole community, which is imo rather high-level, would be able to take part in such challenges, in their favorite language. Isn't it ironic that D depends so much on C++, while many programmers come to D fed up with precisely that language? DDMD: http://www.dsource.org/projects/ddmd Definitely a good thing, and more! :-) Thank you for the pointer, Nick. I will skim across the project as soon as I have some hours free, and see if --with my very limited competence in the domain-- I can contribute in any way. I have an idea for a side-feature if I can understand the produced AST: generate Types as D data structures on request (--meta) and write them into a plain D module to be imported when needed. A major aspect, I guess, of the 'meta' namespace discussed on this list. Denis Currently it doesn't compile on Posix, and never has as far as I know. That's one thing you can help with if you want to. Don't know the status on Windows. -- /Jacob Carlborg
Re: tooling quality and some random rant
On Mon, 14 Feb 2011 13:24:26 -0500, Walter Bright wrote: Steven Schveighoffer wrote: On Sun, 13 Feb 2011 14:12:02 -0500, Walter Bright wrote: Vladimir Panteleev wrote: On Sun, 13 Feb 2011 20:26:50 +0200, Walter Bright wrote: golgeliyele wrote: I don't think C++ and gcc set a good bar here. Short of writing our own linker, we're a bit stuck with what ld does. That's not true. The compiler has knowledge of what symbols will be passed to the linker, and can display its own, much nicer error messages. I've mentioned this in our previous discussion on this topic. Not without reading the .o files passed to the linker, and the libraries, and figuring out what would be pulled in from those libraries. In essence, the compiler would have to become a linker. It's not impossible, but is a tremendous amount of work in order to improve one error message, and one error message that generations of C and C++ programmers are comfortable dealing with. I'm not saying that this should be done and is worth the tremendous effort. However, when linking a c++ app without a main, here is what I get: /usr/lib/gcc/i686-linux-gnu/4.4.5/../../../../lib/crt1.o: In function `_start': (.text+0x18): undefined reference to `main' When linking a d app without a main, we get: /home/steves/dmd-2.051/linux/bin/../lib/libphobos2.a(dmain2_517_1a5.o): In function `_D2rt6dmain24mainUiPPaZi7runMainMFZv': src/rt/dmain2.d:(.text._D2rt6dmain24mainUiPPaZi7runMainMFZv+0x16): undefined reference to `_Dmain' /home/steves/dmd-2.051/linux/bin/../lib/libphobos2.a(deh2_4e7_525.o): In function `_D2rt4deh213__eh_finddataFPvZPS2rt4deh213DHandlerTable': src/rt/deh2.d:(.text._D2rt4deh213__eh_finddataFPvZPS2rt4deh213DHandlerTable+0x4): undefined reference to `_deh_beg' src/rt/deh2.d:(.text._D2rt4deh213__eh_finddataFPvZPS2rt4deh213DHandlerTable+0xc): undefined reference to `_deh_beg' src/rt/deh2.d:(.text._D2rt4deh213__eh_finddataFPvZPS2rt4deh213DHandlerTable+0x13): undefined reference to `_deh_end' src/rt/deh2.d:(.text._D2rt4deh213__eh_finddataFPvZPS2rt4deh213DHandlerTable+0x37): undefined reference to `_deh_end' /home/steves/dmd-2.051/linux/bin/../lib/libphobos2.a(thread_eb_258.o): In function `_D4core6thread6Thread6__ctorMFZC4core6thread6Thread': src/core/thread.d:(.text._D4core6thread6Thread6__ctorMFZC4core6thread6Thread+0x1d): undefined reference to `_tlsend' src/core/thread.d:(.text._D4core6thread6Thread6__ctorMFZC4core6thread6Thread+0x24): undefined reference to `_tlsstart' /home/steves/dmd-2.051/linux/bin/../lib/libphobos2.a(thread_ee_6e4.o): In function `thread_attachThis': src/core/thread.d:(.text.thread_attachThis+0x53): undefined reference to `_tlsstart' src/core/thread.d:(.text.thread_attachThis+0x5c): undefined reference to `_tlsend' /home/steves/dmd-2.051/linux/bin/../lib/libphobos2.a(thread_e8_713.o): In function `thread_entryPoint': src/core/thread.d:(.text.thread_entryPoint+0x29): undefined reference to `_tlsend' src/core/thread.d:(.text.thread_entryPoint+0x2f): undefined reference to `_tlsstart' collect2: ld returned 1 exit status --- errorlevel 1 Let's not pretend that generations of c/C++ coders are going to attribute this slew of errors to a missing main function. I understand what you're saying, but experienced C/C++ programmers are used to paying attention only to the first error message :-) Really? I find that in a mess of linker errors, the error isn't always the first line. It doesn't help that the name of the function "missing" is not called main (as it is called in the d source file). 
But like I said, it's not critical -- the error is listed, it's just not as user-friendly as the C++ error. I personally think we need to get the linker to demangle symbols better. That would go a long way... Not for the above messages. I meant demangling things like _D2rt6dmain24mainUiPPaZi7runMainMFZv. Note how the _Dmain is buried among these large symbols. Those seemingly random nonsense symbols make the whole error listing seem unreadable. -Steve
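As an illustration of what such demangling involves, core.demangle (shipped with druntime) can be fed one of these names directly. A minimal sketch, assuming the symbol is copied verbatim from the linker output quoted above; the exact demangled text it prints depends on the druntime version:

    import std.stdio;
    import core.demangle;

    void main()
    {
        // Mangled name taken from the linker output quoted above.
        // demangle() returns the input unchanged if it cannot decode it.
        const(char)[] sym = "_D2rt6dmain24mainUiPPaZi7runMainMFZv";
        writeln(demangle(sym));
    }

Piping the whole linker output through a small filter built on this function is essentially what the ddemangle program posted elsewhere in this thread does.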
Re: tooling quality and some random rant
On 2011-02-13 20:49, Lutger Blijdestijn wrote: gölgeliyele wrote: ... I think what we need here is numbers from a project that everyone has access to. What is the largest D project right now? Can we get numbers on that? How much time does it take to compile that project after a change (assuming we are feeding all .d files at once)? Well you can take phobos, I believe Andrei used it once to compare against Go. With std.datetime it is now also much bigger :) Tango is another large project, I remember someone posted a compilation speed of a couple of seconds (Tango is huge, perhaps 300KLoC). But projects and settings may vary a lot. For sure, optlink is one hell of a speed monster and you might not get similar speeds with ld on a large project. It takes around 12.5 seconds for my machine to build Tango using the bob executable. 2.4Ghz Intel Core 2 Duo 2G RAM Mac OS X 10.6.6 -- /Jacob Carlborg
Re: tooling quality and some random rant
Steven Schveighoffer wrote: On Sun, 13 Feb 2011 14:12:02 -0500, Walter Bright wrote: Vladimir Panteleev wrote: On Sun, 13 Feb 2011 20:26:50 +0200, Walter Bright wrote: golgeliyele wrote: I don't think C++ and gcc set a good bar here. Short of writing our own linker, we're a bit stuck with what ld does. That's not true. The compiler has knowledge of what symbols will be passed to the linker, and can display its own, much nicer error messages. I've mentioned this in our previous discussion on this topic. Not without reading the .o files passed to the linker, and the libraries, and figuring out what would be pulled in from those libraries. In essence, the compiler would have to become a linker. It's not impossible, but is a tremendous amount of work in order to improve one error message, and one error message that generations of C and C++ programmers are comfortable dealing with. I'm not saying that this should be done and is worth the tremendous effort. However, when linking a c++ app without a main, here is what I get: /usr/lib/gcc/i686-linux-gnu/4.4.5/../../../../lib/crt1.o: In function `_start': (.text+0x18): undefined reference to `main' When linking a d app without a main, we get: /home/steves/dmd-2.051/linux/bin/../lib/libphobos2.a(dmain2_517_1a5.o): In function `_D2rt6dmain24mainUiPPaZi7runMainMFZv': src/rt/dmain2.d:(.text._D2rt6dmain24mainUiPPaZi7runMainMFZv+0x16): undefined reference to `_Dmain' /home/steves/dmd-2.051/linux/bin/../lib/libphobos2.a(deh2_4e7_525.o): In function `_D2rt4deh213__eh_finddataFPvZPS2rt4deh213DHandlerTable': src/rt/deh2.d:(.text._D2rt4deh213__eh_finddataFPvZPS2rt4deh213DHandlerTable+0x4): undefined reference to `_deh_beg' src/rt/deh2.d:(.text._D2rt4deh213__eh_finddataFPvZPS2rt4deh213DHandlerTable+0xc): undefined reference to `_deh_beg' src/rt/deh2.d:(.text._D2rt4deh213__eh_finddataFPvZPS2rt4deh213DHandlerTable+0x13): undefined reference to `_deh_end' src/rt/deh2.d:(.text._D2rt4deh213__eh_finddataFPvZPS2rt4deh213DHandlerTable+0x37): undefined reference to `_deh_end' /home/steves/dmd-2.051/linux/bin/../lib/libphobos2.a(thread_eb_258.o): In function `_D4core6thread6Thread6__ctorMFZC4core6thread6Thread': src/core/thread.d:(.text._D4core6thread6Thread6__ctorMFZC4core6thread6Thread+0x1d): undefined reference to `_tlsend' src/core/thread.d:(.text._D4core6thread6Thread6__ctorMFZC4core6thread6Thread+0x24): undefined reference to `_tlsstart' /home/steves/dmd-2.051/linux/bin/../lib/libphobos2.a(thread_ee_6e4.o): In function `thread_attachThis': src/core/thread.d:(.text.thread_attachThis+0x53): undefined reference to `_tlsstart' src/core/thread.d:(.text.thread_attachThis+0x5c): undefined reference to `_tlsend' /home/steves/dmd-2.051/linux/bin/../lib/libphobos2.a(thread_e8_713.o): In function `thread_entryPoint': src/core/thread.d:(.text.thread_entryPoint+0x29): undefined reference to `_tlsend' src/core/thread.d:(.text.thread_entryPoint+0x2f): undefined reference to `_tlsstart' collect2: ld returned 1 exit status --- errorlevel 1 Let's not pretend that generations of c/C++ coders are going to attribute this slew of errors to a missing main function. I understand what you're saying, but experienced C/C++ programmers are used to paying attention only to the first error message :-) I personally think we need to get the linker to demangle symbols better. That would go a long way... Not for the above messages.
Re: tooling quality and some random rant
Jacob Carlborg wrote: I agree with you here except for the last sentence. Please stop saying it's ok just because it's ok in C/C++. I bring that up because the thread started with the implication that D was worse than C/C++ in this regard.
Re: tooling quality and some random rant
Andrej Mitrovic wrote: I've no idea. But Optlink actually has a switch you can use to disable outputting corrupt executables. I've no idea what the use case for this is. It's from the olden days where you could use optlink to create all sorts of specialized binary files, such as ones you'll be blowing into EEPROMs. Those did not have normal start addresses.
Re: tooling quality and some random rant
retard wrote: Mon, 14 Feb 2011 04:44:43 +0200, so wrote: Unfortunately DMC is always out of the question because the performance is 10-20 (years) behind competition, fast compilation won't help it. Can you please give a few links on this? What kind of proof you need then? Just take some existing piece of code with high performance requirements and compile it with dmc. You lose. http://biolpc22.york.ac.uk/wx/wxhatch/wxMSW_Compiler_choice.html http://permalink.gmane.org/gmane.comp.lang.c++.perfometer/37 That link shows dmc winning. http://lists.boost.org/boost-testing/2005/06/1520.php http://www.digitalmars.com/d/archives/c++/chat/66.html http://www.drdobbs.com/cpp/184405450 Many of those are already old. GCC 4.6, LLVM 2.9, and ICC 12 are much faster, especially on multicore hardware. A quick look at the DMC changelog doesn't reveal any significant new optimizations during the past 10 years except some Pentium 4 opcodes and fixes at the library level. I rarely see a benchmark where DMC produces the fastest code. In addition, most open source projects are not compatible with DMC's toolchain out of the box. If execution performance of the generated code is your top priority, I wouldn't recommend using DigitalMars products.
Re: tooling quality and some random rant
Lutger Blijdestijn wrote: Let me take the opportunity to say I care about an unrelated usability feature: the spelling suggestion. However small it's pretty nice so thanks for doing that. I like that one too, I liked it so much I wired it into dmc++ as well!
Re: tooling quality and some random rant
Jacob Carlborg wrote: On 2011-02-13 18:36, Andrej Mitrovic wrote: Could you elaborate on that? Aren't .di files supposed to be auto-generated by the compiler, and not hand-written? Yes, but they don't always work. Where they don't work, please file bug reports to bugzilla.
Re: tooling quality and some random rant
On Sun, 13 Feb 2011 14:12:02 -0500, Walter Bright wrote: Vladimir Panteleev wrote: On Sun, 13 Feb 2011 20:26:50 +0200, Walter Bright wrote: golgeliyele wrote: I don't think C++ and gcc set a good bar here. Short of writing our own linker, we're a bit stuck with what ld does. That's not true. The compiler has knowledge of what symbols will be passed to the linker, and can display its own, much nicer error messages. I've mentioned this in our previous discussion on this topic. Not without reading the .o files passed to the linker, and the libraries, and figuring out what would be pulled in from those libraries. In essence, the compiler would have to become a linker. It's not impossible, but is a tremendous amount of work in order to improve one error message, and one error message that generations of C and C++ programmers are comfortable dealing with. I'm not saying that this should be done and is worth the tremendous effort. However, when linking a c++ app without a main, here is what I get: /usr/lib/gcc/i686-linux-gnu/4.4.5/../../../../lib/crt1.o: In function `_start': (.text+0x18): undefined reference to `main' When linking a d app without a main, we get: /home/steves/dmd-2.051/linux/bin/../lib/libphobos2.a(dmain2_517_1a5.o): In function `_D2rt6dmain24mainUiPPaZi7runMainMFZv': src/rt/dmain2.d:(.text._D2rt6dmain24mainUiPPaZi7runMainMFZv+0x16): undefined reference to `_Dmain' /home/steves/dmd-2.051/linux/bin/../lib/libphobos2.a(deh2_4e7_525.o): In function `_D2rt4deh213__eh_finddataFPvZPS2rt4deh213DHandlerTable': src/rt/deh2.d:(.text._D2rt4deh213__eh_finddataFPvZPS2rt4deh213DHandlerTable+0x4): undefined reference to `_deh_beg' src/rt/deh2.d:(.text._D2rt4deh213__eh_finddataFPvZPS2rt4deh213DHandlerTable+0xc): undefined reference to `_deh_beg' src/rt/deh2.d:(.text._D2rt4deh213__eh_finddataFPvZPS2rt4deh213DHandlerTable+0x13): undefined reference to `_deh_end' src/rt/deh2.d:(.text._D2rt4deh213__eh_finddataFPvZPS2rt4deh213DHandlerTable+0x37): undefined reference to `_deh_end' /home/steves/dmd-2.051/linux/bin/../lib/libphobos2.a(thread_eb_258.o): In function `_D4core6thread6Thread6__ctorMFZC4core6thread6Thread': src/core/thread.d:(.text._D4core6thread6Thread6__ctorMFZC4core6thread6Thread+0x1d): undefined reference to `_tlsend' src/core/thread.d:(.text._D4core6thread6Thread6__ctorMFZC4core6thread6Thread+0x24): undefined reference to `_tlsstart' /home/steves/dmd-2.051/linux/bin/../lib/libphobos2.a(thread_ee_6e4.o): In function `thread_attachThis': src/core/thread.d:(.text.thread_attachThis+0x53): undefined reference to `_tlsstart' src/core/thread.d:(.text.thread_attachThis+0x5c): undefined reference to `_tlsend' /home/steves/dmd-2.051/linux/bin/../lib/libphobos2.a(thread_e8_713.o): In function `thread_entryPoint': src/core/thread.d:(.text.thread_entryPoint+0x29): undefined reference to `_tlsend' src/core/thread.d:(.text.thread_entryPoint+0x2f): undefined reference to `_tlsstart' collect2: ld returned 1 exit status --- errorlevel 1 Let's not pretend that generations of c/C++ coders are going to attribute this slew of errors to a missing main function. The first time I see this, I'm going to think I missed something else. I understand that to fix this, we need the linker to be more helpful, or we need to make dmd more helpful. I don't know how much effort it is, or how much it's worth it, I just wanted to point out that your statement about equivalence to C++ is stretching it. I personally think we need to get the linker to demangle symbols better. That would go a long way... -Steve
Re: tooling quality and some random rant
On 2011-02-13 20:12, Walter Bright wrote: Vladimir Panteleev wrote: On Sun, 13 Feb 2011 20:26:50 +0200, Walter Bright wrote: golgeliyele wrote: I don't think C++ and gcc set a good bar here. Short of writing our own linker, we're a bit stuck with what ld does. That's not true. The compiler has knowledge of what symbols will be passed to the linker, and can display its own, much nicer error messages. I've mentioned this in our previous discussion on this topic. Not without reading the .o files passed to the linker, and the libraries, and figuring out what would be pulled in from those libraries. In essence, the compiler would have to become a linker. It's not impossible, but is a tremendous amount of work in order to improve one error message, and one error message that generations of C and C++ programmers are comfortable dealing with. I agree with you here except for the last sentence. Please stop saying it's ok just because it's ok in C/C++. Isn't that why we use D, because we're not satisfied with C/C++. -- /Jacob Carlborg
Re: tooling quality and some random rant
On 2011-02-13 19:42, Vladimir Panteleev wrote: On Sun, 13 Feb 2011 20:26:50 +0200, Walter Bright wrote: golgeliyele wrote: I don't think C++ and gcc set a good bar here. Short of writing our own linker, we're a bit stuck with what ld does. That's not true. The compiler has knowledge of what symbols will be passed to the linker, and can display its own, much nicer error messages. I've mentioned this in our previous discussion on this topic. Would the compiler be able to figure out if you build a library or an executable? -- /Jacob Carlborg
Re: tooling quality and some random rant
On 2/14/11, Don wrote: > > Why is that a "warning"? > Why on earth does it create a corrupt exe file, instead of reporting an > error??? > I've no idea. But Optlink actually has a switch you can use to disable outputting corrupt executables. I've no idea what the use case for this is.
Re: tooling quality and some random rant
Mon, 14 Feb 2011 04:44:43 +0200, so wrote: >> Unfortunately DMC is always out of the question because the performance >> is 10-20 (years) behind competition, fast compilation won't help it. > > Can you please give a few links on this? What kind of proof you need then? Just take some existing piece of code with high performance requirements and compile it with dmc. You lose. http://biolpc22.york.ac.uk/wx/wxhatch/wxMSW_Compiler_choice.html http://permalink.gmane.org/gmane.comp.lang.c++.perfometer/37 http://lists.boost.org/boost-testing/2005/06/1520.php http://www.digitalmars.com/d/archives/c++/chat/66.html http://www.drdobbs.com/cpp/184405450 Many of those are already old. GCC 4.6, LLVM 2.9, and ICC 12 are much faster, especially on multicore hardware. A quick look at the DMC changelog doesn't reveal any significant new optimizations during the past 10 years except some Pentium 4 opcodes and fixes at the library level. I rarely see a benchmark where DMC produces the fastest code. In addition, most open source projects are not compatible with DMC's toolchain out of the box. If execution performance of the generated code is your top priority, I wouldn't recommend using DigitalMars products.
Re: tooling quality and some random rant
Andrej Mitrovic wrote: Don't forget DLLs. But why not just change the linker error message from: OPTLINK : Warning 134: No Start Address to: OPTLINK : Warning 134: No Start Address "Are you missing a main() function?" Why is that a "warning"? Why on earth does it create a corrupt exe file, instead of reporting an error???
Re: tooling quality and some random rant
Don't forget DLLs. But why not just change the linker error message from: OPTLINK : Warning 134: No Start Address to: OPTLINK : Warning 134: No Start Address "Are you missing a main() function?"
Re: tooling quality and some random rant
On Sun, 13 Feb 2011 21:12:02 +0200, Walter Bright wrote: Vladimir Panteleev wrote: On Sun, 13 Feb 2011 20:26:50 +0200, Walter Bright wrote: golgeliyele wrote: I don't think C++ and gcc set a good bar here. Short of writing our own linker, we're a bit stuck with what ld does. That's not true. The compiler has knowledge of what symbols will be passed to the linker, and can display its own, much nicer error messages. I've mentioned this in our previous discussion on this topic. Not without reading the .o files passed to the linker, and the libraries, and figuring out what would be pulled in from those libraries. In essence, the compiler would have to become a linker. You are trying to solve a much bigger problem, which indeed sounds like a lot of effort for something so insignificant. What I'm talking about is much simpler. Let's take two cases which will cover over 99% of such cases when using DMD. In both cases, the user only passes .d files to DMD, no extra .obj or .lib files, as is the case most of the time: 1) The user forgot to declare main(). If you don't pass the -c or -lib switches to the compiler, it's reasonable to expect that the user wants to compile and link an executable. But DMD knows that there is no D main() symbol in the files passed to it! So it can print a nice error message without having to run the linker to print its ugly one. 2) The user didn't pass all of his program's modules to the compiler. By far the most common cause, we've discussed this one before. It only requires knowing if a certain module is part of the standard library or not. Even simply doing it for modules present in the current directory would help. I know it's not consistent, but neither is import hinting for certain standard library functions, and both are great ideas. -- Best regards, Vladimirmailto:vladi...@thecybershadow.net
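A rough sketch, in D, of the first check Vladimir describes. This is not dmd's actual front end; the type and function names are hypothetical, and it only assumes the driver already knows, per module, whether semantic analysis saw a main() declaration and whether -c or -lib was passed:

    import std.algorithm : any;

    // Hypothetical summary of what the driver knows about each module it compiled.
    struct CompiledModule
    {
        string name;
        bool definesMain;   // set while doing semantic analysis
    }

    // Run just before invoking the linker, and only when an executable is
    // being built (i.e. neither -c nor -lib was given).
    void checkForMain(const CompiledModule[] modules)
    {
        if (!modules.any!(m => m.definesMain))
            throw new Exception("Error: none of the modules passed on the " ~
                "command line declares a main() function; add one, or pass " ~
                "-c / -lib if you did not intend to link an executable.");
    }

    void main()
    {
        // Simulate a build where no module defines main(): this throws with
        // the friendly message above instead of a pile of linker errors.
        checkForMain([CompiledModule("app", false), CompiledModule("util", false)]);
    }

The second case (a module that is imported but never passed on the command line) additionally needs the import graph, but the same pre-link hook would be the natural place to report it.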
Re: tooling quality and some random rant
Walter Bright wrote: > Michel Fortin wrote: >> But note I was replying to your reply to Denis who asked specifically >> for demangled names for missing symbols. This by itself would be a >> useful improvement. > > I agree with that, but there's a caveat. I did such a thing years ago for > C++ and Optlink. Nobody cared, including the people who asked for that > feature. It's a bit demotivating to bother doing that again. Let me take the opportunity to say I care about an unrelated usability feature: the spelling suggestion. However small it's pretty nice so thanks for doing that.
Re: tooling quality and some random rant
On 2011-02-13 18:36, Andrej Mitrovic wrote: On 2/13/11, Alan Smithee wrote: You can do the same in D using .di files. Except no one really does that because such an approach is insanely error prone. E.g. with classes, you need to copy entire definitions. Change any ordering, forget a field, change a type, and you're having undefined behavior. Could you elaborate on that? Aren't .di files supposed to be auto-generated by the compiler, and not hand-written? Yes, but they don't always work. -- /Jacob Carlborg
Re: tooling quality and some random rant
On 2011-02-13 18:19, golgeliyele wrote: p.s.: Does anyone know what the best way to use this newsgroup is? Is there a better web interface? If not, is there a free newsgroup (on a Mac) reader that is easy to use? I'm using Thunderbird. -- /Jacob Carlborg
Re: tooling quality and some random rant
On 2011-02-13 13:24, Nick Sabalausky wrote: "Peter Alexander" wrote in message news:ij8a8p$2gqv$1...@digitalmars.com... On 13/02/11 10:10 AM, Peter Alexander wrote: On 13/02/11 6:52 AM, Nick Sabalausky wrote: D compiles a few orders of magnitude faster than C++ does. Better handling of incremental building might be nice for really large projects, but it's really not a big issue for D, not like it is for C++. The only person I know that's worked on large D projects is Tomasz, and he claimed that he was getting faster compile times in C++ due to being able to do incremental builds. "Walter might claim that DMD is fast, but it’s not exactly blazing when you confront it with a few hundred thousand lines of code. With C/C++, you’d split your source into .c and .h files, which mean that a localized change of a .c file only requires the compilation of a single unit. Take an incremental linker as well, and C++ compiles faster than D. With D you often have the situation of having to recompile everything upon the slightest change." (http://h3.gd/devlog/?p=22) Turns out this may have been solved: https://bitbucket.org/h3r3tic/xfbuild/wiki/Home The problem that xfbuild ended up running into is that DMD puts the generated code for instantiated templates into an unpredictable object file. This leads to situations where certain functions end up being lost from the object files unless you do a full rebuild. Essentially it breaks incremental compilation. There's a detailed explanation of it somewhere on the xfbuild site. Walter has said in a thread here that if you build with the -lib option it will output all templates into all object files. -- /Jacob Carlborg
Re: tooling quality and some random rant
On 2011-02-13 16:07, Gary Whatmore wrote: Paulo Pinto Wrote: "Nick Sabalausky" wrote in message news:ij7v76$1q4t$1...@digitalmars.com... ... (cutted) ... That's not the compiler, that's the linker. I don't know what linker DMD uses on OSX, but on Windows it uses OPTLINK which is written in hand-optimized Asm so it's really hard to change. But Walter's been converting it to C (and maybe then to D once that's done) bit-by-bit (so to speak), so linker improvements are at least on the horizon. ... Why C and not directly D? It is really bad adversting for D to know that when its creator came around to rewrite the linker, Walter decided to use C instead of D. I'm guessing that Walter feels more familiar and comfortable developing C/C++ instead of D. He's the creator of D, but has written very small amounts of D and probably cannot write idiomatic D very fluently. Another issue is the immature toolchain. This might sound like blasphemy, but I believe the skills and knowledge for developing large scale applications in language XYZ cannot be extrapolated from small code snippets or from experience with projects in other languages. You just need to eat your own dogfood and get your feet wet by doing. People like the Tango's 'kris' and this 'h3r3tic' are the real world D experts. Sadly they've all left D. We need a new generation of experts, because these old guys ranting about every issue are more harmful than good to the community. Kris is still around. -- /Jacob Carlborg
Re: tooling quality and some random rant
On 02/14/2011 02:29 AM, Denis Koroskin wrote: On Mon, 14 Feb 2011 02:01:53 +0300, Walter Bright wrote: Michel Fortin wrote: But note I was replying to your reply to Denis who asked specifically for demangled names for missing symbols. This by itself would be a useful improvement. I agree with that, but there's a caveat. I did such a thing years ago for C++ and Optlink. Nobody cared, including the people who asked for that feature. It's a bit demotivating to bother doing that again. Many people are unthankful by nature. They tell about missing features while taking existing ones as granted. It doesn't mean no one cares about them. If no one would care, why would we even discuss those features? Very often, heavily discussed designs are somewhat good. When they are truly bad, one does not even know where or how to start criticising... We just feel their wrongness, but expressing it is hard, let alone proposing improvements; so we wish for a blank page. Good designs show their bugs much more obviously; everyone can join the critique dance ;-) Denis -- _ vita es estrany spir.wikidot.com
Re: tooling quality and some random rant
golgeliyele wrote: I am relatively new to D. As a long time C++ coder, I love D. Recently, I have started doing some coding with D. One of the things that bothered me was the 'perceived' quality of the tooling. There are some relatively minor things that make the tooling look bad. The error reporting has issues as well. I noticed that the compiler leaks low level errors to the user. If you forget to add a main to your app or misspell it, you get errors like: Undefined symbols: "__Dmain", referenced from: _D2rt6dmain24mainUiPPaZi7runMainMFZv in libphobos2.a(dmain2_513_1a5.o) I mean, wow, this should really be handled better. Not solvable in general, but still solvable in the cases that matter. Created a bug report: http://d.puremagic.com/issues/show_bug.cgi?id=5573
Re: tooling quality and some random rant
On Sun, 13 Feb 2011 19:47:30 +0200, Alan Smithee wrote: Gary Whatmore Wrote (fixed that for you): Let's try to act reasonable here. Walter fanboyism is already getting old and sadly favored by our famous NG trolls, that is pretty much everyone here. I wouldn't be shocked to hear this Gary Whatmore will be bashing D in about 2 years' time when he realizes how naive he has been. The creators haven't even attempted eating their own dog food. On the other hand it's crystal clear that such a task as writing a language and its compiler without any support from anyone is the very definition of "Not Invented Here" that only a handful of developers are willing to pursue on this planet. As a result D is one of the most broken languages ever built. I honestly wish we would sometimes question Walter's competence. He only has so much time. All this love talk here blinds even more potential users. We would already have a working compiler if they didn't want to reinvent everything. This love talk exists just because some people occasionally insult people (especially Walter) with no basis whatsoever. You might come here, state your problems and opinions, and propose solutions if you have them in mind. But no, they prefer bitching and insulting. People might respect Walter, and this might shade into "fanboyism". On the other hand, insulting him is disgusting and baseless. Is he forcing anyone else to use D? He is just minding his own business as far as I can see. If you think something is broken, prove it and try to find a solution. If the community doesn't help you, leave them to their misery; there are other languages after all. One thing you are right about is that languages are designed by "designers"; it has been like this for a long time.
Re: tooling quality and some random rant
Sorry this was a completely unintentional error --- I meant to say "in case anyone doubts Gary's post". Blame the lateness of the night and/or my annoyingly lossy wireless keyboard. Kevin
Re: tooling quality and some random rant
> our famous Reddit trolls, that is retard = uriel = eternium = lurker In case anyone doubts gay's guess... for those who don't follow entertainment trivia, Alan Smithee is a pseudonym used by directors disowning a film (google it). So anyone using this name is actually effectively *claiming* to be an imposter. K
Re: tooling quality and some random rant
On 2/13/11 2:05 PM, Walter Bright wrote: golgeliyele wrote: 2. dmd compiler's command line options: This is mostly an esthetic issue. However, it is like the entrance to your house. People who are not sure about entering care about what it looks like from the outside. If Walter is willing, I can work on a command line options interface proposal that would keep backwards compatibility with the existing options. This would enable a staged transition. Would there be an interest in this? A proposal would be nice. But please keep in mind that people often view their build systems / makefiles as black boxes, and breaking them with incompatible changes can be extremely annoying. Here is one proposal:

Digital Mars D Compiler v2.051
Copyright (c) 1999-2010 by Digital Mars written by Walter Bright
Documentation: http://www.digitalmars.com/d/2.0/index.html

Usage: dmd [options] D source files

Options:
  --commands                read arguments from a command file
  -c, --compile             only compile, do not link
  --coverage                do code coverage analysis
  -D, --ddoc                generate documentation
  --ddoc-dir                write documentation file to a directory
  --ddoc-file               write documentation file to a file
  -d, --deprecated          allow deprecated features
  --debug                   compile in debug code
  --debug-level             compile in debug code <= level
  --debug-ident             compile in debug code identified by ident
  --debug-lib               set symbolic debug library to name
  --default-lib             set default library to name
  --dependencies            write module dependencies to a file
  --dylib                   generate dylib
  -g, --sym-debug           add symbolic debug info
  --sym-debug-c             add symbolic debug info, pretend to be C
  -H, --header              generate 'header' file
  --header-dir              write 'header' file to a directory
  --header-file             write 'header' file to a file
  --help                    print this help
  -I, --imports             where to look for imports
  --ignore-bad-pragmas      ignore unsupported pragmas
  --inline                  do function inlining
  -J, --string-imports      where to look for string imports
  -L, --linker-flags        pass flags to the linker
  --lib                     generate library rather than object files
  --man                     open web browser on manual page
  --linker-map              generate linker .map file
  --no-bounds-check         turns off array bounds checking
  --no-float                do not emit reference to floating point
  -O, --optimize            optimize
  -n, --no-object-file      do not write object file
  --object-dir              write object, library files to a directory
  --output                  name output file to a file name
  --no-path-strip           do not strip paths from source file
  --profile                 profile runtime performance of code
  --quiet                   suppress unnecessary messages
  --release                 compile release version
  --run                     run resulting program file, passing args
  --unittest                compile in unit tests
  -v, --verbose             verbose
  --version                 compile in version >= level
  --version                 compile in version identified by ident
  --tls-vars                list all variables going into thread local storage
  -w, --warnings            enable warnings
  -W, --info-warnings       enable informational warnings
  -X, --json                generate JSON file
  --json-file               write JSON file to a given file
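For what it is worth, a sketch of how such long/short aliases can be expressed with Phobos' std.getopt (using its current interface, which is richer than what Phobos shipped in 2011); the handful of flags shown here are illustrative only and are not claimed to match dmd's real driver:

    import std.getopt;
    import std.stdio;

    void main(string[] args)
    {
        bool compileOnly, release, unittests;
        string objectDir;

        // "compile|c" makes --compile and -c synonyms, which is the kind of
        // aliasing a staged transition between old and new spellings needs.
        auto result = getopt(args,
            "compile|c",  "only compile, do not link",         &compileOnly,
            "release",    "compile release version",           &release,
            "unittest",   "compile in unit tests",             &unittests,
            "object-dir", "write object files to a directory", &objectDir);

        if (result.helpWanted)
            defaultGetoptPrinter("Usage: dmd [options] D source files", result.options);
        else
            writeln("compileOnly=", compileOnly, ", release=", release);
    }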
Re: tooling quality and some random rant
Unfortunately DMC is always out of the question because the performance is 10-20 behind competition, fast compilation won't help it. Can you please give a few links on this?
Re: tooling quality and some random rant
Denis Koroskin wrote: On Mon, 14 Feb 2011 02:01:53 +0300, Walter Bright wrote: Michel Fortin wrote: But note I was replying to your reply to Denis who asked specifically for demangled names for missing symbols. This by itself would be a useful improvement. I agree with that, but there's a caveat. I did such a thing years ago for C++ and Optlink. Nobody cared, including the people who asked for that feature. It's a bit demotivating to bother doing that again. Many people are unthankful by nature. They tell about missing features while taking existing ones as granted. It doesn't mean no one cares about them. If no one would care, why would we even discuss those features? Tellingly, I accidentally broke that feature, and nobody complained about that, either.
Re: tooling quality and some random rant
On Mon, 14 Feb 2011 02:01:53 +0300, Walter Bright wrote: Michel Fortin wrote: But note I was replying to your reply to Denis who asked specifically for demangled names for missing symbols. This by itself would be a useful improvement. I agree with that, but there's a caveat. I did such a thing years ago for C++ and Optlink. Nobody cared, including the people who asked for that feature. It's a bit demotivating to bother doing that again. Many people are unthankful by nature. They tell about missing features while taking existing ones as granted. It doesn't mean no one cares about them. If no one would care, why would we even discuss those features?
Re: tooling quality and some random rant
Sun, 13 Feb 2011 15:06:46 -0800, Brad Roberts wrote: > On 2/13/2011 3:01 PM, Walter Bright wrote: >> Michel Fortin wrote: >>> But note I was replying to your reply to Denis who asked specifically >>> for demangled names for missing symbols. This by itself would be a >>> useful improvement. >> >> I agree with that, but there's a caveat. I did such a thing years ago >> for C++ and Optlink. Nobody cared, including the people who asked for >> that feature. It's a bit demotivating to bother doing that again. > > No offense, but this argument gets kinda old and it's incredibly weak. > > Today's tooling expectations are higher. The audience isn't the same. > And clearly people are asking for it. Even the past version of it I > highly doubt no one cared, you just didn't hear from those that liked > it. After all, few people go out of their way to talk about what they > like, just what they don't. Half of the readers have already added me to their killfile, but here goes some on-topic humor: http://www.winandmac.com/wp-content/uploads/2010/03/ipad-hp-fail.jpg Sometimes people don't yet know what they want. For example the reason we write portable C++ in some projects is that it's easier to switch between VC++, ICC, GCC, and LLVM. Whichever produces best performing code. Unfortunately DMC is always out of the question because the performance is 10-20 behind competition, fast compilation won't help it.
Re: tooling quality and some random rant
On 2/13/2011 3:01 PM, Walter Bright wrote: > Michel Fortin wrote: >> But note I was replying to your reply to Denis who asked specifically for >> demangled names for missing symbols. This by >> itself would be a useful improvement. > > I agree with that, but there's a caveat. I did such a thing years ago for C++ > and Optlink. Nobody cared, including the > people who asked for that feature. It's a bit demotivating to bother doing > that again. No offense, but this argument gets kinda old and it's incredibly weak. Today's tooling expectations are higher. The audience isn't the same. And clearly people are asking for it. Even the past version of it I highly doubt no one cared, you just didn't hear from those that liked it. After all, few people go out of their way to talk about what they like, just what they don't. Later, Brad
Re: tooling quality and some random rant
Michel Fortin wrote: But note I was replying to your reply to Denis who asked specifically for demangled names for missing symbols. This by itself would be a useful improvement. I agree with that, but there's a caveat. I did such a thing years ago for C++ and Optlink. Nobody cared, including the people who asked for that feature. It's a bit demotivating to bother doing that again.
Re: tooling quality and some random rant
On 2011-02-13 16:37:19 -0500, Walter Bright said: Michel Fortin wrote: Parsing error messages is a problem indeed. But demangling symbol names is easy. Demangling doesn't get us where golgeliyele wants to go. Correct. But note I was replying to your reply to Denis who asked specifically for demangled names for missing symbols. This by itself would be a useful improvement. -- Michel Fortin michel.for...@michelf.com http://michelf.com/
Re: tooling quality and some random rant
On Sun, Feb 13, 2011 at 4:35 PM, Alan Smithee wrote: > Nick Sabalausky Wrote: > >> "Perhaps"? Well, is it or isn't it? Are we supposed to just assume > that lack of use means it's actually broken and not just unpopular? > > Assume it's broken or demonstrate large projects written in D to > show that it CAN be unpopular because something else makes up for > it. How's about all of druntime (at least on my self-built Linux DMD). >> Then contribute instead of just flaming. > > I'm 12 years old and what is this? Your language is flawed, you > don't see it - do not want. > Honestly, I agree with Nick here (which is somewhat rare, actually): You're in the D mailing lists and you don't want to use D. Why, then, are you here?
Re: tooling quality and some random rant
Agreed. These things might make D appear like less of a joke, thus attracting more hapless users to their subsequent dismay.
Re: tooling quality and some random rant
On 02/13/2011 10:35 PM, Nick Sabalausky wrote: "spir" wrote in message news:mailman.1602.1297626622.4748.digitalmar...@puremagic.com... Also, I really miss a D-for-D lexical, syntactic, and semantic analyser that would produce D data structures. This would open the door to hordes of projects, including tool chain elements, meta-studies on D, improvements of these basic tools (efficiency, semantic analysis), development of back-ends (including studies on compiler optimisation specific to D's semantics), etc. Even more important, the whole community, which is imo rather high-level, would be able to take part in such challenges, in their favorite language. Isn't it ironic that D depends so much on C++, while many programmers come to D fed up with precisely that language? DDMD: http://www.dsource.org/projects/ddmd Definitely a good thing, and more! :-) Thank you for the pointer, Nick. I will skim across the project as soon as I have some hours free, and see if --with my very limited competence in the domain-- I can contribute in any way. I have an idea for a side-feature if I can understand the produced AST: generate Types as D data structures on request (--meta) and write them into a plain D module to be imported when needed. A major aspect, I guess, of the 'meta' namespace discussed on this list. Denis -- _ vita es estrany spir.wikidot.com
Re: tooling quality and some random rant
Nick Sabalausky Wrote: > "Perhaps"? Well, is it or isn't it? Are we supposed to just assume that lack of use means it's actually broken and not just unpopular? Assume it's broken or demonstrate large projects written in D to show that it CAN be unpopular because something else makes up for it. > Just like you're doing? If you're sure that .di files are broken, then *show us* how. People did - go figure. A swing of Walter's magical wand saying "everything is OK!" seems to suffice for fanboys. Until they disappear realizing the miasma surrounding D. Like this bloke: http://www.jfbillingsley.com/blog/?p=53 > Then contribute instead of just flaming. I'm 12 years old and what is this? Your language is flawed, you don't see it - do not want.
Re: tooling quality and some random rant
Michel Fortin wrote: Parsing error messages is a problem indeed. But demangling symbol names is easy. Demangling doesn't get us where golgeliyele wants to go.
Re: tooling quality and some random rant
"spir" wrote in message news:mailman.1602.1297626622.4748.digitalmar...@puremagic.com... > > Also, I really miss a D for D lexical- syntactic- semantic- analyser that > would produce D data structures. This would open the door hoards of > projects, including tool chain elements, meta-studies on D, improvements > of these basic tools (efficiency, semantis analysis), decelopment of > back-ends (including studies on compiler optimisation specific to D's > semantics), etc. > Even more important, the whole cummunity, which is imo rather high-level, > would be able to take part to such challenges, in their favorite language. > Isn't is ironic D depends so much on C++, while many programmers come to D > fed up with this language, presicely? > DDMD: http://www.dsource.org/projects/ddmd
Re: tooling quality and some random rant
"Alan Smithee" wrote in message news:ij967s$12rb$1...@digitalmars.com... > Andrej Mitrovic Wrote: > >> Could you elaborate on that? Aren't .di files supposed to be auto- > generated by the compiler, and not hand-written? > > Yea, aren't they? How come no one uses that feature? Perhaps it's > intrinsically broken? *hint hint* > "Perhaps"? Well, is it or isn't it? Are we supposed to just assume that lack of use means it's actually broken and not just unpopular? > > This NG assumes a curious stance. Sprouting claims and standing by > them until they're shown invalid, and then some. Just like you're doing? If you're sure that .di files are broken, then *show us* how. > > "But it takes time!" ... uh, yea, how's for 11 years? Or at least 4 > which D has been past the 1.0 version. How many people gave up on > their med/large projects and moved to "lesser" languages in this > span? Then contribute instead of just flaming.
Re: tooling quality and some random rant
Hi, this is what I miss in D and Go. Most developers that only used C and C++ aren't aware how easy it is to compile applications in more modern languages. It is funny that both D and Go advertise their compilation speed, when I was used to fast compilation since the MS-DOS days with Turbo Pascal. JVM and .Net based languages have editors that do compile on save. Most game studios that have changed from C++ to C# and Java as main development language always cite the productivity gain in the compile-test-debug cycle. I was a bit disappointed to find out that both Go and D still propose a compiler/linker model. -- Paulo "charlie" wrote in message news:ij95ge$119o$1...@digitalmars.com... > golgeliyele Wrote: > >> It is a mistake to consider the language without the tooling that goes >> along with it. I think there is still time to recover from >> this error. Large projects are often build as a series of libraries. When >> the shared library problem is to be attacked, I think >> the tooling needs to be part of that design. Solving the tooling problem >> will raise D to one level up and I hope the >> community will step up to the challenge. > > So far D 1.0 development has forced me to study the compiler and library > internals much more than I could ever imagine. Had 10 years of Pascal, > Delphi, and Java programming under my belt, but never really knew what's > the difference between a compiler frontend and compiler. I knew the linker > though, but couldn't imagine there could be so many incompatibilities. > > For example the Delphi community has a large set of commonly used > libraries for the casual user. I also ended up learning a great deal of > regexps because my editor didn't support D and don't feel awkward reading > dmd internals such as cod2.c or mtype.c now. This was all necessary to use > D in a simple GUI project and to sidestep common bugs. > > I really like D. The elegance of the language can be blamed for the most > part. In retrospect, I ended up running into more bugs than ever before > and spent more time than with any other SDK. However it was so fun that it > really wasn't a problem. Basically if you're using D at work, I recommend > studying the libraries and finding workaround for bugs at home. This way > you won't be spending too much time fighting the tool chain in > professional context and get extra points from the voluntarily open source > hobby. It also helps our community. > > This newsgroup's a valuable source of information. Read about tuning of > JVM, race cars, rocket science, CRT monitors, and DVCS here. We don't > always have to discuss grave business matters.
Re: tooling quality and some random rant
Hi, now I am conviced. Thanks for the explanation. -- Paulo "Walter Bright" wrote in message news:ij99gb$18fm$1...@digitalmars.com... > Paulo Pinto wrote: >> Why C and not directly D? >> >> It is really bad adversting for D to know that when its creator came >> around to rewrite the linker, Walter decided to use C instead of D. > > That's a very good question. > > The answer is in the technical details of transitioning optlink from an > all assembler project to a higher level language. I do it function by > function, meaning there will be hundreds of "hybrid" versions that are > partly in the high level language, partly in asm. Currently, it's around > 5% in C. > > 1. Optlink has its own "runtime" system and startup code. With C, and a > little knowledge about how things work under the hood, it's easier to > create "headless" functions that require zero runtime and startup support. > With D, the D compiler will create ModuleInfo and TypeInfo objects, which > more or less rely on some sort of D runtime existing. > > 2. The group/segment names emitted by the C compiler match what Optlink > uses. It matches what dmd does, too, except that dmd emits more such > names, requiring more of an understanding of Optlink to get them in the > right places. > > 3. The hybrid intermediate versions require that the asm portions of > Optlink be able to call the high level language functions. In order to > avoid an error-prone editting of scores of files, it is very convenient to > have the function names used by the asm code exactly match the names > emitted by the compiler. I accomplished this by "tweaking" the dmc C > compiler. I didn't really want to mess with the D compiler to do the same. > > 4. Translating asm to a high level language starts with a rote > translation, i.e. using goto's, raw pointers, etc., which match 1:1 with > the assembler logic. No attempt is made to infer higher level logic. This > makes mistakes in the translation easier to find. But it's not the way > anyone in their right mind would develop C code. The higher level > abstractions in C are not useful here, and neither are the higher level > abstractions in D. > > Once the entire Optlink code base has been converted, then it becomes a > simple process to: > > 1. Dump the Optlink runtime, and switch to the C runtime. > > 2. Translate the C code to D. > > And then: > > 3. Refactor the D code into higher level abstractions. > > > I've converted a massive code base from asm to C++ before (DASH for Data > I/O) and I discovered that attempting to refactor the code while > translating it is fraught with disaster. Doing the hybrid approach is much > faster and more likely to be successful. > > > TL,DR: The C version is there only as a transitional step, as it's > somewhat easier to create a hybrid asm/C code base than a hybrid asm/D > one. The goal is to create a D version.
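To make the "rote translation first, refactor later" idea concrete, here is a hedged sketch in D (invented for illustration; it is not Optlink code). The first version mirrors a hypothetical assembler loop one-to-one with a label, gotos, and raw pointers, the style the rote-translation step describes; the second is what the later refactoring pass might turn it into:

    // Stage "hybrid": rote, 1:1 translation of a hypothetical assembler loop.
    size_t countZeroBytesRote(const(ubyte)* p, size_t n)
    {
        size_t count = 0;
    loop:
        if (n == 0) goto done;
        if (*p == 0) count++;
        p++;
        n--;
        goto loop;
    done:
        return count;
    }

    // Stage "refactor": the same routine rewritten as idiomatic D.
    size_t countZeroBytes(const(ubyte)[] data)
    {
        import std.algorithm : count;
        return data.count(0);
    }

    void main()
    {
        const(ubyte)[] data = [1, 0, 2, 0, 0];
        assert(countZeroBytesRote(data.ptr, data.length) == 3);
        assert(countZeroBytes(data) == 3);
    }

The point of keeping the first form around during the transition is exactly the one made above: a mechanical, easily checked correspondence with the assembler, with the cleanup deferred until nothing is hybrid any more.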
Re: tooling quality and some random rant
On 02/13/2011 08:30 PM, Walter Bright wrote:
1. people just check out when they see pages and pages of wacky switches. Has anyone ever actually read all of man gcc?

+ 12_000 /lines/ in my version

Denis -- _ vita es estrany spir.wikidot.com
Re: tooling quality and some random rant
On 2011-02-13 14:38:20 -0500, Walter Bright said:

Denis Koroskin wrote: It's not impossible, but it is a tremendous amount of work in order to improve one error message, and one error message that generations of C and C++ programmers are comfortable dealing with.

What's wrong with parsing low-level linker error messages and outputting them in human-readable form? E.g. demangling missing symbols.

Yes, that can be done. The downside is that since dmd does not control what linker the user has, it becomes a constant source of problems trying to keep it working, as it constantly breaks with linker changes and an arbitrarily long list of linkers on various distributions.

Parsing error messages is a problem indeed. But demangling symbol names is easy. Try this:

    dmd ... 2>&1 | ddemangle

With ddemangle being a compiled version of this program:

    import std.stdio;
    import core.demangle;

    void main()
    {
        foreach (line; stdin.byLine())
        {
            // Find the first substring that looks like a mangled D symbol:
            // it starts with '_' followed by 'D', and ends at a space or a
            // quote character.
            size_t beginIdx, endIdx;
            enum State { searching_, searchingD, searchingEnd, done }
            State state;

            foreach (i, char c; line)
            {
                switch (state)
                {
                    case State.searching_:
                        if (c == '_') { beginIdx = i; state = State.searchingD; }
                        break;
                    case State.searchingD:
                        if (c == 'D') state = State.searchingEnd;
                        else if (c != '_') state = State.searching_;
                        break;
                    case State.searchingEnd:
                        if (c == ' ' || c == '"' || c == '\'') { endIdx = i; state = State.done; }
                        break;
                    default: // State.done is handled right after the switch
                        break;
                }
                if (state == State.done) break;
            }

            // Replace the mangled symbol with its demangled form, if one was found.
            if (endIdx > beginIdx)
                writeln(line[0..beginIdx], demangle(line[beginIdx..endIdx]), line[endIdx..$]);
            else
                writeln(line);
        }
    }

-- Michel Fortin michel.for...@michelf.com http://michelf.com/
Re: tooling quality and some random rant
On 02/13/2011 07:53 PM, Walter Bright wrote:

Paulo Pinto wrote:
Why C and not directly D?
It is really bad advertising for D to know that when its creator came around to rewriting the linker, Walter decided to use C instead of D.

That's a very good question.

The answer is in the technical details of transitioning Optlink from an all-assembler project to a higher-level language. I do it function by function, meaning there will be hundreds of "hybrid" versions that are partly in the high-level language, partly in asm. Currently, it's around 5% in C.

1. Optlink has its own "runtime" system and startup code. With C, and a little knowledge of how things work under the hood, it's easier to create "headless" functions that require zero runtime and startup support. With D, the D compiler will create ModuleInfo and TypeInfo objects, which more or less rely on some sort of D runtime existing.

2. The group/segment names emitted by the C compiler match what Optlink uses. The same goes for dmd, except that dmd emits more such names, requiring more of an understanding of Optlink to get them in the right places.

3. The hybrid intermediate versions require that the asm portions of Optlink be able to call the high-level language functions. In order to avoid error-prone editing of scores of files, it is very convenient to have the function names used by the asm code exactly match the names emitted by the compiler. I accomplished this by "tweaking" the dmc C compiler. I didn't really want to mess with the D compiler to do the same.

4. Translating asm to a high-level language starts with a rote translation, i.e. using gotos, raw pointers, etc., which match 1:1 with the assembler logic. No attempt is made to infer higher-level logic. This makes mistakes in the translation easier to find. But it's not the way anyone in their right mind would develop C code. The higher-level abstractions in C are not useful here, and neither are the higher-level abstractions in D.

Once the entire Optlink code base has been converted, it becomes a simple process to:

1. Dump the Optlink runtime and switch to the C runtime.

2. Translate the C code to D.

And then:

3. Refactor the D code into higher-level abstractions.

I've converted a massive code base from asm to C++ before (DASH for Data I/O), and I discovered that attempting to refactor the code while translating it is fraught with disaster. The hybrid approach is much faster and more likely to be successful.

TL;DR: The C version is there only as a transitional step, as it's somewhat easier to create a hybrid asm/C code base than a hybrid asm/D one. The goal is to create a D version.

Great! Thank you very much for this clear & comprehensive explanation of the process, Walter. (*)

Denis

(*) I can understand what you mean about this 2-stage translation being easier, safer, and finally far more efficient, having done something similar, though probably at a smaller scale, in the field of automation, where languages are often even closer to asm than C, because much of the "memory" is in fact binary IO cards, directly accessed as is.

-- _ vita es estrany spir.wikidot.com
Re: tooling quality and some random rant
On 02/13/2011 04:07 PM, Gary Whatmore wrote:
This might sound like blasphemy, but I believe the skills and knowledge for developing large-scale applications in language XYZ cannot be extrapolated from small code snippets or from experience with projects in other languages. You just need to eat your own dogfood and get your feet wet by doing.

Precisely. A common route for the development of a static, compiled language (even more so for one intended as a systems programming language) is to "eat its own dogfood" by becoming its own compiler. From what I've heard, this is a great boost for the language's evolution, precisely because from then on the creators use their own language every day, instead of becoming more and more expert in another one.

Also, I really miss lexical, syntactic, and semantic analysers for D, written in D, that would produce D data structures. This would open the door to hordes of projects, including toolchain elements, meta-studies on D, improvements of these basic tools (efficiency, semantic analysis), development of back-ends (including studies on compiler optimisation specific to D's semantics), etc. Even more important, the whole community, which is imo rather high-level, would be able to take part in such challenges, in their favorite language. Isn't it ironic that D depends so much on C++, while many programmers come to D fed up with precisely that language?

Denis

-- _ vita es estrany spir.wikidot.com
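As a rough illustration of what "D analysers producing D data structures" could mean, here is a minimal, hypothetical sketch; the names are invented (this is not an existing library), and a real lexer would also handle keywords, comments, string literals, and Unicode.

    // Hypothetical sketch: a token type and a tiny lexer, both plain D.
    enum TokKind { identifier, intLiteral, symbol, eof }

    struct Token
    {
        TokKind kind;
        string  text;    // slice into the source
        size_t  line;
        size_t  column;
    }

    bool isIdentStart(char c) { return c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'); }
    bool isIdentChar(char c)  { return isIdentStart(c) || (c >= '0' && c <= '9'); }

    Token[] lex(string source)
    {
        Token[] toks;
        size_t line = 1, col = 1, i = 0;
        while (i < source.length)
        {
            char c = source[i];
            if (c == '\n') { line++; col = 1; i++; continue; }
            if (c == ' ' || c == '\t' || c == '\r') { col++; i++; continue; }
            size_t start = i;
            TokKind kind;
            if (isIdentStart(c))
            {
                while (i < source.length && isIdentChar(source[i])) i++;
                kind = TokKind.identifier;
            }
            else if (c >= '0' && c <= '9')
            {
                while (i < source.length && source[i] >= '0' && source[i] <= '9') i++;
                kind = TokKind.intLiteral;
            }
            else
            {
                i++;
                kind = TokKind.symbol;
            }
            toks ~= Token(kind, source[start .. i], line, col);
            col += i - start;
        }
        toks ~= Token(TokKind.eof, "", line, col);
        return toks;
    }

    unittest
    {
        auto toks = lex("int x = 42;");
        assert(toks[0].text == "int" && toks[3].text == "42");
    }

The point is only that the output is ordinary D arrays and structs, so any D programmer could build tools, metrics, or back-end experiments on top of it without touching C++.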
Re: tooling quality and some random rant
gölgeliyele wrote:
...
> I think what we need here is numbers from a project that everyone has access to. What is the largest D project right now? Can we get numbers on that? How much time does it take to compile that project after a change (assuming we are feeding all .d files at once)?

Well, you could take Phobos; I believe Andrei once used it for a comparison against Go. With std.datetime it is now also much bigger :) Tango is another large project; I remember someone posting a compilation speed of a couple of seconds (Tango is huge, perhaps 300 KLoC). But projects and settings may vary a lot. For sure, Optlink is one hell of a speed monster, and you might not get similar speeds with ld on a large project.
Re: tooling quality and some random rant
Denis Koroskin wrote:
It's not impossible, but it is a tremendous amount of work in order to improve one error message, and one error message that generations of C and C++ programmers are comfortable dealing with.

What's wrong with parsing low-level linker error messages and outputting them in human-readable form? E.g. demangling missing symbols.

Yes, that can be done. The downside is that since dmd does not control what linker the user has, it becomes a constant source of problems trying to keep it working, as it constantly breaks with linker changes and an arbitrarily long list of linkers on various distributions.
Re: tooling quality and some random rant
On Sun, 13 Feb 2011 22:12:02 +0300, Walter Bright wrote:

Vladimir Panteleev wrote:
On Sun, 13 Feb 2011 20:26:50 +0200, Walter Bright wrote:
golgeliyele wrote:
I don't think C++ and gcc set a good bar here.

Short of writing our own linker, we're a bit stuck with what ld does.

That's not true. The compiler has knowledge of what symbols will be passed to the linker, and can display its own, much nicer error messages. I've mentioned this in our previous discussion on this topic.

Not without reading the .o files passed to the linker, and the libraries, and figuring out what would be pulled in from those libraries. In essence, the compiler would have to become a linker. It's not impossible, but it is a tremendous amount of work in order to improve one error message, and one error message that generations of C and C++ programmers are comfortable dealing with.

What's wrong with parsing low-level linker error messages and outputting them in human-readable form? E.g. demangling missing symbols.
Re: tooling quality and some random rant
bearophile wrote:
Walter:
With D, the D compiler will create ModuleInfo and TypeInfo objects, which more or less rely on some sort of D runtime existing.

In LDC there are no_typeinfo (and maybe no_moduleinfo) pragmas to disable the generation of those for specific types/modules:
http://www.dsource.org/projects/ldc/wiki/Docs#no_typeinfo

    pragma(no_typeinfo) { struct Opaque {} }

If it's useful then something similar may be added to DMD too.

I think it's best to avoid such things.
Re: tooling quality and some random rant
gölgeliyele wrote:
Walter Bright wrote:
golgeliyele wrote:
1. Difficult-to-understand linker errors due to missing main(): ...

The problem is that main() can come from a library, or some other .obj file handed to the compiler that the compiler doesn't look inside. It's a very flexible way to build things, and trying to impose more order on that will surely wind up with complaints from some developers.

I would like to question this. Is there a D project where the technique of putting main() into a library has proved useful? I used this in a C++ project of mine, but I have regretted it already. I can imagine having a compiler option to avoid the pre-link check for main(), but I would suggest not even having that. Of course, unless we get to know what those complaints you mentioned are :)

I find that people have all kinds of ways they wish to use a compiler. Is it worth restricting all that just for the sake of one error message?

I have also tried to avoid adding endless command line switches as the solution to every variation people want. These cause several problems:

1. People just check out when they see pages and pages of wacky switches. Has anyone ever actually read all of man gcc?

2. Different compiler switches can have unexpected interactions and complications when used together. This is impossible to test for, as the combinations increase as the factorial of the number of switches.

3. People tend to copy/paste makefiles from one project to the next. They copy/paste the switches, too, usually with no idea what those switches do. I.e., they treat those switches as some sort of sacred incantation that they dare not change.
Re: tooling quality and some random rant
Walter:
> With D, the D compiler will create ModuleInfo and TypeInfo objects, which more or less rely on some sort of D runtime existing.

In LDC there are no_typeinfo (and maybe no_moduleinfo) pragmas to disable the generation of those for specific types/modules:
http://www.dsource.org/projects/ldc/wiki/Docs#no_typeinfo

    pragma(no_typeinfo) { struct Opaque {} }

If it's useful then something similar may be added to DMD too.

Bye,
bearophile
Re: tooling quality and some random rant
Daniel Gibson wrote:
> Am 13.02.2011 20:01, schrieb gölgeliyele:
>> I don't think supporting multiple compilation models is a good thing.
>
> I think incremental compilation is a very useful feature for large projects, so it should be available.
> Also, the possibility to link .o files generated from C code into D programs is a must, so only supporting the model of feeding all .d files to dmd is not an option.
>
> But the model of feeding all .d files to dmd is also very useful and should be possible.
>
> So *I* /do/ think that supporting multiple compilation models is a good thing :-)

Ok, I might have misspoken there. I am not against incremental compilation. What the heck, the lack of it is the reason I started the thread. However, I would like to see a coherent compilation model. Feeding all .d files to the compiler does not necessarily mean that it needs to be a from-scratch compilation. Isn't the need for tools like xfBuild an indication that something is wrong here? If you can point me to a write-up that describes how to set up incremental compilation for a large project, without using advanced tools like xfBuild, that would be very helpful.
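For reference, the simplest separate-compilation setup needs no special tools at all, only dmd's -c and -of switches. Here is a minimal, hypothetical Makefile sketch (file names invented, and it deliberately ignores import dependencies between modules):

    # Hypothetical Makefile fragment: recompile a module only when its own source changes.
    DMD  = dmd
    SRCS = app.d utils.d
    OBJS = $(SRCS:.d=.o)

    myapp: $(OBJS)
    	$(DMD) $(OBJS) -ofmyapp

    %.o: %.d
    	$(DMD) -c $< -of$@

The catch, and arguably the point of this thread, is that make alone knows nothing about D's import graph: a change in a widely imported module won't trigger the rebuilds it should unless import dependencies are generated as well (dmd can emit them with its -deps switch, which is what tools like xfBuild build on), or unless the whole project is simply passed to dmd at once.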
Re: tooling quality and some random rant
On 02/13/2011 01:59 PM, bearophile wrote:

Walter:
In C++, you get essentially the same thing from g++:

    /usr/lib/gcc/x86_64-linux-gnu/4.4.5/../../../../lib/crt1.o: In function `_start':
    (.text+0x20): undefined reference to `main'
    collect2: ld returned 1 exit status

A lot of people come here because they want a compiler+language better than C++ :-)

If you compile this:

    void main() { writeln("Hello world"); }

dmd has for some time shown an error fit for D newbies:

    test.d(2): Error: 'writeln' is not defined, perhaps you need to import std.stdio; ?

Probably many Python/JS/Perl/PHP/etc. programmers who may want to try D don't know what a linker is. When they want to develop a large multi-module D program, they must learn something about how a linker works. But D has to scale down to smaller programs too, where there is only one module or very few, written by non-experts in C-class languages. In this situation, more readable error messages, produced by dmd catching a basic error before the linker, are probably useful.

Couldn't have written this one better ;-)

Denis

-- _ vita es estrany spir.wikidot.com