On 19/05/17 02:54, Ryan Joseph wrote:
On May 18, 2017, at 10:40 PM, Jon Foster <jon-li...@jfpossibilities.com> wrote:

62.44      1.33     1.33    fpc_frac_real
26.76      1.90     0.57    MATH_$$_FLOOR$EXTENDED$$LONGINT
10.33      2.12     0.22    FPC_DIV_INT64
Thanks for profiling this.

Floor is there as I expected, and 26% is pretty extreme, but are the others floating point division?
How does Java handle this so much better than FPC, and what are the workarounds?
The Pascal test program that was benchmarked here contains a number of bugs/wrong translations from the C code (some stem from the original version, another one was added):
1) Casting a floating point number to an int in C does not round, but truncates (I think this may have been mentioned earlier in the thread; I didn't read everything).
2) The usage of floor in the test program is wrong. C's floor takes a floating point number and returns one. The math unit's floor function takes a floating point number and returns an integer. In the Pascal version, this integer is then converted back to a floating point number because the rest of that expression also uses floating point.
3) The Pascal version uses longword instead of int32 for a number of variables (that are "int" in the C version). This results in one expression getting evaluated as 64 bit on 32 bit systems, which is where the FPC_DIV_INT64 calls come from (that's a routine to perform 64 bit *integer* divisions on 32 bit platforms).
4) frac() is only used to get a monotonically increasing value as part of the data input for the test program. The C code (and the original Pascal version) uses a tick count and multiplies/divides that, which is much faster.

Then, there's one thing that can be done to optimize the Pascal version (after removing the bugs above):
1) Compile with SSE3 or higher, in particular because SSE3 can be used to implement trunc() with a single instruction (otherwise we go through a helper that uses the x87 FPU, which moreover has to reconfigure it to change the rounding mode and restore it afterwards). However, there does seem to be a bug in FPC 3.0.2 whereby compiling this program with -O2 -Cfsse3 causes it to crash, because it then loads data from a stack location that is only 8-byte aligned. It works fine when compiled with trunk and -O2 -Cfsse3, though (at least for 64 bit).

There's at least one minor twist on the classic "C compilers evaluate constant stuff at compile time":
1) oy and oz are constants. The "floor" function is a standard C library function, so C compilers know what it does and can evaluate it at compile time. Therefore, the oy-floor(oy) and oz-floor(oz) expressions are (equal) constants for C compilers.

Finally, there are two things FPC is definitely missing:
1) an SSE version of the int() function (which is the basis of a floating point version of floor()) (fairly specific to this program)
2) SSA support in loops (to make better use of SSE registers; related to Florian's note about the calling conventions). However, without the previous changes, even FPC code compiled to LLVM IR and then to machine code with Clang (and hence with full SSA support) performs even worse than the code compiled directly with FPC.

There are definitely more things (as I did not manage to get FPC's LLVM IR to compile to a version that's as fast as the LLVM IR generated from the C program), but I have already spent more time on this than is reasonable. I hope the "the sky is falling" comments will stop, though.

In summary, as has been mentioned by several people in this thread: you (not directed at you personally, Ryan) always have to check where your program's slowness comes from; otherwise your test/benchmark is worse than useless (because it just creates confusion and wastes other people's time, until they get tired of the mailing list getting flooded by the same information-less statements over and over again).

Also in summary, very little was learned from this. We have known for a long time that FPC needs SSA for better code generation for loops (and Florian has been working on it for a long time too).


Jonas
_______________________________________________
fpc-pascal maillist  -  fpc-pascal@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal

Reply via email to