Re: [fpc-devel] Kit's ambitions!

2018-05-16 Thread J. Gareth Moreton
 Unless I'm mistaken, Wolf, you cannot inline procedures that have asm
blocks appearing anywhere (this includes the entire procedure). 
Nevertheless, does the disassembly of your program show it to be inlined?

 Gareth aka. Kit
 ___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Kit's ambitions!

2018-05-16 Thread Wolf



On 14/05/2018 04:30, David Pethes wrote:

> Hi,
> I would welcome inlining of (simple) asm routines.

I do not know what you consider to be the existing obstacles to inlining
assembler routines. What I do know is that in the attached program,
inlining does work. The program summarises my (current) understanding of
how to measure time with nanosecond reliability. Asking for the time via
the Linux call "clock_gettime(CLOCK_MONOTONIC, @ts)" does return
nanoseconds, but the call itself takes some 270 ns (about 1000 clock
ticks) to execute and thus does not give nanosecond reliability.
Repeated measurements with my little program do not produce the same
output either, so it does not yet have the reliability I want.
Statistical processing improves the situation, but not as much as I
would like.
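
The cost of the clock_gettime call mentioned above can be estimated with a loop like the one below. This is only a minimal sketch, not the attached program: it assumes the Linux unit declares clock_gettime and CLOCK_MONOTONIC the way the call above uses them, and that timespec comes from UnixType. MonotonicNs and Calls are hypothetical names chosen for illustration.

```pascal
program ClockOverhead;
{$mode objfpc}

uses
  UnixType, Linux;

const
  Calls = 100000;  // hypothetical sample size, chosen only for illustration

// Read CLOCK_MONOTONIC and return it as one nanosecond count.
function MonotonicNs: Int64;
var
  ts: timespec;
begin
  if clock_gettime(CLOCK_MONOTONIC, @ts) <> 0 then
    Halt(1);
  Result := Int64(ts.tv_sec) * 1000000000 + ts.tv_nsec;
end;

var
  i: Integer;
  StartNs, StopNs, Dummy: Int64;
begin
  Dummy := 0;
  StartNs := MonotonicNs;
  for i := 1 to Calls do
    Dummy := Dummy xor MonotonicNs;  // use the result so every call is kept
  StopNs := MonotonicNs;
  // Average wall-clock cost of one call, including the loop overhead.
  WriteLn('average cost per call: ', (StopNs - StartNs) / Calls:0:1, ' ns');
end.
```

The figure it prints will vary between machines and runs, which is the same variability described above.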


What I can say about inlining assembler routines is this: if the
variables into which the registers are saved are on the stack, the
routines can be inlined. Never mind the hints in Lazarus' message pane.
Take the following function:

function GetProcessorUsed: longint;    inline;
var
  ProcUsed: longint;
begin
  asm
    CPUID
    .byte 0x0F, 0x01, 0xF9  // read the Time-Stamp Counter rdtscp (as op-code format),
    movl %ecx, ProcUsed      // This is the processor on which measurements take place. Measurements on other processors are discarded.
  end  ['eax','ebx','ecx','edx'];
  GetProcessorUsed:=ProcUsed;
end;

Because /ProcUsed/ is on the stack, I can move %ecx into it. But I 
cannot get %ecx directly into /GetProcessorUsed/. That requires a 
separate line of code.


wolf

Here is the full code, as promised. If anybody has a suggestion on how 
to improve it, please let me know, in a separate thread.


program Speed_Test;
{$ASMMODE att}

uses sysutils, Linux, math;
type
  TtscCount = record
    Group: longint;
    Count: longint;
    CumFreq: Int64;
  end;
type
  TCumFreq = record
    Group: longint;
    CumFreq: real;
  end;
  TCumFrequency= array of TCumFreq;
  TTimeSpec = record
    tv_sec: int64;  //time_t;    //Seconds
    tv_nsec: int64; //clong; //Nanoseconds
  end;
var
  TscCount: array of TtscCount;
  Measured: TCumFrequency;
  MeasurementsToDo: int64=100;
  ProcessorUsed: LongInt;
  Range: array[0..] of longint;
  ValidMeasurements: Int64;

function Get_ClockFreq(CPU: Char): real;
{Since there is no way I can find to extract actual clock frequency, I 
read it from /proc/cpuinfo }

var
  FileHandle: LongInt;
  i: integer;
  Data: ansistring;
  rc:real;
  NumRead: int64;
  Buffer : packed array[0..4095] of char;
  SourceFile: AnsiString= '/proc/cpuinfo';
begin
  if not FileExists(SourceFile) then
  begin
    writeln('Error: Input file "',SourceFile,'" has not been found');
    halt;
  end;
  FileHandle:=FileOpen('/proc/cpuinfo',fmOpenRead);
  NumRead:=FileRead(FileHandle, Buffer,SizeOf(Buffer));
  Data:=Buffer[0..NumRead];
  i:=0;
  while i<=NumRead do
  begin
    inc(i);
    if CompareText(Data[i..i+8],'Processor')=0 then
    begin
      if char(Data[i+12])=CPU then
      begin
        i:=i+12;
        repeat inc(i); until CompareText(Data[i..i+6],'cpu MHz')=0 ;
        try
          rc:=StrToFloat(Data[i+11..i+18]);
        except
          on E : exception do
          begin
            writeln('Data read error: cannot convert ',Data[i+11..i+18],' into number');
            writeln('Program aborted');
            halt;
          end;
        end;
        break;
      end;
    end;
  end;
  FileClose(FileHandle);
  Get_ClockFreq:=rc;
end;

procedure ReadProcessorFrequencyInformationLeaf;  inline;
var
  CPUID_16H_AX: Word;  // Processor Base Frequency (in MHz)
  CPUID_16H_BX: Word;  // Maximum Frequency (in MHz)
  CPUID_16H_CX: Word;  // Bus (Reference) frequency (in MHz)
  CPUID_16H_DX: Word;  // Reserved = 0
begin
  CPUID_16H_AX:=0;
  CPUID_16H_BX:=0;
  CPUID_16H_CX:=0;
  asm
    mov $0x16, %eax   // select Processor Frequency Information Leaf 0x16
    cpuid // access it
    mov %ax, CPUID_16H_AX // Processor Base Frequency (in MHz)
    mov %bx, CPUID_16H_BX // Maximum Frequency (in MHz)
    mov %cx, CPUID_16H_CX // Bus (Reference) frequency (in MHz)
    mov %dx, CPUID_16H_DX  // Reserved = 0
  end  ['ax','bx','cx','dx'];
end;

function GetProcessorUsed: longint;    inline;
var
  ProcUsed: longint;
begin
  asm
    CPUID
    .byte 0x0F, 0x01, 0xF9  // read the Time-Stamp Counter rdtscp (as op-code format),
    movl %ecx, ProcUsed    // This is the processor on which measurements take place. Measurements on other processors are discarded.
  end  ['eax','ebx','ecx','edx'];
  GetProcessorUsed:=ProcUsed;
end;

procedure MeasureCode;
var
  ts: TTimeSpec;
  MilliSecondTime: extended;
  AX, BX, CX: Word;
  Start,Stop,i,k,l: int64;   // saves starting value from the Time Stamp counter
  Hi: int64;
  x:real;
  y: real=2;
  ProcessorUsed_Start, 

Re: [fpc-devel] Debugging Loop Unroll Optimization

2018-05-16 Thread Florian Klämpfl
On 16.05.2018 at 14:57, Martok wrote:
> Hi all,
> 
> as we have discovered in 0033576 and 0033614, there is a bug somewhere in the
> loop unroll optimization pass that only appears in complex enough code. The
> problem is, it doesn't happen in the single testsuite test, and the observable
> crashes in Lazarus are caused by memory corruption at some point unrelated to
> the crash, so it's hard to debug (at least on Windows, without rr...).
> 
> I have one other project that has sporadic crashes with -OoLOOPUNROLL (I wish I
> had figured that out back then), but that is about as difficult as Lazarus,
> where it's at least 100% reproducible.
> 
> Does anyone have more complete test cases, or maybe smaller affected projects?
> 

How big is the project? Normally, the number of unrolled loops is reasonably
small, so comparing the generated assembler with and without unrolling should
be feasible.
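
A comparison like that could start from a tiny candidate such as the one below. This is only a sketch: the program, the Checksum function and the trip count are hypothetical, nothing guarantees this particular loop gets unrolled or triggers the bug, and the compiler invocations in the comments are an assumption about how one would keep the assembler output (-al) and toggle the pass (-OoLOOPUNROLL / -OoNOLOOPUNROLL).

```pascal
program unrolltest;
{$mode objfpc}

// Assumed workflow (adjust to your setup):
//   fpc -O2 -OoLOOPUNROLL   -al unrolltest.pas   then rename unrolltest.s
//   fpc -O2 -OoNOLOOPUNROLL -al unrolltest.pas
// and diff the two assembler files.

const
  N = 8;  // small constant trip count, a plausible unrolling candidate

type
  TData = array[0..N - 1] of LongInt;

// Simple loop-carried computation so a wrong unroll changes the result.
function Checksum(const A: TData): LongInt;
var
  i: Integer;
begin
  Result := 0;
  for i := Low(A) to High(A) do
    Result := Result * 3 + A[i];
end;

var
  D: TData;
  i: Integer;
begin
  for i := Low(D) to High(D) do
    D[i] := i + 1;
  WriteLn(Checksum(D));
end.
```

Both builds should print the same value; a difference in output, or an unexpected difference in the two assembler files around the loop, would narrow down where to look.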


[fpc-devel] Debugging Loop Unroll Optimization

2018-05-16 Thread Martok
Hi all,

as we have discovered in 0033576 and 0033614, there is a bug somewhere in the
loop unroll optimization pass that only appears in complex enough code. The
problem is, it doesn't happen in the single testsuite test, and the observable
crashes in Lazarus are caused by memory corruption at some point unrelated to
the crash, so it's hard to debug (at least on Windows, without rr...).

I have one other project that has sporadic crashes with -OoLOOPUNROLL (I wish I
had figured that out back then), but that is about as difficult as Lazarus,
where it's at least 100% reproducible.

Does anyone have more complete test cases, or maybe smaller affected projects?

-- 
Regards,
Martok

Ceterum censeo b32079 esse sanandam.



Re: [fpc-devel] Kit's ambitions!

2018-05-16 Thread David Pethes
Hi,
I would welcome inlining of (simple) asm routines. Lately I wanted to
use the BEXTR instruction to speed up some inlined bit-reading
functions. As there's no intrinsic for it, and including even a simple
assembly routine disables inlining, it didn't go well.

As for using a BEXTR intrinsic instead: I'd like to try to add it, if
that's welcome. Judging by a search for POPCNT it shouldn't be that much
work, but I'm likely to miss something, so any advice is welcome.
There's at least one catch that I know of: there's no CPU target that
supports BMI1 but not BMI2 (there are several such AMD CPUs), so one
should be added as well.
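
For reference, the bit-field extraction that BEXTR performs can be written in plain Pascal so that it stays inlinable. This is a sketch of the fallback only, not the proposed intrinsic: ExtractBits is a hypothetical name, and whether the compiler turns it into a single BEXTR instruction is not assumed.

```pascal
program bextrdemo;
{$mode objfpc}

// Portable stand-in for BEXTR: return Len bits of Value, starting at bit Start.
function ExtractBits(Value: QWord; Start, Len: Byte): QWord; inline;
begin
  if (Len = 0) or (Start > 63) then
    Exit(0);
  Value := Value shr Start;
  // A length of 64 or more keeps everything that is left after the shift.
  if Len < 64 then
    Value := Value and ((QWord(1) shl Len) - 1);
  Result := Value;
end;

begin
  // Bits 4..11 of $ABCD are $BC.
  WriteLn(HexStr(ExtractBits($ABCD, 4, 8), 2));
end.
```

Because the routine contains no asm block, the inline directive can still take effect, which is the property the bit-reading functions above depend on.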


David

On 13. 5. 2018 4:28, J. Gareth Moreton wrote:

> - Research possibility for 'inline' support for certain assembler routines.
> 
> For situations where speed is of the highest priority, there are some
> internal functions such as Int and Frac that can theoretically be
> inlined (a procedure call is quite expensive, around 50 cycles), but
> because they are written in pure assembly language, the compiler will
> never inline them.  I'm still working out quite a bit of theory, but I
> believe I will be able to allow the inlining of routines that are leaf
> functions (don't have CALLs of their own) and declared as
> 'nostackframe'.  Such a system would allow the support of 'intrinsics'
> that can be composed programmatically rather than as internal routines,
> though it's not exactly what Florian is planning. Even if Florian does
> go for a different approach for intrinsics, I like to think that such
> inline support will have uses elsewhere, especially some of the routines
> like "GetStackFrame" (I think) that simply return the value of RSP (if
> it's 'inline', which it is actually declared as in the unit, the return
> value will be far more accurate).