Re: [Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code

2013-09-18 Thread Mark Morgan Lloyd

Bart wrote:

On 9/17/13, Mark Morgan Lloyd markmll.laza...@telemetry.co.uk wrote:

Hans-Peter Diettrich wrote:



I've missed part of this thread due to messages dropped at our gateway,
but am currently trying to check on SPARC Linux.


Here's the complete code for you:


Code as posted plus your earlier event handler gives output as below:

Testing: 2-byte codepoint: $C2 $A2
Expected Length = 8
Found Length= 8
Expected: $C2 $A2 $C2 $A2 $C2 $A2 $C2 $A2
Found   : $C2 $A2 $C2 $A2 $C2 $A2 $C2 $A2
Success!

Testing: 3-byte codepoint: $E2 $82 $AC
Expected Length = 12
Found Length= 12
Expected: $E2 $82 $AC $E2 $82 $AC $E2 $82 $AC $E2 $82 $AC
Found   : $E2 $82 $AC $E2 $82 $AC $E2 $82 $AC $E2 $82 $AC
Success!

Testing: 4-byte codepoint: $F0 $A4 $AD $A2
Expected Length = 16
Found Length= 16
Expected: $F0 $A4 $AD $A2 $F0 $A4 $AD $A2 $F0 $A4 $AD $A2 $F0 $A4 $AD $A2
Found   : $F0 $A4 $AD $A2 $F0 $A4 $AD $A2 $F0 $A4 $AD $A2 $F0 $A4 $AD $A2
Success!

--
Mark Morgan Lloyd
markMLl .AT. telemetry.co .DOT. uk

[Opinions above are the author's, not those of his employers or colleagues]

--
___
Lazarus mailing list
Lazarus@lists.lazarus.freepascal.org
http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus


Re: [Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code

2013-09-18 Thread Mark Morgan Lloyd

José Mejuto wrote:

I have not fully checked your code, but I think you are always encoding the
UTF16 as little endian, and on big endian machines the UTF16 should be
big endian too.


I do not have a big endian machine in order to perform some tests.


I generally have 32-bit big-endian systems available, but not 
necessarily with the most recent Lazarus and/or FPC so if in doubt it's 
safer to code a completely standalone example (i.e. as Bart did).


--
Mark Morgan Lloyd
markMLl .AT. telemetry.co .DOT. uk

[Opinions above are the author's, not those of his employers or colleagues]



Re: [Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code

2013-09-18 Thread José Mejuto

El 17/09/2013 17:25, Bart escribió:


I was thinking of performance for calls like Utf8StringOfChar(AnUtf8Char, 32768).
That should perform better if Fill(D)Word can be used, I think.


Hello,

Yes, it should, but you can add a check so that when fewer than 64 bytes
are to be written, you skip Fill(D)Word and use a plain for loop to fill
the data. I'm quite sure that you will get a bit more speed.
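
A minimal sketch of such a size check for the 2-byte case. The helper name RepeatTwoBytes and the constant SmallLimit are made up for illustration, and the 64-byte limit is simply the figure mentioned above, not a measured value:

```pascal
program ThresholdDemo;
{$mode objfpc}{$H+}

const
  SmallLimit = 64; // bytes; the figure suggested above, not a benchmark result

// Hypothetical helper: repeat a 2-byte sequence N times.
function RepeatTwoBytes(const S: AnsiString; N: Integer): AnsiString;
var
  i: Integer;
  PC: PChar;
begin
  Result := '';
  if (N <= 0) or (Length(S) <> 2) then Exit;
  SetLength(Result, 2 * N);
  if 2 * N < SmallLimit then
  begin
    // Short result: plain loop, no call into the RTL fill routine
    PC := PChar(Result);
    for i := 1 to N do
    begin
      PC^ := S[1]; Inc(PC);
      PC^ := S[2]; Inc(PC);
    end;
  end
  else
    // Long result: let the RTL fill whole words
    FillWord(Result[1], N, PWord(Pointer(S))^);
end;

begin
  WriteLn(Length(RepeatTwoBytes(#$C2#$A2, 4)));      // takes the loop path
  WriteLn(Length(RepeatTwoBytes(#$C2#$A2, 32768)));  // takes the FillWord path
end.
```

Whether the extra branch pays off would have to be measured on the target CPU.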



I have not fully checked your code, but I think you are always encoding the
UTF16 as little endian, and on big endian machines the UTF16 should be
big endian too.


In my coding (in this thread) there is no UTF16...



Yes, sorry... Age is killing me :) I saw the shl and the LEToN and
immediately thought of UTF16. Sorry.



--




Re: [Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code

2013-09-18 Thread Bart
On 9/18/13, Mark Morgan Lloyd markmll.laza...@telemetry.co.uk wrote:

 Bart wrote:
 On 9/17/13, Mark Morgan Lloyd markmll.laza...@telemetry.co.uk wrote:
 Hans-Peter Diettrich wrote:

 I've missed part of this thread due to messages dropped at our gateway,
 but am currently trying to check on SPARC Linux.

OK, so it works as expected.
I will commit it then.

Any optimizations can be added later.
Please post them in the bugtracker, and (if you have the rights) assign them to me.

Thanks to everybody here for the help.

Bart



Re: [Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code

2013-09-17 Thread Bart
On 9/17/13, Hans-Peter Diettrich drdiettri...@aol.com wrote:

 function Utf8StringOfChar(AUtf8Char: Utf8String; N: Integer): Utf8String;
 var
   UCharLen, i, nb: Integer;
   PC: PChar;
 begin
   Result := '';
   UCharLen := Length(AUtf8Char);
   nb := N * UCharLen;
   if nb <= 0 then exit;
   //3:
   Result := AUtf8Char;
   SetLength(Result, nb);
   PC := PChar(Result);
   for i := 1 to nb - UCharLen do
   begin
     PC[UCharLen] := PC[0];
     inc(PC);
   end;
 end;


I would think that FillWord and FillDWord are faster than that.
For the 3-byte sequence I did almost the same as you.

More to the point: does it produce correct results on BigEndian?
Theory (as in: as far as I can see and compute in my head) says yes,
but I would like confirmation (for my second code example).
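
A small standalone check of that theory, as a sketch only: the two bytes are read as one Word in native byte order and written back in the same native order, so the byte sequence in memory should come out unchanged on either kind of machine. The program name and variables are made up for illustration.

```pascal
program FillWordEndianDemo;
{$mode objfpc}{$H+}
uses
  SysUtils;

var
  Src, Dst: AnsiString;
  W: Word;
  i: Integer;
begin
  Src := #$C2#$A2;            // UTF-8 bytes of the cent sign
  SetLength(Dst, 2 * 4);
  // Read the two bytes as one Word, in whatever byte order the CPU uses ...
  W := PWord(Pointer(Src))^;
  // ... and write that Word back four times, again in native order.
  FillWord(Dst[1], 4, W);
  for i := 1 to Length(Dst) do
    Write('$', IntToHex(Ord(Dst[i]), 2), ' ');
  WriteLn;                    // prints $C2 $A2 four times over
end.
```

The numeric value of W differs between little- and big-endian machines, but it is never interpreted as a number, only copied.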

Bart



Re: [Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code

2013-09-17 Thread Mark Morgan Lloyd

Hans-Peter Diettrich wrote:

Bart schrieb:


Did you also test the simpler approach, replicating the pattern in one
loop? It's independent of endianness, and can boil down to a single
machine instruction (x86: REP MOVS).


It would be repeating either 2,3, or 4-bytes each time.
How would you code that?


I would not care.


I've missed part of this thread due to messages dropped at our gateway, 
but am currently trying to check on SPARC Linux.


Bart, where's Utf8Length() imported from, or do you have your own
implementation that's newer than Lazarus 1.0 + FPC 2.6.0?


--
Mark Morgan Lloyd
markMLl .AT. telemetry.co .DOT. uk

[Opinions above are the author's, not those of his employers or colleagues]



Re: [Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code

2013-09-17 Thread José Mejuto

El 17/09/2013 10:48, Bart escribió:


I would think that FillWord and FillDWord are faster than that.
For the 3-byte sequence I did almost the same as you.


Hello,

On most architectures (32- and 64-bit at least), memory moves of fewer
than 16 bytes are faster when done one byte at a time. The function call
overhead alone kills any possible performance gain.



More to the point: does it produce correct results on BigEndian?
Theory (as in: as far as I can see and compute in my head) says yes,
but I would like confirmation (for my second code example).


I have not fully checked your code, but I think you are always encoding the
UTF16 as little endian, and on big endian machines the UTF16 should be
big endian too.


I do not have a big endian machine in order to perform some tests.


--

José Mejuto



Re: [Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code

2013-09-17 Thread Bart
On 9/17/13, José Mejuto joshy...@gmail.com wrote:

 On most architectures (32- and 64-bit at least), memory moves of fewer
 than 16 bytes are faster when done one byte at a time. The function call
 overhead alone kills any possible performance gain.

I was thinking of performance for calls like Utf8StringOfChar(AnUtf8Char, 32768).
That should perform better if Fill(D)Word can be used, I think.


 I have not fully checked your code, but I think you are always encoding the
 UTF16 as little endian, and on big endian machines the UTF16 should be
 big endian too.

In my coding (in this thread) there is no UTF16...

Bart



Re: [Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code

2013-09-17 Thread Bart
On 9/17/13, Mark Morgan Lloyd markmll.laza...@telemetry.co.uk wrote:


 I've missed part of this thread due to messages dropped at our gateway,
 but am currently trying to check on SPARC Linux.

Sh@# happens...


 Bart, where's Utf8Length() imported from, or do you have your own
 implementation that's newer than Lazarus 1.0 + FPC 2.6.0?

No, sorry, it's in LazUtf8.
UTF8 isn't so much fun that I wanted to write my own function for that ;-)

Bart



Re: [Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code

2013-09-17 Thread Bart
On 9/17/13, Hans-Peter Diettrich drdiettri...@aol.com wrote:
 So, I conclude they perform roughly the same.

 Thanks for testing :-)

 Did you only test the 3-byte case, or also for WORD and DWORD cases? I

My FillDWord code is 20 times faster than the code used for the 3-byte
sequence (with N being equal, so Length(Result) being 1.33 times bigger).

The FillWord code is 28 times faster than the 3-byte code (with N for the
2-byte case being 1.5 times N for the 3-byte case, so Length(Result) being equal).

Testing with all results being of equal length:
1 calls to Utf8StringOfChar (12 * 1-byte): 15 ticks.
1 calls to Utf8StringOfChar (6  * 2-byte): 16 ticks.  //FillWord
1 calls to Utf8StringOfChar (4  * 3-byte): 561 ticks.
1 calls to Utf8StringOfChar (3  * 4-byte): 16 ticks. //FillDword

Bart



Re: [Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code

2013-09-17 Thread Hans-Peter Diettrich

Bart schrieb:

On 9/17/13, Mark Morgan Lloyd markmll.laza...@telemetry.co.uk wrote:

Hans-Peter Diettrich wrote:


HPD code

3:
begin
  Result := AUtf8Char;
  SetLength(Result, nb);
  PC := PChar(Result);
  for i:=1 to nb - UCharLen do
  begin
    PC[UCharLen] := PC[0];  //very nice b.t.w.
    inc(PC);
  end;
end;

My code

3:
begin
  SetLength(Result, 3 * N);
  C1 := AUtf8Char[1];
  C2 := AUtf8Char[2];
  C3 := AUtf8Char[3];
  PC := PChar(Result);
  for i:=1 to N do
  begin
    PC^ := C1; inc(PC);
    PC^ := C2; inc(PC);
    PC^ := C3; inc(PC);
  end;
end;

I tested that for speed, and though they consistently differ by about
2-6%, on further inspection this depends on the order in which the two are
executed: whichever runs first is always the slower...
So, I conclude they perform roughly the same.


Thanks for testing :-)

Did you only test the 3-byte case, or also for WORD and DWORD cases? I 
left the 3: only as a comment to the pattern taken from your code, 
while my version works for any length of the AUTF8Char, not only for the 
3-byte case. It's suited to allow for compiler optimizations, so that 
e.g. REP MOVSB or other streaming instructions could be used for the 
entire loop, as available for a target CPU. This feature makes the 
function usable even for replication of entire strings, of any encoding, 
when the parameter and result type is changed to RawByteString.
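
A sketch of that generalisation, under the assumption that the same loop is simply retyped to RawByteString. The function name ReplicateBytes is hypothetical and not part of LazUtf8:

```pascal
program ReplicateDemo;
{$mode objfpc}{$H+}

// Hypothetical generalisation of the loop above: repeat any byte string N times.
function ReplicateBytes(const S: RawByteString; N: Integer): RawByteString;
var
  SLen, Total, i: Integer;
  PC: PChar;
begin
  Result := '';
  SLen := Length(S);
  if (N <= 0) or (SLen = 0) then Exit;
  Total := N * SLen;
  Result := S;
  SetLength(Result, Total);   // keeps the first SLen bytes, makes Result unique
  PC := PChar(Result);
  for i := 1 to Total - SLen do
  begin
    PC[SLen] := PC[0];        // copy each byte SLen positions ahead
    Inc(PC);
  end;
end;

begin
  WriteLn(ReplicateBytes('abc', 3));                 // abcabcabc
  WriteLn(Length(ReplicateBytes(#$E2#$82#$AC, 4)));  // 12
end.
```

Each byte is copied forward by one pattern length, so the bytes written early in the loop become the source for later copies; no case distinction on the pattern length is needed.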


DoDi




Re: [Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code

2013-09-17 Thread Bart
On 9/17/13, Mark Morgan Lloyd markmll.laza...@telemetry.co.uk wrote:
 Hans-Peter Diettrich wrote:

HPD code

3:
begin
  Result := AUtf8Char;
  SetLength(Result, nb);
  PC := PChar(Result);
  for i:=1 to nb - UCharLen do
  begin
    PC[UCharLen] := PC[0];  //very nice b.t.w.
    inc(PC);
  end;
end;

My code

3:
begin
  SetLength(Result, 3 * N);
  C1 := AUtf8Char[1];
  C2 := AUtf8Char[2];
  C3 := AUtf8Char[3];
  PC := PChar(Result);
  for i:=1 to N do
  begin
    PC^ := C1; inc(PC);
    PC^ := C2; inc(PC);
    PC^ := C3; inc(PC);
  end;
end;

I tested that for speed, and though they consistently differ by about
2-6%, on further inspection this depends on the order in which the two are
executed: whichever runs first is always the slower...
So, I conclude they perform roughly the same.
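
A rough sketch of a timing harness that tries to reduce such ordering effects, by doing one untimed warm-up call and by running the candidates in both orders. It is not the harness used for the figures in this thread. FillA and FillB are stand-in workloads, and GetTickCount64 from SysUtils is assumed to be available (FPC 3.0 or later):

```pascal
program TimingDemo;
{$mode objfpc}{$H+}
uses
  SysUtils;

type
  TFillProc = procedure(N: Integer);

var
  Sink: AnsiString; // global, so the results are actually stored somewhere

// Stand-in candidate A: RTL StringOfChar
procedure FillA(N: Integer);
begin
  Sink := StringOfChar('a', N);
end;

// Stand-in candidate B: plain loop
procedure FillB(N: Integer);
var
  i: Integer;
begin
  SetLength(Sink, N);
  for i := 1 to N do Sink[i] := 'b';
end;

function TimeIt(P: TFillProc; N, Calls: Integer): QWord;
var
  i: Integer;
  T0: QWord;
begin
  P(N);                        // warm-up call, not timed
  T0 := GetTickCount64;
  for i := 1 to Calls do P(N);
  Result := GetTickCount64 - T0;
end;

begin
  // Run the candidates in both orders to expose ordering effects.
  WriteLn('A: ', TimeIt(@FillA, 12000, 10000), ' ticks');
  WriteLn('B: ', TimeIt(@FillB, 12000, 10000), ' ticks');
  WriteLn('B: ', TimeIt(@FillB, 12000, 10000), ' ticks');
  WriteLn('A: ', TimeIt(@FillA, 12000, 10000), ' ticks');
end.
```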

Bart



Re: [Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code

2013-09-17 Thread Bart
On 9/17/13, Mark Morgan Lloyd markmll.laza...@telemetry.co.uk wrote:
 Hans-Peter Diettrich wrote:

 I've missed part of this thread due to messages dropped at our gateway,
 but am currently trying to check on SPARC Linux.

Here's the complete code for you:

function Utf8StringOfChar(AUtf8Char: Utf8String; N: Integer): Utf8String;
var
  UCharLen, i: Integer;
  C1, C2, C3: Char;
  PC: PChar;
begin
  Result := '';
  if (N <= 0) or (Utf8Length(AUtf8Char) <> 1) then Exit;
  UCharLen := Length(AUtf8Char);
  case UCharLen of
    1: Result := StringOfChar(AUtf8Char[1], N);
    2:
    begin
      SetLength(Result, 2 * N);
      System.FillWord(Result[1], N, PWord(Pointer(AUtf8Char))^);
    end;
    3:
    begin
      SetLength(Result, 3 * N);
      C1 := AUtf8Char[1];
      C2 := AUtf8Char[2];
      C3 := AUtf8Char[3];
      PC := PChar(Result);
      for i:=1 to N do
      begin
        PC^ := C1; inc(PC);
        PC^ := C2; inc(PC);
        PC^ := C3; inc(PC);
      end;
    end;
    4:
    begin
      SetLength(Result, 4 * N);
      System.FillDWord(Result[1], N, PDWord(Pointer(AUtf8Char))^);
    end;
    else
    begin
      //In November 2003 UTF-8 was restricted by RFC 3629 to four bytes to match
      //the constraints of the UTF-16 character encoding.
      //http://en.wikipedia.org/wiki/UTF-8
      Result := StringOfChar('?', N);
    end;
  end;
end;

Bart



[Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code

2013-09-16 Thread Bart
Hi,

Current code for Utf8StringOfChar that I wrote (in LazUtf8 unit) may
fail due to Utf8 -> UTF16 -> FillWord -> Utf8 conversions, which only
work for UCS2, as Mattias pointed out to me.
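
To illustrate the UCS2 limitation with a worked example (a sketch, not the LazUtf8 code): a code point above $FFFF needs two UTF-16 code units, so a single Word passed to FillWord cannot hold it. The code point below is the 4-byte test sample $F0 $A4 $AD $A2, i.e. U+24B62; the variable names are made up for illustration.

```pascal
program SurrogateDemo;
{$mode objfpc}{$H+}
uses
  SysUtils;

const
  // U+24B62, encoded in UTF-8 as $F0 $A4 $AD $A2
  CodePoint = $24B62;

var
  V: LongWord;
  HighSur, LowSur: Word;
begin
  // Code points above $FFFF do not fit in one UTF-16 code unit;
  // they are split into a high and a low surrogate.
  V := CodePoint - $10000;
  HighSur := Word($D800 + (V shr 10));
  LowSur  := Word($DC00 + (V and $3FF));
  WriteLn('High surrogate: $', IntToHex(HighSur, 4));  // $D852
  WriteLn('Low surrogate : $', IntToHex(LowSur, 4));   // $DF62
end.
```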

I constructed a new Utf8StringOfChar function that builds UTF8 without
conversions.
For speed reasons it uses FillWord or FillDWord when appropriate.

To be sure this is Endian safe (It works AFAICS on Windows i386) I
need Mac users (or testers with other BigEndian architecture) to test
the code:

function Utf8StringOfChar(AUtf8Char: Utf8String; N: Integer): Utf8String;
var
  UCharLen, i: Integer;
  W: Word;
  DW: DWORD;
  C1, C2, C3: Char;
  PC: PChar;
begin
  Result := '';
  if Utf8Length(AUtf8Char) <> 1 then Exit;
  UCharLen := Length(AUtf8Char);
  case UCharLen of
    1: Result := StringOfChar(AUtf8Char[1], N);
    2:
    begin
      SetLength(Result, 2 * N);
      W := Byte(AUtf8Char[1]) + (Word(Byte(AUtf8Char[2])) shl 8);
      W := LeToN(W);
      System.FillWord(Result[1], N, W);
    end;
    3:
    begin
      SetLength(Result, 3 * N);
      C1 := AUtf8Char[1];
      C2 := AUtf8Char[2];
      C3 := AUtf8Char[3];
      PC := PChar(Result);
      for i:=1 to N do
      begin
        PC^ := C1; inc(PC);
        PC^ := C2; inc(PC);
        PC^ := C3; inc(PC);
      end;
    end;
    4:
    begin
      SetLength(Result, 4 * N);
      DW := Byte(AUtf8Char[1]) + (Word(Byte(AUtf8Char[2])) shl 8) +
            (Byte(AUtf8Char[3]) + (Word(Byte(AUtf8Char[4])) shl 8)) shl 16;
      DW := LeToN(DW);
      System.FillDWord(Result[1], N, DW);
    end;
    else
    begin
      //In November 2003 UTF-8 was restricted by RFC 3629 to four bytes to match
      //the constraints of the UTF-16 character encoding.
      //http://en.wikipedia.org/wiki/UTF-8
      Result := StringOfChar('?', N);
    end;
  end;
end;


And here's the testing code, in the OnClick event of a Button:

It uses a console for output, so either run it from console, redirect
output, or replace the write/writeln with e.g. memo1.lines.add().

procedure TForm1.Button6Click(Sender: TObject);
var
  ResS: String;
  UChar: String;
  Expected: String;
  i,j,k: Integer;
const
  N = 4;
  Utf8Sample: Array[1..3] of String = (#$C2#$A2,         // ¢
                                       #$E2#$82#$AC,     // €
                                       #$F0#$A4#$AD#$A2  // 𤭢 (U+24B62)
                                       );
begin
  for k := 1 to 3 do
  begin
    UChar := Utf8Sample[k];
    Expected := '';
    for i := 1 to N do Expected := Expected + UChar;

    ResS := Utf8StringOfChar(UChar, N);
    write('Testing: ');
    write(Length(UChar),'-byte codepoint: ');
    for j := 1 to length(UChar) do write('$',IntToHex(Ord(UChar[j]),2),' ');
    writeln;
    writeln('Expected Length = ',Length(Expected));
    writeln('Found Length= ',Length(ResS));
    write('Expected: ');
    for i := 1 to length(Expected) do
      write('$',IntToHex(Ord(Expected[i]),2),' ');
    writeln;
    write('Found   : ');
    for i := 1 to length(ResS) do write('$',IntToHex(Ord(ResS[i]),2),' ');
    writeln;
    if ResS <> Expected then
    begin
      if Length(ResS) <> Length(Expected) then
        writeln('Different Lengths')
      else
      begin
        // Same length but different content: report the first mismatch
        i := 1;
        while (length(ResS) >= i) and (ResS[i] = Expected[i]) do Inc(i);
        writeln('Fail: at position ',i,':  Expected = $',
                IntToHex(Ord(Expected[i]),2),' Found = $',
                IntToHex(Ord(ResS[i]),2));
      end;
    end
    else writeln('Success!');
    writeln;
  end;
end;


B.t.w., the new code isn't in trunk yet; I'd rather fix it first if
it's broken on BigEndian machines.

Thanks in advance.

Bart



Re: [Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code

2013-09-16 Thread Bart
On 9/16/13, Hans-Peter Diettrich drdiettri...@aol.com wrote:


 Did you also test the simpler approach, replicating the pattern in one
 loop? It's independent of endianness, and can boil down to a single
 machine instruction (x86: REP MOVS).

It would be repeating either 2,3, or 4-bytes each time.
How would you code that?



Simplified version, should be Endian safe:

function Utf8StringOfChar(AUtf8Char: Utf8String; N: Integer): Utf8String;
var
  UCharLen, i: Integer;
  C1, C2, C3: Char;
  PC: PChar;
begin
  Result := '';
  if (N <= 0) or (Utf8Length(AUtf8Char) <> 1) then Exit;
  UCharLen := Length(AUtf8Char);
  case UCharLen of
    1: Result := StringOfChar(AUtf8Char[1], N);
    2:
    begin
      SetLength(Result, 2 * N);
      System.FillWord(Result[1], N, PWord(Pointer(AUtf8Char))^);
    end;
    3:
    begin
      SetLength(Result, 3 * N);
      C1 := AUtf8Char[1];
      C2 := AUtf8Char[2];
      C3 := AUtf8Char[3];
      PC := PChar(Result);
      for i:=1 to N do
      begin
        PC^ := C1; inc(PC);
        PC^ := C2; inc(PC);
        PC^ := C3; inc(PC);
      end;
    end;
    4:
    begin
      SetLength(Result, 4 * N);
      System.FillDWord(Result[1], N, PDWord(Pointer(AUtf8Char))^);
    end;
    else
    begin
      //In November 2003 UTF-8 was restricted by RFC 3629 to four bytes to match
      //the constraints of the UTF-16 character encoding.
      //http://en.wikipedia.org/wiki/UTF-8
      Result := StringOfChar('?', N);
    end;
  end;
end;

Bart



Re: [Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code

2013-09-16 Thread Hans-Peter Diettrich

Bart schrieb:

Hi,

Current code for Utf8StringOfChar that I wrote (in LazUtf8 unit) may
fail due to Utf8 -> UTF16 -> FillWord -> Utf8 conversions, which only
work for UCS2, as Mattias pointed out to me.

I constructed a new Utf8StringOfChar function that builds UTF8 without
conversions.
For speed reasons it uses FillWord or FillDWord when appropriate.


Did you also test the simpler approach, replicating the pattern in one 
loop? It's independent of endianness, and can boil down to a single 
machine instruction (x86: REP MOVS).


DoDi




Re: [Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code

2013-09-16 Thread Hans-Peter Diettrich

Bart schrieb:


Did you also test the simpler approach, replicating the pattern in one
loop? It's independent of endianness, and can boil down to a single
machine instruction (x86: REP MOVS).


It would be repeating either 2,3, or 4-bytes each time.
How would you code that?


I would not care.


function Utf8StringOfChar(AUtf8Char: Utf8String; N: Integer): Utf8String;
var
  UCharLen, i, nb: Integer;
  PC: PChar;
begin
  Result := '';
  UCharLen := Length(AUtf8Char);
  nb := N * UCharLen;
  if nb <= 0 then exit;
  //3:
  Result := AUtf8Char;
  SetLength(Result, nb);
  PC := PChar(Result);
  for i := 1 to nb - UCharLen do
  begin
    PC[UCharLen] := PC[0];
    inc(PC);
  end;
end;


DoDi

