Re: [Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code
On 9/18/13, Mark Morgan Lloyd wrote:
> Bart wrote:
>> On 9/17/13, Mark Morgan Lloyd wrote:
>>> Hans-Peter Diettrich wrote:
>>
>>> I've missed part of this thread due to messages dropped at our gateway,
>>> but am currently trying to check on SPARC Linux.

OK, so it works as expected.
I will commit it then.
Any optimizations can be added later. Please post them in the bugtracker and (if you have the rights) assign them to me.

Thanks to everybody here for the help.

Bart

--
___
Lazarus mailing list
Lazarus@lists.lazarus.freepascal.org
http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
Re: [Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code
On 17/09/2013 17:25, Bart wrote:
> I thought of performance when calling like Utf8StringOfChar(AnUtf8Char, 32768).
> This should perform better if Fill(D)Word can be used, I think.

Hello,

Yes, it should, but you can add a check: if fewer than 64 bytes are to be written, do not call Fill(D)Word but use a plain for loop to fill the data. I'm quite sure that you will get a bit more speed.

>> I had not fully checked your code, but I think you are always encoding the
>> UTF16 as little endian, and on big endian machines the UTF16 should be
>> big endian also.
>
> In my code (in this thread) there is no UTF16...

Yes, sorry... Age is killing me :) I saw the "shr" and the LEToN and immediately thought of UTF16. Sorry.
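A minimal sketch of the size check suggested above, for the 2-byte case only. The function name Repeat2Byte and the constant SmallFillLimit are hypothetical, and the 64-byte cut-off is simply the figure from the message; the value that actually pays off would have to be measured on each platform.

```pascal
program ThresholdFill;
{$mode objfpc}{$H+}

const
  // Hypothetical cut-off, taken from the 64-byte figure in the message above.
  SmallFillLimit = 64;

// Repeats a 2-byte UTF-8 sequence N times (hypothetical helper).
function Repeat2Byte(const AChar: AnsiString; N: Integer): AnsiString;
var
  i: Integer;
  P: PAnsiChar;
begin
  Result := '';
  if (N <= 0) or (Length(AChar) <> 2) then Exit;
  SetLength(Result, 2 * N);
  if 2 * N < SmallFillLimit then
  begin
    // Short result: plain byte-wise loop, no extra procedure call.
    P := PAnsiChar(Result);
    for i := 1 to N do
    begin
      P^ := AChar[1]; Inc(P);
      P^ := AChar[2]; Inc(P);
    end;
  end
  else
    // Long result: the word is read and written in native byte order,
    // so the byte sequence in memory stays the same.
    FillWord(Result[1], N, PWord(Pointer(AChar))^);
end;

var
  S: AnsiString;
begin
  S := Repeat2Byte(#$C2#$A2, 4);    // takes the loop path
  WriteLn(Length(S));
  S := Repeat2Byte(#$C2#$A2, 100);  // takes the FillWord path
  WriteLn(Length(S));
end.
```

Both paths should produce identical bytes; only the way they are written differs.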
Re: [Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code
José Mejuto wrote:
> I had not fully checked your code, but I think you are always encoding the
> UTF16 as little endian, and on big endian machines the UTF16 should be
> big endian also. I do not have a big endian machine in order to perform some tests.

I generally have 32-bit big-endian systems available, but not necessarily with the most recent Lazarus and/or FPC, so if in doubt it's safer to code a completely standalone example (i.e. as Bart did).

--
Mark Morgan Lloyd
markMLl .AT. telemetry.co .DOT. uk

[Opinions above are the author's, not those of his employers or colleagues]
Re: [Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code
Bart wrote:
> On 9/17/13, Mark Morgan Lloyd wrote:
>> Hans-Peter Diettrich wrote:
>
>> I've missed part of this thread due to messages dropped at our gateway,
>> but am currently trying to check on SPARC Linux.
>
> Here's the complete code for you:

Code as posted plus your earlier event handler gives output as below:

Testing: 2-byte codepoint: $C2 $A2
Expected Length = 8
Found Length= 8
Expected: $C2 $A2 $C2 $A2 $C2 $A2 $C2 $A2
Found : $C2 $A2 $C2 $A2 $C2 $A2 $C2 $A2
Success!

Testing: 3-byte codepoint: $E2 $82 $AC
Expected Length = 12
Found Length= 12
Expected: $E2 $82 $AC $E2 $82 $AC $E2 $82 $AC $E2 $82 $AC
Found : $E2 $82 $AC $E2 $82 $AC $E2 $82 $AC $E2 $82 $AC
Success!

Testing: 4-byte codepoint: $F0 $A4 $AD $A2
Expected Length = 16
Found Length= 16
Expected: $F0 $A4 $AD $A2 $F0 $A4 $AD $A2 $F0 $A4 $AD $A2 $F0 $A4 $AD $A2
Found : $F0 $A4 $AD $A2 $F0 $A4 $AD $A2 $F0 $A4 $AD $A2 $F0 $A4 $AD $A2
Success!

--
Mark Morgan Lloyd
markMLl .AT. telemetry.co .DOT. uk

[Opinions above are the author's, not those of his employers or colleagues]
Re: [Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code
On 9/17/13, Mark Morgan Lloyd wrote:
> Hans-Peter Diettrich wrote:

> I've missed part of this thread due to messages dropped at our gateway,
> but am currently trying to check on SPARC Linux.

Here's the complete code for you:

function Utf8StringOfChar(AUtf8Char: Utf8String; N: Integer): Utf8String;
var
  UCharLen, i: Integer;
  C1, C2, C3: Char;
  PC: PChar;
begin
  Result := '';
  if (N <= 0) or (Utf8Length(AUtf8Char) <> 1) then Exit;
  UCharLen := Length(AUtf8Char);
  Case UCharLen of
    1: Result := StringOfChar(AUtf8Char[1], N);
    2:
    begin
      SetLength(Result, 2 * N);
      System.FillWord(Result[1], N, PWord(Pointer(AUtf8Char))^);
    end;
    3:
    begin
      SetLength(Result, 3 * N);
      C1 := AUtf8Char[1];
      C2 := AUtf8Char[2];
      C3 := AUtf8Char[3];
      PC := PChar(Result);
      for i:=1 to N do
      begin
        PC^ := C1; inc(PC);
        PC^ := C2; inc(PC);
        PC^ := C3; inc(PC);
      end;
    end;
    4:
    begin
      SetLength(Result, 4 * N);
      System.FillDWord(Result[1], N, PDWord(Pointer(AUtf8Char))^);
    end;
    else
    begin
      //In November 2003 UTF-8 was restricted by RFC 3629 to four bytes to match
      //the constraints of the UTF-16 character encoding.
      //http://en.wikipedia.org/wiki/UTF-8
      Result := StringOfChar('?', N);
    end;
  end;
end;

Bart
Re: [Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code
On 9/17/13, Mark Morgan Lloyd wrote:
> Hans-Peter Diettrich wrote:

HPD code:

    3:
    begin
      Result := AUtf8Char;
      SetLength(Result, nb);
      PC := PChar(Result);
      for i:=1 to nb - UCharLen do
      begin
        PC[UCharLen] := PC[0]; //very nice b.t.w.
        inc(PC);
      end;
    end;

My code:

    3:
    begin
      SetLength(Result, 3 * N);
      C1 := AUtf8Char[1];
      C2 := AUtf8Char[2];
      C3 := AUtf8Char[3];
      PC := PChar(Result);
      for i:=1 to N do
      begin
        PC^ := C1; inc(PC);
        PC^ := C2; inc(PC);
        PC^ := C3; inc(PC);
      end;
    end;

I tested both for speed. Although they consistently differ by about 2-6%, on further inspection this depends on the order in which the two are executed: the first one is always the slower...
So, I conclude they perform roughly the same.

Bart
Re: [Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code
Bart wrote:
> On 9/17/13, Mark Morgan Lloyd wrote:
>> Hans-Peter Diettrich wrote:
>
> HPD code:
>
>     3:
>     begin
>       Result := AUtf8Char;
>       SetLength(Result, nb);
>       PC := PChar(Result);
>       for i:=1 to nb - UCharLen do
>       begin
>         PC[UCharLen] := PC[0]; //very nice b.t.w.
>         inc(PC);
>       end;
>     end;
>
> My code:
>
>     3:
>     begin
>       SetLength(Result, 3 * N);
>       C1 := AUtf8Char[1];
>       C2 := AUtf8Char[2];
>       C3 := AUtf8Char[3];
>       PC := PChar(Result);
>       for i:=1 to N do
>       begin
>         PC^ := C1; inc(PC);
>         PC^ := C2; inc(PC);
>         PC^ := C3; inc(PC);
>       end;
>     end;
>
> I tested that for speed, and though they consistently differ about
> 2-6%, on further inspection this is dependent on the order the 2 are
> executed: the first one is always the slower...
> So, I conclude they perform roughly the same.

Thanks for testing :-)

Did you only test the 3-byte case, or also the WORD and DWORD cases? I left the "3:" in only as a comment marking the pattern taken from your code; my version works for any length of AUtf8Char, not only for the 3-byte case. It is written to allow for compiler optimizations, so that e.g. REP MOVSB or other streaming instructions could be used for the entire loop, as available on the target CPU.

This makes the function usable even for replicating entire strings, of any encoding, when the parameter and result types are changed to RawByteString.

DoDi
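A sketch of the generalisation described above, with the parameter and result changed to RawByteString so that any byte pattern can be replicated. The name ReplicateBytes is hypothetical, and this assumes an FPC version that declares RawByteString; it is an illustration of the overlapping forward copy, not code taken from LazUtf8.

```pascal
program ReplicateDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: repeats Pattern N times using the overlapping
// forward copy from the "HPD code" quoted above.
function ReplicateBytes(const Pattern: RawByteString; N: Integer): RawByteString;
var
  PatLen, Total, i: Integer;
  PC: PAnsiChar;
begin
  Result := '';
  PatLen := Length(Pattern);
  if (N <= 0) or (PatLen = 0) then Exit;
  Total := N * PatLen;
  Result := Pattern;
  // Keeps the first PatLen bytes and gives Result its own buffer.
  SetLength(Result, Total);
  PC := PAnsiChar(Result);
  // Each byte is copied PatLen positions ahead, so the pattern
  // propagates through the rest of the buffer.
  for i := 1 to Total - PatLen do
  begin
    PC[PatLen] := PC[0];
    Inc(PC);
  end;
end;

begin
  WriteLn(ReplicateBytes('abc', 3));                   // abcabcabc
  WriteLn(Length(ReplicateBytes(#$E2#$82#$AC, 4)));    // 12
end.
```

The loop never looks at the pattern length beyond the copy distance, which is why the same code serves 2-, 3- and 4-byte sequences as well as whole strings.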
Re: [Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code
On 9/17/13, Hans-Peter Diettrich wrote:
>> So, I conclude they perform roughly the same.
>
> Thanks for testing :-)
>
> Did you only test the 3-byte case, or also for WORD and DWORD cases? I

My FillDWord code is 20 times faster than the code used for the 3-byte sequence (with N being equal, so Length(Result) being 1.33 times bigger).
The FillWord code is 28 times faster than the 3-byte code (N for the 2-byte case being 1.5 * N for the 3-byte case, so Length(Result) being equal).

Testing with all results being of equal length:

1 calls to Utf8StringOfChar (12 * 1-byte): 15 ticks.
1 calls to Utf8StringOfChar (6 * 2-byte): 16 ticks. //FillWord
1 calls to Utf8StringOfChar (4 * 3-byte): 561 ticks.
1 calls to Utf8StringOfChar (3 * 4-byte): 16 ticks. //FillDword

Bart
Re: [Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code
On 9/17/13, Mark Morgan Lloyd wrote:
> I've missed part of this thread due to messages dropped at our gateway,
> but am currently trying to check on SPARC Linux.

Sh@# happens...

> Bart, where's Utf8Length() imported from, or do you have your own
> implementation that's newer than Lazarus 1.0 + FPC 2.6.0?

No, sorry, it's in LazUtf8.
UTF8 isn't so much fun that I wanted to write my own function for that ;-)

Bart
Re: [Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code
On 9/17/13, José Mejuto wrote:
> In most architectures (32 and 64 bits at least) memory moves of less
> than 16 bytes are faster done one by one. Just the call to the function
> kills the possible performance gain.

I thought of performance when calling like Utf8StringOfChar(AnUtf8Char, 32768).
This should perform better if Fill(D)Word can be used, I think.

> I had not fully checked your code, but I think you are always encoding the
> UTF16 as little endian, and on big endian machines the UTF16 should be
> big endian also.

In my code (in this thread) there is no UTF16...

Bart
Re: [Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code
On 17/09/2013 10:48, Bart wrote:
> I would think that FillWord and FillDWord are faster than that.
> For the 3-byte sequence I did almost the same as you.

Hello,

In most architectures (32 and 64 bits at least) memory moves of fewer than 16 bytes are faster done one by one. Just the call to the function kills the possible performance gain.

> More to the point: does it produce correct results on BigEndian?
> Theory (as in: as far as I can see and "compute in my head") says yes,
> but I would like confirmation (on my second code example).

I had not fully checked your code, but I think you are always encoding the UTF16 as little endian, and on big endian machines the UTF16 should be big endian also. I do not have a big endian machine in order to perform some tests.

--
José Mejuto
Re: [Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code
Hans-Peter Diettrich wrote:
> Bart wrote:
>>> Did you also test the simpler approach, replicating the pattern in one
>>> loop? It's independent of endianness, and can boil down to a single
>>> machine instruction (x86: REP MOVS).
>>
>> It would be repeating either 2, 3, or 4 bytes each time.
>> How would you code that?
>
> I would not care.

I've missed part of this thread due to messages dropped at our gateway, but am currently trying to check on SPARC Linux.

Bart, where's Utf8Length() imported from, or do you have your own implementation that's newer than Lazarus 1.0 + FPC 2.6.0?

--
Mark Morgan Lloyd
markMLl .AT. telemetry.co .DOT. uk

[Opinions above are the author's, not those of his employers or colleagues]
Re: [Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code
On 9/17/13, Hans-Peter Diettrich wrote:
>> function Utf8StringOfChar(AUtf8Char: Utf8String; N: Integer): Utf8String;
>> var
>>   UCharLen, i,nb: Integer;
>>   PC: PChar;
>> begin
>>   Result := '';
>>   UCharLen := Length(AUtf8Char);
>   nb := N*UCharLen;
>   if nb <= 0 then exit;
>> //3:
>   Result := AUtf8Char;
>>   SetLength(Result, nb);
>   PC := PChar(Result);
>>   for i:=1 to nb-UCharLen do
>>   begin
>>     PC[UCharLen] := PC[0];
>>     inc(PC);
>>   end;
>> end;
>

I would think that FillWord and FillDWord are faster than that.
For the 3-byte sequence I did almost the same as you.

More to the point: does it produce correct results on BigEndian?
Theory (as in: as far as I can see and "compute in my head") says yes, but I would like confirmation (on my second code example).

Bart
Re: [Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code
Bart wrote:
>> Did you also test the simpler approach, replicating the pattern in one
>> loop? It's independent of endianness, and can boil down to a single
>> machine instruction (x86: REP MOVS).
>
> It would be repeating either 2, 3, or 4 bytes each time.
> How would you code that?

I would not care.

function Utf8StringOfChar(AUtf8Char: Utf8String; N: Integer): Utf8String;
var
  UCharLen, i,nb: Integer;
  PC: PChar;
begin
  Result := '';
  UCharLen := Length(AUtf8Char);
  nb := N*UCharLen;
  if nb <= 0 then exit;
  //3:
  Result := AUtf8Char;
  SetLength(Result, nb);
  PC := PChar(Result);
  for i:=1 to nb-UCharLen do
  begin
    PC[UCharLen] := PC[0];
    inc(PC);
  end;
end;

DoDi
Re: [Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code
Bart wrote:
> Hi,
>
> Current code for Utf8StringOfChar that I wrote (in LazUtf8 unit) may
> fail due to Utf8 -> UTF16 -> FillWord -> Utf8 conversions, which only
> work for UCS2, as Mattias pointed out to me.
> I constructed a new Utf8StringOfChar function that builds UTF8 without
> conversions. For speed reasons it uses FillWord or FillDWord when
> appropriate.

Did you also test the simpler approach, replicating the pattern in one loop? It's independent of endianness, and can boil down to a single machine instruction (x86: REP MOVS).

DoDi
Re: [Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code
On 9/16/13, Hans-Peter Diettrich wrote:
>
> Did you also test the simpler approach, replicating the pattern in one
> loop? It's independent of endianness, and can boil down to a single
> machine instruction (x86: REP MOVS).

It would be repeating either 2, 3, or 4 bytes each time.
How would you code that?

Simplified version, which should be endian-safe:

function Utf8StringOfChar(AUtf8Char: Utf8String; N: Integer): Utf8String;
var
  UCharLen, i: Integer;
  C1, C2, C3: Char;
  PC: PChar;
begin
  Result := '';
  if (N <= 0) or (Utf8Length(AUtf8Char) <> 1) then Exit;
  UCharLen := Length(AUtf8Char);
  Case UCharLen of
    1: Result := StringOfChar(AUtf8Char[1], N);
    2:
    begin
      SetLength(Result, 2 * N);
      System.FillWord(Result[1], N, PWord(Pointer(AUtf8Char))^);
    end;
    3:
    begin
      SetLength(Result, 3 * N);
      C1 := AUtf8Char[1];
      C2 := AUtf8Char[2];
      C3 := AUtf8Char[3];
      PC := PChar(Result);
      for i:=1 to N do
      begin
        PC^ := C1; inc(PC);
        PC^ := C2; inc(PC);
        PC^ := C3; inc(PC);
      end;
    end;
    4:
    begin
      SetLength(Result, 4 * N);
      System.FillDWord(Result[1], N, PDWord(Pointer(AUtf8Char))^);
    end;
    else
    begin
      //In November 2003 UTF-8 was restricted by RFC 3629 to four bytes to match
      //the constraints of the UTF-16 character encoding.
      //http://en.wikipedia.org/wiki/UTF-8
      Result := StringOfChar('?', N);
    end;
  end;
end;

Bart
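A small check of why the simplified version above should be endian-safe: reading the two bytes through PWord yields the same word as assembling it little-endian first and then converting with LEtoN, whatever the byte order of the machine. This is only a sketch; the variable names are made up for the illustration.

```pascal
program EndianCheck;
{$mode objfpc}{$H+}

var
  S: AnsiString;
  W1, W2: Word;
begin
  S := #$C2#$A2;  // the 2-byte sample from the test code

  // First version: assemble the word little-endian, then convert it
  // to the native byte order.
  W1 := Byte(S[1]) + (Word(Byte(S[2])) shl 8);
  W1 := LEtoN(W1);

  // Simplified version: read the two bytes directly as a native word.
  W2 := PWord(Pointer(S))^;

  // Both words describe the same two bytes in memory, so FillWord
  // writes the same byte sequence either way.
  WriteLn(W1 = W2);  // TRUE
end.
```

On a little-endian machine LEtoN does nothing and both values are the first byte plus the second byte shifted left by 8; on a big-endian machine LEtoN swaps the bytes, which is exactly what the direct read sees.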
[Lazarus] Mac (or other BigEndian machine) users needed to test new Utf8StringOfChar code
Hi,

Current code for Utf8StringOfChar that I wrote (in LazUtf8 unit) may fail due to Utf8 -> UTF16 -> FillWord -> Utf8 conversions, which only work for UCS2, as Mattias pointed out to me.
I constructed a new Utf8StringOfChar function that builds UTF8 without conversions. For speed reasons it uses FillWord or FillDWord when appropriate.

To be sure this is endian-safe (it works AFAICS on Windows i386) I need Mac users (or testers with another BigEndian architecture) to test the code:

function Utf8StringOfChar(AUtf8Char: Utf8String; N: Integer): Utf8String;
var
  UCharLen, i: Integer;
  W: Word;
  DW: DWORD;
  C1, C2, C3: Char;
  PC: PChar;
begin
  Result := '';
  if Utf8Length(AUtf8Char) <> 1 then Exit;
  UCharLen := Length(AUtf8Char);
  Case UCharLen of
    1: Result := StringOfChar(AUtf8Char[1], N);
    2:
    begin
      SetLength(Result, 2 * N);
      W := Byte(AUtf8Char[1]) + (Word(Byte(AUtf8Char[2])) shl 8);
      W := LeToN(W);
      System.FillWord(Result[1], N, W);
    end;
    3:
    begin
      SetLength(Result, 3 * N);
      C1 := AUtf8Char[1];
      C2 := AUtf8Char[2];
      C3 := AUtf8Char[3];
      PC := PChar(Result);
      for i:=1 to N do
      begin
        PC^ := C1; inc(PC);
        PC^ := C2; inc(PC);
        PC^ := C3; inc(PC);
      end;
    end;
    4:
    begin
      SetLength(Result, 4 * N);
      DW := Byte(AUtf8Char[1]) + (Word(Byte(AUtf8Char[2])) shl 8) +
            (Byte(AUtf8Char[3]) + (Word(Byte(AUtf8Char[4])) shl 8)) shl 16;
      DW := LeToN(DW);
      System.FillDWord(Result[1], N, DW);
    end;
    else
    begin
      //In November 2003 UTF-8 was restricted by RFC 3629 to four bytes to match
      //the constraints of the UTF-16 character encoding.
      //http://en.wikipedia.org/wiki/UTF-8
      Result := StringOfChar('?', N);
    end;
  end;
end;

And here's the testing code, in an OnClick event of a Button.
It uses a console for output, so either run it from a console, redirect the output, or replace the write/writeln calls with e.g. memo1.lines.add().
procedure TForm1.Button6Click(Sender: TObject);
var
  ResS: String;
  UChar: String;
  Expected: String;
  i,j,k: Integer;
const
  N = 4;
  Utf8Sample: Array[1..3] of String = (#$C2#$A2,         // ¢
                                       #$E2#$82#$AC,     // €
                                       #$F0#$A4#$AD#$A2  // 𤭢
                                       );
begin
  for k := 1 to 3 do
  begin
    UChar := Utf8Sample[k];
    Expected := '';
    for i := 1 to N do Expected := Expected + UChar;
    ResS := Utf8StringOfChar(UChar, N);
    write('Testing: ');
    write(Length(UChar),'-byte codepoint: ');
    for j := 1 to length(UChar) do write('$',IntToHex(Ord(UChar[j]),2),' ');
    writeln;
    writeln('Expected Length = ',Length(Expected));
    writeln('Found Length= ',Length(ResS));
    write('Expected: ');
    for i := 1 to length(Expected) do write('$',IntToHex(Ord(Expected[i]),2),' ');
    writeln;
    write('Found : ');
    for i := 1 to length(ResS) do write('$',IntToHex(Ord(ResS[i]),2),' ');
    writeln;
    if ResS <> Expected then
    begin
      if Length(ResS) <> Length(Expected) then
        writeln('Different Lengths')
      else
      begin
        // Same length but different content: report the first mismatch.
        i := 1;
        while (length(ResS) >= i) and (ResS[i] = Expected[i]) do Inc(i);
        writeln('Fail: at position ',i,': Expected = $',IntToHex(Ord(Expected[i]),2),
                ' Found = $',IntToHex(Ord(ResS[i]),2));
      end;
    end
    else
      writeln('Success!');
    writeln;
  end;
end;

B.t.w. the new code isn't in trunk yet; I'd rather fix it first if it's broken on BigEndian machines.

Thanks in advance.

Bart
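On the UCS2 limitation mentioned at the start of this message: a codepoint outside the Basic Multilingual Plane needs two UTF-16 code units (a surrogate pair), so a single Word passed to FillWord cannot represent it. The sketch below decodes the 4-byte sample from the test code by hand and computes its surrogate pair. It is a worked example only, with illustrative variable names, and does not use any LazUtf8 routine.

```pascal
program SurrogateDemo;
{$mode objfpc}{$H+}
uses
  SysUtils;  // IntToHex

var
  S: AnsiString;
  CP, HighSur, LowSur: LongInt;
begin
  S := #$F0#$A4#$AD#$A2;  // the 4-byte sample from the test code

  // Decode the 4-byte UTF-8 sequence: 3 payload bits from the lead
  // byte, 6 from each continuation byte.
  CP := ((Ord(S[1]) and $07) shl 18) or
        ((Ord(S[2]) and $3F) shl 12) or
        ((Ord(S[3]) and $3F) shl 6) or
        (Ord(S[4]) and $3F);

  // UTF-16 encoding of a codepoint above $FFFF: subtract $10000 and
  // split the remaining 20 bits over two code units.
  HighSur := $D800 + ((CP - $10000) shr 10);
  LowSur  := $DC00 + ((CP - $10000) and $3FF);

  WriteLn('U+', IntToHex(CP, 5));                             // U+24B62
  WriteLn(IntToHex(HighSur, 4), ' ', IntToHex(LowSur, 4));    // D852 DF62
end.
```

The 2-byte and 3-byte samples both fall below $10000 and fit in one code unit, which is why only the 4-byte case exposes the problem.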