date:20120821

Re: [fpc-devel] FPC -Rintel and -alr options

2012-08-21 Thread Jonas Maebe


On 21 Aug 2012, at 08:32, ABorka wrote:

> On 8/20/2012 22:37, Sergei Gorelkin wrote:
>> -R switch controls parsing assembler blocks in the code.
>> The output format is set with -A  (e.g. -Amasm will produce Intel syntax).
> 
> That requires masm to compile the project.

Only if the compiler calls the assembler. You can use the -s parameter to 
prevent it from doing that.


Jonas
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel

Re: [fpc-devel] FPC -Rintel and -alr options

2012-08-21 Thread Sergei Gorelkin

On 21.08.2012 10:32, ABorka wrote:

> That requires masm to compile the project.

Likewise, using "-al" requires GNU AS to compile. The latter is typically installed together with
FPC, so it just works transparently.

> What I actually want is to see the disassembled code from my project (as Intel
> syntax assembly code) to look at how the win32 and win64 code is optimized and
> then to fine tune the Pascal source.

FPC does not generate *disassembled* code. It generates code that is meant to be assembled,
either with the internal assembler or by one of the external tools.

The "-al" switch (and in general any "-a" switch) implies using an external assembler, which is
GNU AS by default on most targets. It can be overridden with "-A". Thus, "-al -Amasm" will
generate a MASM format listing with line information inserted.

For the purposes of reading assembler source you can also add the "-s" switch, which will stop
after producing the listing (MASM output probably won't assemble even if you have ml.exe available).

> 1. The objdump ("objdump.exe" and "x86_64-win64-objdump.exe") utility from the
> binutils programs works and displays the code with the -Mintel option as Intel
> syntax from the "*.o" files, however when I try to see the line numbers
> (objdump.exe -l ...) and source code lines (objdump.exe -S ...) it does not put
> them in the right places within the disassembled code.
>
> As an example, I used:
> objdump.exe -d -Mintel -w -l -S -EL something.o > something.disassembled

This may be an issue to fix (or maybe not, given that gdb usually locates source lines correctly
using the same line information from the object files).

> 2. The fpc "-al" flag generates a nice "*.s" assembler file during compilation,
> where everything is in place, except it is not Intel syntax assembly but AT&T.
>
> Anyone had experience with something like this (display the disassembled code
> of a project unit/object_file in Intel syntax, with source code lines)?
>
> Thanks for any help

Regards,
Sergei


Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Michael Schnell


On 08/20/2012 08:33 PM, Graeme Geldenhuys wrote:

> Such a restriction should NEVER be okay!

How _can_ it be OK for comparing strings, when all Unicode
variants allow multiple codings for the same single printable
"character" (and, moreover, which "characters" do users regard as "equal")?


-Michael

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Michael Schnell


On 08/20/2012 08:53 PM, Ivanko B wrote:

> Really, the team seems to be fighting for FPC + Lazarus to be capable of
> building thousands of Delphi-based components (archivers, cyphers,
> audio processors, etc.), things which people mostly like Delphi for and
> which seldom use specific Delphi features that cause problems for FPC.

ASYNCPRO (http://sourceforge.net/projects/tpapro/)

PLEASE

I doubt that it will be possible to just compile it (e.g. for Linux), but
with optimum compatibility of the compiler, porting the source code
should be rather easy.

-Michael

Re: [fpc-devel] FPC -Rintel and -alr options

2012-08-21 Thread ABorka


This is exactly what I needed.
"-alr -sr -Amasm" does it. I just put them into my "fpc.cfg" .

Thanks for all the help guys.

On 8/21/2012 00:16, Jonas Maebe wrote:
> On 21 Aug 2012, at 08:32, ABorka wrote:
>
>> On 8/20/2012 22:37, Sergei Gorelkin wrote:
>>> -R switch controls parsing assembler blocks in the code.
>>> The output format is set with -A  (e.g. -Amasm will produce Intel syntax).
>>
>> That requires masm to compile the project.
>
> Only if the compiler calls the assembler. You can use the -s parameter to
> prevent it from doing that.
>
> Jonas





Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Michael Schnell


On 08/20/2012 06:05 PM, Graeme Geldenhuys wrote:
> * UnicodeString is always UTF-16 (so everything but Windows takes a
> conversion penalty)!

This is true of course,

But does that really suggest taking the effort to support other Unicode 
variants ?


The conversion is done only when entering and exiting the OS / GUI 
framework calls. I understand this does not happen too often.


Of course it is really nice to provide support for any Unicode encoding, 
but I don't think it does not harm to use UTF-16 as a default.


-Michael (who does not usually advocate Windows-centric development)

Re: [fpc-devel] FPC -Rintel and -alr options

2012-08-21 Thread ABorka


It would be nice to see it work with objdump also, but not a priority.
With your help guys I was able to get the needed output using the 
fpc.cfg and the FPC parameters you guys mentioned.


Thanks for the help

<...snip...>
On 8/21/2012 00:19, Sergei Gorelkin wrote:
> On 21.08.2012 10:32, ABorka wrote:
>
>> 1. The objdump ("objdump.exe" and "x86_64-win64-objdump.exe") utility
>> from the binutils programs works and displays the code with the -Mintel
>> option as Intel syntax from the "*.o" files, however when I try to see
>> the line numbers (objdump.exe -l ...) and source code lines
>> (objdump.exe -S ...) it does not put them in the right places within
>> the disassembled code.
>>
>> As an example, I used:
>> objdump.exe -d -Mintel -w -l -S -EL something.o > something.disassembled
>
> This may be an issue to fix (or maybe not, given that gdb usually
> locates source lines correctly using the same line information from the
> object files).

<...snip...>


Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Michael Schnell


Sorry:
I do think it would not harm to use UTF-16 as a default.

-Michael

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Graeme Geldenhuys

Hi,

On 20 August 2012 23:26, Hans-Peter Diettrich  wrote:
>
> UCS2 is nowadays known as the BMP (Basic Multilingual Plane) of full
> Unicode.

UCS-2 is considered obsolete! Nothing else needs to be said. :)

> Have a look at the full Unicode codepages, what is and what is not
> part of the BMP.

Download the Unicode eBook chapters and look at the section "Details
of Allocation", normally in Chapter 2. There are lots of useful things
in Planes 1-16 (the BMP being Plane 0).

Like I said, outside the BMP, less is used for spoken languages,
but there are some CJK characters in Plane 2. Plane 1 has many
interesting things for "modern applications", like Domino Tiles,
Advanced Math symbols (which we use in our company), Map Symbols often
seen on GPS units, Mahjong Tiles, Musical symbols, Smiley Face symbols
used in emails and IM programs, System of Divination symbols, a large
Historic Script area, large private areas (any apps that ship custom
fonts could use those private areas; Mac OS X uses the private areas
for the "apple" symbol in their fonts) and many more.

The reluctance to support Planes 1-16 is often seen among UTF-16 or UCS-2
users, because they are just lazy! That is just wrong. I'll say it
again, that is what makes UTF-8 so great. NO special handling or
implementation is required to handle the _whole_ Unicode character
set. Implement UTF-8 to handle the BMP (Plane 0), and you get handling
for Planes 1-16 for free.
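
To make the plane discussion concrete, here is a minimal Python sketch (Python is used only for illustration; it is not part of the FPC RTL). It reports how many UTF-8 bytes and UTF-16 code units a single code point needs. The sample code points and the helper name `encoded_sizes` are assumptions chosen for this example.

```python
# Sketch: encoded size of one code point in UTF-8 and UTF-16.
# The helper name and the sample code points are illustrative only.

def encoded_sizes(ch):
    """Return (utf8_bytes, utf16_code_units) for a one-code-point string."""
    utf8_bytes = len(ch.encode("utf-8"))
    # UTF-16 code units are 2 bytes each; "-le" avoids emitting a BOM.
    utf16_units = len(ch.encode("utf-16-le")) // 2
    return utf8_bytes, utf16_units

samples = {
    "U+0041 (BMP, ASCII letter)": "\u0041",
    "U+20AC (BMP, euro sign)": "\u20ac",
    "U+1F030 (Plane 1, Domino Tiles block)": "\U0001F030",
    "U+20000 (Plane 2, CJK ideograph)": "\U00020000",
}

for label, ch in samples.items():
    print(label, encoded_sizes(ch))
```

Code points outside the BMP take four bytes in UTF-8 and two code units (a surrogate pair) in UTF-16, so both encodings are variable-length once Planes 1-16 are in play.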

-- 
Regards,
  - Graeme -

___
fpGUI - a cross-platform Free Pascal GUI toolkit
http://fpgui.sourceforge.net

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Mattias Gaertner

On Mon, 20 Aug 2012 18:46:29 +0100
Hans-Peter Diettrich  wrote:

> Mattias Gaertner schrieb:
> 
> > I guess most people would say that "good multi language Unicode support
> > in FPC" requires a Unicode supporting RTL.
> 
> Please clarify: *Unicode* or UTF-16 support?
> 
> Unicode is covered by both UTF-8 and UTF-16, so it's not really 
> important which encoding is used in the supporting procedures.

I know.

I'm only saying that for most people a UTF-8 and/or UTF-16
string type is not enough. For example, the common file functions
currently support only the 8-bit Windows codepage.
Users expect to handle files x-platform without using external libraries
like LazUtils.

Mattias

Re: [fpc-devel] FPC -Rintel and -alr options

2012-08-21 Thread Sven Barth


On 21.08.2012 09:35, ABorka wrote:

> This is exactly what I needed.
> "-alr -sr -Amasm" does it. I just put them into my "fpc.cfg" .


Why did you put this into your fpc.cfg? You are aware that with the "-s" 
switch no binary code is generated? Or are you protecting that with an 
IFDEF?
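
A minimal sketch of such a guard in fpc.cfg, assuming a hypothetical symbol name LISTING (not from this thread) that is defined on the command line with -dLISTING only when a listing is wanted. The three switches are the ones quoted above:

```cfg
# LISTING is a hypothetical symbol name chosen for this sketch.
# Enable the block with: fpc -dLISTING yourunit.pas
#IFDEF LISTING
-alr
-sr
-Amasm
#ENDIF
```

Without -dLISTING the block is skipped and the compiler produces binaries as usual.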


Regards,
Sven


Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Ivanko B

For non-fixed char length there's nothing better than UTF-8 (ASCII
compatible by default, ready for any future alphabets, ...). For fixed-char
length (fast string operations etc.) there's also nothing better than
UCS-2 (the Earth coverage) & UCS-4 (the galaxy coverage).
The non-fixed char length UTF-16 (UCS-2 + surrogate pairs) looks less
efficient than UTF-8 from almost any point of view.

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Ivanko B

> For non-fixed char length there's nothing better than UTF-8 (ASCII
> compatible by default, ready for any future alphabets, ...). For fixed-char
> length (fast string operations etc.) there's also nothing better than
> UCS-2 (the Earth coverage) & UCS-4 (the galaxy coverage).
> The non-fixed char length UTF-16 (UCS-2 + surrogate pairs) looks less
> efficient than UTF-8 from almost any point of view.

This assumes the FPC RTL has optimized string functions for UTF-8
(rudimentary at present, via conversion procedures), UCS-2 (not all
functions at present) & UCS-4 (none at present).

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Martin Schreiber

On Tuesday 21 August 2012 09:56:57 Ivanko B wrote:
> For non-fixed char length there's nothing better than UTF8 (default
> ASCII compatible, ready for any future alphabets,..). For fixed-char
> length (fast string operations etc) also there's nothing better than
> UCS-2 (the Earth coverage ) & UCS-4 (the galaxy coverage).
> The non-fixed char length UTF-16 (UCS-2 + surrogate pairs) looks less
> efficient than UTF-8 almost from any look point.

I disagree. Handling 1..4(6) bytes is less efficient than handling surrogate 
*pairs*.

Martin

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Graeme Geldenhuys

On 21 August 2012 08:27, Michael Schnell  wrote:
>
> I doubt that it will be possible to just compile it (e.g. for Linux) but
> with optimum compatibility of the compiler, porting  the source code should
> be rather easy.

You're in for a surprise... A statement that reads "It provides
direct access to serial ports, TAPI, and the Microsoft Speech API."
should start sounding alarm bells for Linux developers. TAPI and the MS
Speech API don't exist under Linux, so they require completely new
implementations from the ground up. Direct serial port access is
probably very different under Linux too.

I have ported (or fixed previous port attempts) of Turbo Power
products to work x-platform. Turbo Power loved to use Windows API
calls etc, and that makes porting a non-trivial task. But with lots of
effort, anything is obviously possible. Not to mention that sometimes
things are just easier when you start an implementation from scratch
in a x-platform manner.

Anyway, I don't know how this relates to the current message thread,
so I'll stop here.

-- 
Regards,
  - Graeme -


Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Ivanko B

> Handling 1..4(6) bytes is less efficient than handling surrogate
> *pairs*.

But surrogate pairs break array-like fast char access anyway, don't they?
And there's a lot of room for optimizing UTF-8 operations, for instance
http://bjoern.hoehrmann.de/utf-8/decoder/dfa/.
See also the publication at http://www.utf8everywhere.org/.

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Graeme Geldenhuys

Hi,

On 21 August 2012 08:28, Michael Schnell  wrote:
>
> How can it be OK regarding comparing strings, when all Unicode variants
> allow for multiple codings for the same single printable "character" (and
> moreover what "character" do the users regard as "equal").


The Unicode Standard covers all this. In short, if you want to do
string comparisons, one option is to normalise the text before you do
a compare.
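
As a minimal illustration of normalising before comparing, here is a Python sketch using the standard `unicodedata` module (Python stands in for whatever the RTL would offer; the helper name `equal_normalized` is an assumption made for this example). The same visible "é" is written once as a single code point and once as "e" plus a combining accent:

```python
import unicodedata

# Two encodings of the same visible character "é".
composed = "\u00e9"      # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # two code points: "e" + COMBINING ACUTE ACCENT

def equal_normalized(a, b, form="NFC"):
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

print(composed == decomposed)                  # raw code point comparison
print(equal_normalized(composed, decomposed))  # comparison after NFC
```

The raw comparison treats the two strings as different, while the normalized comparison treats them as equal. Which form to use (NFC, NFD, or the compatibility forms) is an application decision.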


-- 
Regards,
  - Graeme -



Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Graeme Geldenhuys

Hi,

On 21 August 2012 08:37, Michael Schnell  wrote:
>
> But does that really suggest taking the effort to support other Unicode
> variants ?

Yes, if you want to make the statement "FPC fully supports Unicode".

> The conversion is done only when entering and exiting the OS / GUI framework
> calls. I understand this does not happen too often.

I beg to differ.

> Of course it is really nice to provide support for any Unicode encoding, but
> I don't think it does not harm to use UTF-16 as a default.

As Florian said... UTF-16 was implemented first in the FPC Compiler.
Others are welcome to add UTF-8 as the default for other FPC platforms
like Linux. There are a lot more Unix-type platforms than the one
Windows platform, so why must all the other platforms take a
conversion hit? I know nothing about compiler internals, but I hope to
rise to the challenge Florian set.

-- 
Regards,
  - Graeme -


Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Graeme Geldenhuys

On 21 August 2012 09:13, Martin Schreiber  wrote:
> I disagree. Handling 1..4(6) bytes is less efficient than handling surrogate
> *pairs*.

Yet another myth. But if you are such a UTF-16 (actually UCS-2 as
that is what MSEgui supports) fan, why isn't MSEgui source code stored
in UTF-16 encoding either? ;-) There is good reason why UTF-8 is so
popular. And by the way, a UTF-8 codepoint is only 1-4 bytes in size.

Anyway, this message thread is not about a UTF-16 vs UTF-8 pissing
match, it is about FPC boasting good all-round Unicode support in the
most efficient manner on all supported platforms.

-- 
Regards,
  - Graeme -


Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Mattias Gaertner

On Mon, 20 Aug 2012 20:56:46 +0200
Florian Klämpfl  wrote:

>[...]
> The current situation is:
> - either somebody starts to implement support for unicodestring being
> utf-8 (or whatever) on linux in a compatible way with the current
> approach, then 2.8.0 will use this
> - nobody works on it, then 2.8.0 comes with unicodestring=utf-16 always.

IMO unicodestring should be the same on all platforms, because
otherwise the character size switches per platform, which is hard to
test and asking for trouble.

The compiler already supports a UTF8String, right?
If yes, then some functions can use UTF8String, some UnicodeString
(=UTF-16) and the compiler magic will convert automatically.

The difficult decision is which functions and types should use UTF-8
and which UTF-16. This may depend on the platform.

One problem is that a UTF-8/16 string can contain invalid characters,
making it impossible to convert.
For example, under Linux file names are treated as UTF-8 but are only
bytes. They can and they do contain invalid UTF-8 sequences.
If your program should support this, you must use a FindFirst
with UTF-8. To be clear: I'm not saying the default FindFirst under Linux
must be UTF-8, I'm only saying there must be one version with UTF-8, e.g.
FindFirstU8, and that must directly use the Linux file functions
without conversions.
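
The file name problem can be sketched in a few lines of Python (used here purely as an illustration; nothing below is FPC API). The byte string `raw_name` is a made-up file name containing the byte 0xFF, which can never occur in valid UTF-8. The `surrogateescape` error handler is Python's own mechanism for carrying such bytes through a text string without loss; it is shown only as one possible approach.

```python
# A hypothetical Linux file name: plain bytes, not valid UTF-8.
raw_name = b"report-\xff.txt"

def is_valid_utf8(data):
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(raw_name))  # a strict conversion is not possible

# Python-specific workaround: undecodable bytes are mapped to lone
# surrogates on decode and restored to the same bytes on encode.
as_text = raw_name.decode("utf-8", errors="surrogateescape")
round_trip = as_text.encode("utf-8", errors="surrogateescape")
print(round_trip == raw_name)
```

A strict conversion to UTF-16 would have to reject or alter this name, which is why a byte-preserving path to the OS file functions matters.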

I guess there is no good solution for TStrings. Whatever string type is
chosen, some programs will suffer.

Mattias

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Ivanko B

> But if you are such a UTF-16 (actually UCS-2 as
> that is what MSEgui supports) fan

If Martin can implement UTF-16 (with surrogate pair) support in the MSEgui
string units (and these units fully cover the code missing from the FPC RTL),
then things are excellent.

PS:
UTF-8 is very, very slow compared to UCS-2 for string manipulation,
so its best use is encoding source files (as done in MSEide).

Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-21 Thread Graeme Geldenhuys

Hi,

On 20 August 2012 23:18, Hans-Peter Diettrich  wrote:
> The Delphi developers wanted to implement what you suggest, but dropped that
> approach later again.

When Embarcadero implemented Unicode support, Delphi was a pure
Windows application. They had no need to think of anything other than
what Windows supports. Not to mention that they were on a tight budget
and time constraint, because every minute they wasted, they lost
clients moving to more "up to date" compilers and languages. So it was
all about getting something out as quickly as possible, and probably
cutting corners where possible.

> A character type is somewhat useless, unless all strings are UTF-32 (what's
> quite unlikely now). Instead substrings should be used, which can contain
> any number of bytes or characters.

I guess that depends on how you define the Char type. Is it meant to
hold a single Unicode codepoint, or a single printable character? If
the latter, then probably a bigger Char type is required.

> You also have to explain what String[4] means in an Unicode environment.

The String[] syntax in Object Pascal means you are defining a
shortstring type (irrespective of compiler mode), thus an array of
bytes. In this case 4 bytes are used to hold any Unicode codepoint.

> Q: Did you ever read about the new string implementation of FPC?

I have read some of the message threads that went around in fpc-devel,
I also worked on the cp branch before it was merged with Trunk. If you
have any other "documentation" in mind, please post the URL and I'll
happily take a look.

-- 
Regards,
  - Graeme -


Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Martin Schreiber


On 21.08.2012 09:31, Graeme Geldenhuys wrote:
> On 21 August 2012 09:13, Martin Schreiber  wrote:
>> I disagree. Handling 1..4(6) bytes is less efficient than handling surrogate
>> *pairs*.
>
> Yet another myth

Ehm, I did both. In the beginning MSEgui switched from Widestring to
utf-8 encoded Ansistring because of the buggy FPC widestring
implementation (MSEgui started with Delphi/Kylix). Some weeks later I
switched back to widestring and bit the bullet to write FPC bug reports
until it reached usable stability.

> But if you are such a UTF-16 (actually UCS-2 as
> that is what MSEgui supports) fan, why isn't MSEgui source code stored
> in UTF-16 encoding either? ;-)

Sure, MSEgui uses utf-8 for external storing and exchanging of text data.
Internally everything is 16-bit UnicodeString. Use the best encoding for the
task. ;-)

> There is good reason why UTF-8 is so
> popular. And by the way, a UTF-8 codepoint is only 1-4 bytes in size.

It depends on the specification; did you see the parentheses?
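
For reference, the "1..4(6)" refers to two definitions of UTF-8: the older RFC 2279 allowed sequences of up to 6 bytes, while RFC 3629 restricts UTF-8 to the Unicode range up to U+10FFFF and therefore to at most 4 bytes. A minimal Python sketch of the length rule under RFC 3629 (the function name `utf8_length` is an assumption made for this example):

```python
def utf8_length(code_point):
    """Bytes needed to encode one code point as UTF-8 (RFC 3629 range)."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    if code_point <= 0x10FFFF:
        return 4
    raise ValueError("beyond U+10FFFF: not encodable under RFC 3629")

# Check the boundaries against Python's built-in encoder.
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
```

The 5- and 6-byte forms only existed to cover values above U+10FFFF, which Unicode does not assign.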

Martin

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Martin Schreiber


On 21.08.2012 09:32, Mattias Gaertner wrote:
> On Mon, 20 Aug 2012 20:56:46 +0200
> Florian Klämpfl  wrote:
>
>> [...]
>> The current situation is:
>> - either somebody starts to implement support for unicodestring being
>> utf-8 (or whatever) on linux in a compatible way with the current
>> approach, then 2.8.0 will use this
>> - nobody works on it, then 2.8.0 comes with unicodestring=utf-16 always.
>
> IMO unicodestring should be the same on all platforms, because
> otherwise the character size switches per platform, which is hard to
> test and asking for trouble.

100% agreed.

Martin

Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-21 Thread Graeme Geldenhuys

On 21 August 2012 07:10, Ivanko B  wrote:
> How about supporting in the RTL all versions of UCS-2 & UTF-16 (for
> fast per-char access etc optimizations) and UTF-8 (for unlimited
> number of alphabets) ?

All "access a char by index into a string" code I have seen works, 99.99% of
the time, in a sequential manner. For that reason there is no
speed difference between using a UTF-16 or UTF-8 encoded string. Both
can be coded equally efficiently.

-- 
Regards,
  - Graeme -


Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Mattias Gaertner

On Tue, 21 Aug 2012 13:41:38 +0500
Ivanko B  wrote:

> But if you are such a UTF-16 (actually UCS-2 as
>  that is what MSEgui supports) fan
> =
> If Martin can implement UTF-16 (with surrogate pair) support in MSEgui
> string units (and these units fully cover absenting code of FPC RTL )
> then the things are excellent.
> 
> PS:
> UTF-8 is very-very slow compared to UCS-2 as to string manipulations
> so its best usage is encoding source files (as done in MSEide).

Ivanko, please stop this "slow" nonsense.
Performance heavily depends on what you do and you can find good
examples for almost any Unicode encoding.
At those places where speed matters, you are free to use better
functions in your application.

Mattias

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Aleksa Todorovic

On Tue, Aug 21, 2012 at 10:16 AM, Ivanko B  wrote:
>
> Handling 1..4(6) bytes is less efficient than handling surrogate
>  *pairs*.
> ===
> But surrogate pairs break array-like fast char access anyway,  isn't it ?

It's also "broken" in UTF8 in the same way - so none of them gets +1
on this. UCS4 is the only real winner here (one dword for each
character).
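
A minimal Python sketch of the difference (illustration only; the helper name `code_point_at` and the sample string are assumptions). With UTF-32 the n-th code point always sits at byte offset 4*n, whereas the same index into UTF-16 code units can land inside a surrogate pair:

```python
def code_point_at(utf32le, index):
    """Fetch the index-th code point from UTF-32-LE bytes by direct offset."""
    start = 4 * index
    return int.from_bytes(utf32le[start:start + 4], "little")

text = "a\U0001F030z"                 # 'a', a Plane 1 code point, 'z'

utf32 = text.encode("utf-32-le")
print(hex(code_point_at(utf32, 1)))   # the Plane 1 code point itself

# The same index into UTF-16 code units hits only half of the character.
utf16 = text.encode("utf-16-le")
second_unit = int.from_bytes(utf16[2:4], "little")
print(hex(second_unit))               # a high surrogate, not a character
```

Note that even UCS-4 only gives fixed-size code points; a user-perceived character built from a base letter plus combining marks still spans several of them.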







Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Mattias Gaertner

On Tue, 21 Aug 2012 09:23:30 +0100
Graeme Geldenhuys  wrote:

>[...]
> > The conversion is done only when entering and exiting the OS / GUI framework
> > calls. I understand this does not happen too often.
> 
> I beg to differ.

Maybe you can name some example. Concrete problems can be solved,
abstract UTF-8 vs UTF-16 can not.

Mattias

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Michael Schnell


On 08/21/2012 10:15 AM, Graeme Geldenhuys wrote:
> You're in for a surprise... With a statement that reads "It provides
> direct access to serial ports, TAPI, and the Microsoft Speech API." it
> should start sounding alarm bells for Linux developers.

Of course you are very right and silly me did not take these additional
uses into account.

I only used AsyncPro for avoiding explicit Thread programming when using
serial interfaces. And just this is a demand I heard very often in
several FPC related forums.

So I should have stated that these additional Windows-centric uses of
AsyncPro need to be {$if}'ed out when not appropriate.

> Direct serial port access is probably very different under Linux too.

I don't think so. Accessing the port is always similar to accessing a
file, and I understand a blocking access (which is used in the Threads
of AsyncPro) is always provided.

Supposedly setting the serial port parameters needs some porting.

> I have ported (or fixed previous port attempts) of Turbo Power
> products to work x-platform. Turbo Power loved to use Windows API
> calls etc, and that makes porting a non-trivial task.

I suppose you are right. That is why I did not start porting (the said
parts of) AsyncPro.

> Not to mention that sometimes things are just easier when you start
> an implementation from scratch

You may be right on this.

-Michael

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Michael Schnell


On 08/21/2012 10:17 AM, Graeme Geldenhuys wrote:
> if you want to do string comparisons, one option is to normalise the
> text before you do a compare.

Unlike the conversion necessary with system calls when a different
encoding is used internally, comparing strings happens very often within
user code. So the compiler can't force using complex stuff like that
all over the place.


-Michael


Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Michael Schnell


On 08/21/2012 10:32 AM, Mattias Gaertner wrote:
> IMO unicodestring should be the same on all platforms, because
> otherwise the character size switches per platform, which is hard to
> test and asking for trouble.

This does seem appropriate. But right now Delphi compatibility forces 16
bits and Lazarus forces 8 bits :( .


-Michael


Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Graeme Geldenhuys

Hi,

On 21 August 2012 09:32, Mattias Gaertner  wrote:
>
> IMO unicodestring should be the same on all platforms, because
> otherwise the character size switches per platform,

Please define "character" in your sentence above. Are you referring to
a Unicode codepoint, or a "printable character"? If the first, then 4
bytes is always sufficient on all platforms.

> The compiler already supports an UTF8String, right?
> If yes, then some functions can use UTF8String, some UnicodeString
> (=UTF-16) and the compiler magic will convert automatically.

How I wish FPC would stop this ridiculous ambiguity that Delphi
enforces. Can't we just introduce UTF8String and UTF16String types? By
their names they clearly state what encoding they hold. A UnicodeString
type should mean any Unicode encoding, and default to UTF-8 under
*nix type systems and UTF-16 under Windows. Thus no performance loss
on any platform. After all, the name "Unicode String" does not imply
UTF-16 only, as per the Unicode Standard.

> The difficult decision is what functions and types should use UTF-8
> and what UTF-16. This may depend on the platform.

As I said, if you use the correct default encoding on each platform
for the UnicodeString type, the problem you mention will not be a
problem any more. Linux will use UTF-8 by default, so file handling
and API calls will work without any conversion.

The whole RTL should use UnicodeString type, where the encoding is as
I described above.

> I guess there is no good solution for TStrings. Whatever string type is
> chosen, some programs will suffer.

Why will some suffer? Simply default UnicodeString to the correct
encoding on each platform, and no performance issues and no
unnecessary conversions will occur.

-- 
Regards,
  - Graeme -


Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Graeme Geldenhuys

On 21 August 2012 09:41, Ivanko B  wrote:
> UTF-8 is very-very slow compared to UCS-2 as to string manipulations
> so its best usage is encoding source files (as done in MSEide).

Please supply a test program that proves this. I don't believe you are correct.

I have implemented multiple text edit/display widgets that do plenty
of string manipulation... all based on the UTF-8 encoding. I have
suffered NO speed penalties.

-- 
Regards,
  - Graeme -


Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Graeme Geldenhuys

On 21 August 2012 08:53, Martin Schreiber  wrote:
>>
>> Yet another myth
>
>
> Ehm, I did both. In the beginning MSEgui switched from Widestring to utf-8


Just because you had a bad experience doesn't doom the utf-8 encoding
forever. Maybe you just had a buggy implementation. No coder is
perfect. ;-)


-- 
Regards,
  - Graeme -



Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-21 Thread Martin Schreiber


On 21.08.2012 09:55, Graeme Geldenhuys wrote:
> On 21 August 2012 07:10, Ivanko B  wrote:
>> How about supporting in the RTL all versions of UCS-2 & UTF-16 (for
>> fast per-char access etc optimizations) and UTF-8 (for unlimited
>> number of alphabets) ?
>
> All "access a char by index into a string" code I have seen, 99.99% of
> the time work in a sequential manner. For that reason there is no
> speed difference between using a UTF-16 or UTF-8 encoded string. Both
> can be coded equally efficient.

Graeme, this is simply not true. When searching for known German characters
in a UnicodeString, the program can use the simple approach of indexing by
character (code unit). This is even possible for known Chinese symbols of the
BMP. And a simple "if" for surrogate pairs is more efficient than a 4-stage
"case" for utf-8.
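
The two decoding steps being compared can be sketched as follows. This is only an illustration in Python, not FPC or MSEgui code; the function names are made up for this sketch, and both decoders assume well-formed input (no error handling).

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def next_utf16(units, i):
    """Decode one code point from 16-bit code units; returns (cp, next_i)."""
    u = units[i]
    if 0xD800 <= u <= 0xDBFF:  # the single "if": high surrogate found
        low = units[i + 1]
        return 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00), i + 2
    return u, i + 1


def next_utf8(data, i):
    """Decode one code point from UTF-8 bytes; returns (cp, next_i)."""
    b = data[i]
    if b < 0x80:  # stage 1: one byte (ASCII)
        return b, i + 1
    if b < 0xE0:  # stage 2: two bytes
        return ((b & 0x1F) << 6) | (data[i + 1] & 0x3F), i + 2
    if b < 0xF0:  # stage 3: three bytes
        return (((b & 0x0F) << 12) | ((data[i + 1] & 0x3F) << 6)
                | (data[i + 2] & 0x3F)), i + 3
    # stage 4: four bytes
    return (((b & 0x07) << 18) | ((data[i + 1] & 0x3F) << 12)
            | ((data[i + 2] & 0x3F) << 6) | (data[i + 3] & 0x3F)), i + 4


def decode_all(step, seq):
    """Walk seq sequentially with the given step function."""
    out, i = [], 0
    while i < len(seq):
        cp, i = step(seq, i)
        out.append(cp)
    return out
```

Both walkers are sequential and yield the same code points; the difference under discussion is only the number of branches per step.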


Martin

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Ivanko B

> Performance heavily depends on what you do and you can find good
> examples

Hmm... are there implementations of UTF-8 substring extraction, string
comparison etc. that do not use intermediate HEAVY normalizations
from/to a fixed-char-length type for BOTH input arguments?
Though I'm sure that Latin-script users don't suffer from the slowness of
utf-8 where utf-8 = ansistring.

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Aleksa Todorovic

On Tue, Aug 21, 2012 at 9:53 AM, Martin Schreiber  wrote:
> On 21.08.2012 09:31, Graeme Geldenhuys wrote:
>
> Ehm, I did both. In the beginning MSEgui switched from Widestring to utf-8
> encoded Ansistring because of the buggy FPC widestring implementation
> (MSEgui started with Delphi/Kylix). Some weeks later I switched back to
> widestring and bit the bullet to write FPC bug reports until it reached
> usable stability.
>
>> But if you are such a UTF-16 (actually UCS-2 as
>> that is what MSEgui supports) fan, why isn't MSEgui source code stored
>> in UTF-16 encoding either? ;-)
>
> Sure, MSEgui uses utf-8 for external storing and exchanging text data.
> Internally all is 16 bit UnicodeString. Use the best encoding for the task.
> ;-)

+1

There are a lot of encodings around, but for different areas of application:
- external text assets could be in any encoding (system-locale
encoding, UTF8, UTF16 both BE and LE - for example, MS Excel exports
UTF16 text files)
- Windows system calls are UTF16, on (most of the) other platforms UTF8
- input translation (physical keyboard to Unicode character)
- internal application representation is the choice of the developer

The problem here is that libraries floating around (including RTL and
FCL) use different string types (UnicodeString, UTF8String,
AnsiString), so the question is - is it possible to (re)write those
libraries in a generic way (RawByteString?), so they can work with any
string type?

In my experience, only about 1% of an application requires handling of
individual Unicode characters (input, rendering, GUI text editing).
The other parts of the application can happily live without that knowledge :-)


Aleksa

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Ivanko B

> I have implemented multiple text edit/display widgets that do plenty
> of string manipulation... all based on the UTF-8 encoding. I have
> suffered NO speed penalties.

Sure, no problems for GUI. But how about processing large texts?

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Mattias Gaertner

On Tue, 21 Aug 2012 11:09:28 +0200
Michael Schnell  wrote:

> On 08/21/2012 10:32 AM, Mattias Gaertner wrote:
> > IMO unicodestring should be the same on all platforms, because 
> > otherwise the character size switches per platform, which is hard to 
> > test and asking for trouble. 
> This does seem appropriate. But right now Delphi compatibility forces 16
> Bits and Lazarus forces 8 Bits :( .

Lazarus does not force "unicodestring" to anything for the simple
reason, that it does not use it. It only provides some functions for
converting UTF-8 to/from unicodestring.

At the moment Lazarus does not even use UTF8String, because the RTL
does not use it.

Mattias

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Graeme Geldenhuys

On 21 August 2012 10:01, Mattias Gaertner  wrote:
>> > The conversion is done only when entering and exiting the OS / GUI 
>> > framework
>> > calls. I understand this does not happen too often.
>>
>> I beg to differ.
>
> Maybe you can name some example.

OK, let's assume I'm under Linux and fpGUI uses UTF-8 internally
(ideally I would like fpGUI to use the correct encoding on each
platform).

TfpgMemo.LoadFromFile().   TfpgMemo uses a TStringList internally,
which LoadFromFile maps to. Under Linux, filenames are UTF-8 encoded,
so conversion occurs because TStringList uses UTF-16 (as per the FPC
2.7.1 suggestions) for the filename variable, and for all the file
access routines. The file contains UTF-8 encoded text. It is now
converted to UTF-16 text to be stored internally in the TStringList.
fpGUI now needs to display that text, so calls the UTF-8 X11 APIs, so
another conversion is required.

This is a simple example, but look at all the conversions already. Now
if UnicodeString uses the correct encoding on each platform, the
conversions would be zero!
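
To make the chain of conversions in this scenario concrete, here is a small Python model of it. It is only a sketch under stated assumptions: `load_and_show`, `read_file` and `draw_text` are hypothetical stand-ins (not fpGUI, FPC or X11 APIs), and the model assumes a UTF-16 string container sitting between a UTF-8 OS and a UTF-8 drawing API.

```python
def to_utf16(data_utf8, what, log):
    """Convert UTF-8 bytes to UTF-16LE bytes and record the conversion."""
    log.append(what + ": UTF-8 -> UTF-16")
    return data_utf8.decode("utf-8").encode("utf-16-le")


def to_utf8(data_utf16, what, log):
    """Convert UTF-16LE bytes to UTF-8 bytes and record the conversion."""
    log.append(what + ": UTF-16 -> UTF-8")
    return data_utf16.decode("utf-16-le").encode("utf-8")


def load_and_show(filename_utf8, read_file, draw_text):
    """Model a UTF-16 string list between a UTF-8 OS and a UTF-8 toolkit."""
    log = []
    name16 = to_utf16(filename_utf8, "file name", log)  # caller -> string list
    raw = read_file(to_utf8(name16, "file name", log))  # string list -> OS call
    text16 = to_utf16(raw, "contents", log)             # disk -> memory
    draw_text(to_utf8(text16, "contents", log))         # memory -> drawing API
    return log
```

The number of logged conversions is a property of this model only; how expensive each one is in a real program is a separate question.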

-- 
Regards,
  - Graeme -

___
fpGUI - a cross-platform Free Pascal GUI toolkit
http://fpgui.sourceforge.net

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Graeme Geldenhuys

On 21 August 2012 10:16, Ivanko B  wrote:
> Though I'm sure that Latin-script users don't suffer from the slowness of
> utf-8 where utf-8 = ansistring.

And I gather you base your assumptions on MSEgui. MSEgui uses UCS-2,
*not* UTF-16. I also believe MSEgui doesn't bother with surrogate
pairs (please correct me if I am wrong). UTF-16 will also slow down if
UTF-16LE and UTF-16BE are taken into account, and surrogate pairs added
to the mix. How well will your "access char via index" code perform on
that? Or will your code simply be broken then? In my apps where I use
UTF-8 internally, I don't have to worry about any of that, and my
programs will continue ticking as normal, no matter whether I use Plane
0 or Planes 1-16 of the Unicode codepoints.
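
The BMP shortcut can be illustrated with a short Python sketch (an illustration only, not MSEgui or FPC code; the helper names are invented here). It shows that indexing 16-bit code units stops matching characters as soon as a Plane 1 code point appears.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def is_surrogate(unit):
    """True if the code unit is one half of a surrogate pair."""
    return 0xD800 <= unit <= 0xDFFF


def count_code_points(units):
    """Count code points by skipping the low half of each surrogate pair."""
    return sum(1 for u in units if not 0xDC00 <= u <= 0xDFFF)


# 'a', MUSICAL SYMBOL G CLEF (U+1D11E, Plane 1), 'b'
text = "a\U0001D11Eb"
units = utf16_units(text)

# UCS-2 style code would treat units[2] as the third character, but it is
# only the low half of the surrogate pair; 'b' actually sits at units[3].
third_by_index = units[2]
```

Code that handles surrogates correctly has to walk the string, which is the same sequential access pattern that UTF-8 needs.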

As I said multiple times, developers love to take shortcuts when it
comes to UCS-2 or UTF-16. They only think BMP and nothing further.
That's not what I consider "supporting Unicode".

-- 
Regards,
  - Graeme -

___
fpGUI - a cross-platform Free Pascal GUI toolkit
http://fpgui.sourceforge.net

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Marco van de Voort

In our previous episode, Mattias Gaertner said:
> 
> IMO unicodestring should be the same on all platforms, because
> otherwise the character size switches per platform, which is hard to
> test and asking for trouble.

I think the big issue is more about what "string" will be when code is
compiled in the modes that are now objfpc h+ and delphi.

And then especially anything you override or pass VAR strings to.

> The compiler already supports an UTF8String, right?
> If yes, then some functions can use UTF8String, some UnicodeString
> (=UTF-16) and the compiler magic will convert automatically.

rawbytestring and unicodestring overloaded. See thread in fpc-pascal of a
few days back with subject "rawbytestring".

> The difficult decision is what functions and types should use UTF-8
> and what UTF-16. This may depend on the platform.

The question is: if you fix the class hierarchy to a certain type on all
platforms (to avoid problems with virtual/override and VAR), does it make
sense to divide the RTL at a fine grain between both encoding types?

That stringtype will be so dominant in practice that doing the RTL in a
different stringtype depending on platform won't be as useful.
 
> One problem is that an UTF-8/16 string can contain invalid characters
> making it impossible to convert.
> For example under Linux file names are treated as UTF-8 but are only
> bytes. They can and they do contain invalid UTF-8 characters.
> If your program should support this, you must use a FindFirst
> with UTF-8. To be clear: I don't say the default FindFirst under Linux
> must be UTF-8, I only say, there must be one version with UTF-8, e.g.
> FindFirstU8 and that must directly use the Linux file functions
> without conversions.

That's ugly indeed, since that doesn't mean just a utf8 overload, but that
the entire internal trajectory behind it (searchrec included) must be
1-byte without conversion. Or the 1-byte to utf16 and back conversion must
be stable (invF(F(x)) = x).
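
What a stable round trip means for a file name that is not valid UTF-8 can be sketched in Python. This is an illustration only: the four function names are hypothetical, and the "stable" variant relies on Python's `surrogateescape` and `surrogatepass` error handlers, which is one possible technique and not something the FPC RTL is known to do.

```python
def to_utf16_lossy(name_bytes):
    """F: bytes -> UTF-16, replacing undecodable bytes with U+FFFD."""
    return name_bytes.decode("utf-8", errors="replace").encode("utf-16-le")


def from_utf16_lossy(name_utf16):
    """invF for the lossy variant: UTF-16 -> UTF-8 bytes."""
    return name_utf16.decode("utf-16-le").encode("utf-8")


def to_utf16_stable(name_bytes):
    """F: bytes -> UTF-16, smuggling undecodable bytes as lone surrogates."""
    text = name_bytes.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-16-le", errors="surrogatepass")


def from_utf16_stable(name_utf16):
    """invF for the stable variant: restores the original bytes."""
    text = name_utf16.decode("utf-16-le", errors="surrogatepass")
    return text.encode("utf-8", errors="surrogateescape")


# A Latin-1 encoded name: the byte 0xE9 is not valid UTF-8 here.
raw_name = b"caf\xe9.txt"
```

With the lossy pair the original bytes cannot be recovered, so the file could no longer be opened by the name that came back; with the escaping pair invF(F(x)) = x holds for arbitrary bytes.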

> I guess there is no good solution for TStrings. Whatever string type is
> chosen, some programs will suffer.

tstrings will be "string". So whatever "string" is chosen for the OOP FPC
code (see first paragraph), that will be the declaration of tstrings.

But D2009 changes many streaming-related routines (load/save file/stream) to
add an encoding parameter with some default value. This decouples the tstrings
disk format from the memory format. Maybe that fixes your worry?
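
The idea of an encoding parameter that decouples the on-disk format from the in-memory format can be sketched like this. The Python below is only an analogy: `save_lines` and `load_lines` are hypothetical names, not the Delphi or FPC routines, and the default of UTF-8 is an assumption of this sketch.

```python
import io


def save_lines(lines, stream, encoding="utf-8"):
    """Write in-memory strings to a binary stream in the given disk encoding."""
    stream.write("\n".join(lines).encode(encoding))


def load_lines(stream, encoding="utf-8"):
    """Read a binary stream and decode it from the given disk encoding."""
    return stream.read().decode(encoding).splitlines()


lines = ["G\u00e4rtner", "\u65e5\u672c"]

# Same in-memory list, two different disk formats.
as_utf8 = io.BytesIO()
save_lines(lines, as_utf8)
as_utf16 = io.BytesIO()
save_lines(lines, as_utf16, encoding="utf-16")
```

The caller never sees which encoding the container uses internally; only the bytes on disk change with the parameter.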



Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Graeme Geldenhuys

On 21 August 2012 10:19, Ivanko B  wrote:
> Sure no problems for GUI. But how about processing large texts ?

Same experience as before. I must add "processing large text" is a
vague statement.

-- 
Regards,
  - Graeme -


___
fpGUI - a cross-platform Free Pascal GUI toolkit
http://fpgui.sourceforge.net

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Mattias Gaertner

On Tue, 21 Aug 2012 11:17:24 +0200
Aleksa Todorovic  wrote:

> On Tue, Aug 21, 2012 at 9:53 AM, Martin Schreiber  wrote:
> > On 21.08.2012 09:31, Graeme Geldenhuys wrote:
> >
> > Ehm, I did both. In the beginning MSEgui switched from Widestring to utf-8
> > encoded Ansistring because of the buggy FPC widestring implementation
> > (MSEgui started with Delphi/Kylix). Some weeks later I switched back to
> > widestring and bit the bullet to write FPC bug reports until it reached
> > usable stability.
> >
> >> But if you are such a UTF-16 (actually UCS-2 as
> >> that is what MSEgui supports) fan, why isn't MSEgui source code stored
> >> in UTF-16 encoding either? ;-)
> >
> >
> > Sure, MSEgui uses utf-8 for external storing and exchanging text data.
> > Internal all is 16 bit UnicodeString. Use the best encoding for the task.
> > ;-)
> 
> +1
> 
> There are lot of encodings around, but for different areas of application:
> - external text assets could be in any encoding (system-locale
> encoding, UTF8, UTF16 both BE and LE - for example, MS Excel export
> UTF16 text file)
> - Windows system calls are UTF16, on (most of) other platforms UTF8
> - input translation (physical keyboard to Unicode character)
> - internal application representation is choice of developer
> 
> The problem here is that libraries floating around (including RTL and
> FCL) use different string types (UnicodeString, UTF8String,
> AnsiString), so the question is - is it possible to (re)write those
> libraries in a generic way (RawByteString?), so they can work with any
> string type?

Theoretically you could rewrite the FCL to support UTF8String,
UnicodeString and AnsiString. But not at the same time. In an
application there will always be only one of them. So you have to ship for
each flavor a whole FCL plus all packages that depend on it.
I guess the FPC team wants to support at most one legacy and one
Unicode version. And eventually only the Unicode version.

 
> In my experience, only about 1% of applications requires handling of
> individual Unicode characters (input, rendering, GUI text editing).
> Other parts of application can happily without that knowledge :-)

True.
But that 1% may be scattered around the whole application and there
are no compiler warnings, so it is hard to find all places.

Mattias

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Marco van de Voort

In our previous episode, Mattias Gaertner said:
> > On 08/21/2012 10:32 AM, Mattias Gaertner wrote:
> > > IMO unicodestring should be the same on all platforms, because 
> > > otherwise the character size switches per platform, which is hard to 
> > > test and asking for trouble. 
> > This does seem appropriate. But right now Delphi compatibility forces 16
> > Bits and Lazarus forces 8 Bits :( .
> 
> Lazarus does not force "unicodestring" to anything for the simple
> reason, that it does not use it. It only provides some functions for
> converting UTF-8 to/from unicodestring.
> 
> At the moment Lazarus does not even use UTF8String, because the RTL
> does not use it.

Which reminds me to point you at
http://bugs.freepascal.org/view.php?id=22501

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Mattias Gaertner

On Tue, 21 Aug 2012 11:07:26 +0200
Michael Schnell  wrote:

> On 08/21/2012 10:17 AM, Graeme Geldenhuys wrote:
> > if you want to do string comparisons, one option is to normalise the 
> > text before you do a compare. 
> Other  than the conversion necessary with system-calls when a different 
> encoding is used internally, comparing strings happens very often within 
> the user code. So the compiler can't force using complex stuff like that 
> all over the place.

It does not.

Mattias

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Ivanko B

> How well will your "access char via index" code perform on
> that?

It'll mean "now is the time to switch to UCS-4" :)

Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-21 Thread Ivanko B

> For that reason there is no
> speed difference between using a UTF-16 or UTF-8 encoded string. Both
> can be coded equally efficient.

Not in general, since UTF-8 needs error handling, replacement of
unconvertible bytes and similar operations, which may affect the initial
data and make per-byte comparison unreliable.

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Jonas Maebe



marcov wrote on Tue, 21 Aug 2012:

> In our previous episode, Mattias Gaertner said:
>> For example under Linux file names are treated as UTF-8 but are only
>> bytes. They can and they do contain invalid UTF-8 characters.
>> If your program should support this, you must use a FindFirst
>> with UTF-8. To be clear: I don't say the default FindFirst under Linux
>> must be UTF-8, I only say, there must be one version with UTF-8, e.g.
>> FindFirstU8 and that must directly use the Linux file functions
>> without conversions.
>
> That's ugly indeed. Since that doesn't mean just an utf8 overload,

Since it's just raw bytes, it's actually as much utf-8 as it is
Windows Latin-1.

> but that
> the entire internal trajectory behind that (searchrec inclusive) must be
> 1-byte without conversion. Or the 1-byte to utf16 and back conversion must
> be stable.   (invF(F(x))=x)


Other frameworks also have to deal with this, and generally have a
particular default and allow the programmer (and sometimes the end
user) to override the default behaviour. E.g., glib assumes all file
names are UTF-8, but you can change this to "assume file names are
encoded in the current user's locale" or to "assume file names are
encoded using encoding XYZ" (either programmatically or via an
environment variable). Qt assumes they are encoded in the current
user's locale, but the programmer can change this to a different code
page (no environment variable). In practice, the default Qt and glib
behaviour is almost always the same on Linux nowadays, since UTF-8
locales are the default.


I'm not aware of a framework that allows you to say that file names  
are just random bytes. It would probably be possible to implement this  
in FPC by adding "support" for the invalid $ code page (both in  
ansistring and in unicodestring) and never converting anything if that  
one is used (basically overwrite the destination string's codepage  
with $ if it's used by the source). Other options are not  
supporting invalid file names in the cross-platform RTL interface  
(have to use platform-specific APIs to deal with them on platforms  
that "support" such file names, like with glib and Qt), optionally  
adding "raw" overloads of such functions that possibly even accept and  
return arrays of byte rather than strings in order to avoid any  
accidental conversions and to make it clear what you're dealing with.



Jonas

Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-21 Thread Ivanko B

I always get excited by how Graeme defends the solutions of his choice :)

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Aleksa Todorovic

On Tue, Aug 21, 2012 at 11:41 AM, Mattias Gaertner
 wrote:
>
> Theoretically you could rewrite the FCL to support UTF8String,
> UnicodeString and AnsiString. But not at the same time. In an
> application there will always be only one of them. So you have to ship for
> each flavor a whole FCL plus all packages that depend on it.
> I guess the FPC team wants to support at most one legacy and one
> Unicode version. And eventually only the Unicode version.

Another idea is to use String all over the place as a generic string
which can hold string in any encoding (probably another modeswitch for
this?). So, (in theory) TStrings could store both UTF8 and UTF16
strings at the same time, and compiler magic would do necessary
conversions when needed. Now, since I don't really know much about
compiler internals, I'll try to give it some time and see if above
idea is applicable at all.

>> In my experience, only about 1% of applications requires handling of
>> individual Unicode characters (input, rendering, GUI text editing).
>> Other parts of application can happily without that knowledge :-)
>
> True.
> But that 1% may be scattered around the whole application and there
> are no compiler warnings, so it is hard to find all places.

Yes, they will most probably be scattered all around, but then it's a
developer-side organizational challenge, not a compiler one.

Aleksa

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Mattias Gaertner

On Tue, 21 Aug 2012 14:19:44 +0500
Ivanko B  wrote:

> I have implemented multiple text edit/display widgets that do plenty
>  of string manipulation... all based on the UTF-8 encoding. I have
>  suffered NO speed penalties.
> 
> Sure no problems for GUI. But how about processing large texts ?

Especially on large texts UTF-8 can be better, because it needs less
memory and fetching memory pages is expensive.
I ported the widestring XML units of FPC to UTF-8
because I had to handle thousands of xml files with about 400
MB. Because these documents are in UTF-8, parsing is about 2-3
times faster on these documents and searching is about 20 to 50% faster,
which is pretty much the saved memory pages.
The UTF-8 overhead is not measurable.
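
The memory argument can be checked on small samples. The Python sketch below only compares encoded sizes; the sample strings are invented for illustration and are not taken from the documents mentioned above, and it says nothing about parsing or search speed.

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for the given text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Markup-heavy text: ASCII tags around a short Hebrew word.
xml_hebrew = '<title lang="he">\u05e9\u05dc\u05d5\u05dd</title>'

# Pure CJK text from the BMP: 3 bytes per character in UTF-8, 2 in UTF-16.
cjk = "\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8"

# Text outside the BMP: 4 bytes per character in both encodings.
emoji = "\U0001F600\U0001F601"

for sample in (xml_hebrew, cjk, emoji):
    utf8_size, utf16_size = encoded_sizes(sample)
    print(utf8_size, utf16_size)
```

Which encoding is smaller depends on the mix of scripts and markup, which is consistent with the point that there are good examples for every encoding.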

Another example are the codetools.

Mattias

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Ivanko B

> Because these documents are in UTF-8 parsing is about 2-3
> times faster on these documents, searching is about 20 to 50% faster

Because your name is the Latin ANSISTRING "Mattias Gaertner"  :)  But
imagine gigabytes of 4 bytes/char UTF-8 text.

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Marco van de Voort

In our previous episode, Graeme Geldenhuys said:
> On 21 August 2012 10:19, Ivanko B  wrote:
> > Sure no problems for GUI. But how about processing large texts ?
> 
> Same experience as before. I must add "processing large text" is a
> vague statement.

I think unicode or not is a bigger performance hit than utf8 vs utf16.

All routines like capitalization (routinely used for case-insensitive
comparison) get a lot more complicated. Many routines must forfeit
their simple charset loops and will do a call for any set test.

utf8 <-> utf16 <- any 256-char (ansi) charset conversion operations are fairly
simple and mechanical operations that don't need much context.  They are
probably much cheaper than a single uppercase, which we routinely do for
case-insensitive comparisons.

utf8/16 -> ansi conversions are a bit more involved (since they map many
chars to few, a naive implementation requires large lookup sets).


Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Mattias Gaertner

On Tue, 21 Aug 2012 10:24:38 +0100
Graeme Geldenhuys  wrote:

> On 21 August 2012 10:01, Mattias Gaertner  wrote:
> >> > The conversion is done only when entering and exiting the OS / GUI 
> >> > framework
> >> > calls. I understand this does not happen too often.
> >>
> >> I beg to differ.
> >
> > Maybe you can name some example.
> 
> 
> OK, let's assume I'm under Linux and fpGUI uses UTF-8 internally
> (ideally I would like fpGUI to use the correct encoding on each
> platform).
> 
> TfpgMemo.LoadFromFile().   TfpgMemo uses a TStringList internally,
> which LoadFromFile maps to. Under Linux, filenames are UTF-8 encoded,
> so conversion occurs because TStringList uses UTF-16 (as per the FPC
> 2.7.1 suggestions) for the filename variable, and for all the file
> access routines. The file contains UTF-8 encoded text. It is now
> converted to UTF-16 text to be stored internally in the TStringList.
> fpGUI now needs to display that text, so calls the UTF-8 X11 APIs, so
> another conversion is required.

The conversion of the file names is negligible compared to the
overhead of the OS itself. Even on Linux.

Converting a text from UTF-8 to UTF-16 adds some overhead, but compared
to loading it from disk the overhead is not that big. Normally an
application does not simply load a text, it scans or parses the text.
So for most programs the conversion accounts for only a few percent.

Painting a string is much more expensive than the conversion.

I agree that TStringList can easily create a performance problem, but
afaik loading a text into a GUI is not a good example to
show conversion overhead.

> This is a simple example, but look at all the conversions already. Now
> if UnicodeString uses the correct encoding on each platform, the
> conversions would be zero!

No. On Windows you have to open UTF-8 files too.

Mattias


Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-21 Thread Mattias Gaertner

On Tue, 21 Aug 2012 14:59:57 +0500
Ivanko B  wrote:

>> For that reason there is no
>> speed difference between using a UTF-16 or UTF-8 encoded string. Both
>> can be coded equally efficient.
>
> Not in general, since UTF-8 needs error handling, replacement of
> unconvertible bytes and similar operations, which may affect the initial
> data and make per-byte comparison unreliable.

For example?

Mattias

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Graeme Geldenhuys

On 21 August 2012 11:32, Marco van de Voort  wrote:
> All routines like capitalization (routinely used for case-insensitive
> comparison) get a lot more complicated.

Obviously Unicode is a lot more complicated, because it is designed for
_all_ spoken and non-spoken languages. ASCII is minute in comparison.

Good news is that Unicode can do lots of other things too which could
help performance.
eg: You don't need to attempt to capitalise characters if they are
outside a Unicode category "Letter" and "Uppercase". Or for a Unicode
StrToInt() implementation, simply do a simple "Number" or "Decimal
Digit" category check before you attempt the conversion.

Unicode categories are very useful, and they are simple lookup tables
- and already available for FPC to use (implemented in Object Pascal).
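
The category checks described above can be sketched with Python's standard `unicodedata` module, used here only as a stand-in for the FPC lookup tables. The two function names are invented for this sketch, and treating only category "Ll" as needing capitalisation is a simplifying assumption.

```python
import unicodedata


def upper_if_letter(ch):
    """Call upper() only for lowercase letters (general category 'Ll')."""
    if unicodedata.category(ch) == "Ll":
        return ch.upper()
    return ch  # digits, punctuation, uppercase letters etc. pass through


def str_to_int(text):
    """Convert decimal digits of any script, checking category 'Nd' first."""
    if not text or any(unicodedata.category(c) != "Nd" for c in text):
        raise ValueError("not a decimal number: %r" % text)
    value = 0
    for c in text:
        value = value * 10 + unicodedata.decimal(c)
    return value
```

For example, `str_to_int` accepts Arabic-Indic digits as well as ASCII digits, and rejects any input with a non-digit before doing arithmetic.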

-- 
Regards,
  - Graeme -

___
fpGUI - a cross-platform Free Pascal GUI toolkit
http://fpgui.sourceforge.net

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Mattias Gaertner

On Tue, 21 Aug 2012 15:12:03 +0500
Ivanko B  wrote:

>> Because these documents are in UTF-8 parsing is about 2-3
>> times faster on these documents, searching is about 20 to 50% faster
>
> Because your name is the Latin ANSISTRING "Mattias Gaertner"  :)

Actually my name is Gärtner.
The texts are mostly Latin, Arabic and Hebrew. They are smaller in
UTF-8 because xml has lots of English tags.

> But Imagine gigabytes of 4 bytes/char UTF-8 text.

I gave you an example where UTF-8 is better. I already wrote there
are good examples for every Unicode encoding, including UTF-16.

Please stop claiming that UTF-16 is always
faster/better/smaller/safer/whatever than UTF-8.

Mattias

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Graeme Geldenhuys

Hi,

On 21 August 2012 11:45, Mattias Gaertner  wrote:
> I agree that TStringList can easily create a performance problem, but
> afaik loading a text into a GUI is not a good example to
> show conversion overhead.

Maybe so, but it does debunk the statement "does not happen too often".

>> This is a simple example, but look at all the conversions already. Now
>> if UnicodeString uses the correct encoding on each platform, the
>> conversions would be zero!
>
> No. On Windows you have to open UTF-8 files too.

OK, so zero is maybe incorrect. Let's change it to 1 conversion (the
file contents only, seeing that just about nobody stores files in
UTF-16 encoding). Now compare 1 conversion to the multiple conversions
under Linux if the RTL is only UTF-16 based.

And as you so clearly stated in a prior message, it depends on what
your application does. Some programs will be heavily penalised by so
many conversions.

-- 
Regards,
  - Graeme -

___
fpGUI - a cross-platform Free Pascal GUI toolkit
http://fpgui.sourceforge.net

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Hans-Peter Diettrich


Aleksa Todorovic wrote:

> On Tue, Aug 21, 2012 at 10:16 AM, Ivanko B  wrote:
>>> Handling 1..4(6) bytes is less efficient than handling surrogate
>>> *pairs*.
>>
>> But surrogate pairs break array-like fast char access anyway, don't they?
>
> It's also "broken" in UTF8 in the same way - so none of them gets +1
> on this. UCS4 is the only real winner here (one dword for each
> character).

Depending on the language, ligatures etc. still can span multiple 
codepoints. IMO everybody should decide whether he wants to do text 
processing for full Unicode, or whether simple stringhandling (as used 
till now) is sufficient.


I have never heard that non-canonical text has caused problems in character
sets with accents or umlauts - except in (MacOS, Linux) filenames. Since
file searches have to use the platform API, all required special
handling can be encapsulated in the RTL.
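
The non-canonical case can be shown with a two-line example. This Python sketch uses the standard `unicodedata.normalize` function; `same_text` is a hypothetical helper, and normalising both sides to NFC is just one possible way to compare.

```python
import unicodedata

composed = "\u00e4"      # "ä" as a single code point (precomposed form)
decomposed = "a\u0308"   # "a" followed by COMBINING DIAERESIS


def same_text(a, b):
    """Compare two strings after normalising both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

Both strings render as the same umlaut, yet they differ in length and compare unequal until normalised. This is also why even a fixed-width encoding such as UCS-4 does not give one array element per visible character.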


Breaking strings into substrings can be done on specific delimiters 
(spaces...), which are all ASCII, again no complication with UTF.  A 
comparison or search for given patterns also is insensitive to the 
encoding. Where would one really need indexed access to single characters?
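
Splitting on ASCII delimiters and searching for a pattern can both be done directly on UTF-8 bytes, because bytes below 0x80 never occur inside a multi-byte sequence. The Python sketch below illustrates this; the function names are invented here, and it assumes both haystack and needle are valid UTF-8.

```python
def split_words_utf8(data):
    """Split UTF-8 bytes on ASCII spaces without decoding anything."""
    return data.split(b" ")


def find_utf8(haystack, needle):
    """Byte-wise substring search; returns a byte offset or -1."""
    return haystack.find(needle)


data = "G\u00e4rtner gr\u00f6\u00dfer \u65e5\u672c".encode("utf-8")
words = [w.decode("utf-8") for w in split_words_utf8(data)]
pos = find_utf8(data, "gr\u00f6\u00dfer".encode("utf-8"))
```

The offset returned is a byte index rather than a character index, which is sufficient for slicing out the match or the text around it.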


DoDi


Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-21 Thread Hans-Peter Diettrich


Martin Schreiber wrote:

>> All "access a char by index into a string" code I have seen, 99.99% of
>> the time work in a sequential manner. For that reason there is no
>> speed difference between using a UTF-16 or UTF-8 encoded string. Both
>> can be coded equally efficient.
>
> Graeme, this is simply not true. When searching for known German characters
> in a UnicodeString, the program can use the simple approach of indexing by
> character (code unit). This is even possible for known Chinese symbols of the
> BMP. And a simple "if" for surrogate pairs is more efficient than a 4-stage
> "case" for utf-8.


The good ole Pos() can do that, why search for more complicated 
implementations?


You still try to use old coding patterns which are simply inappropriate 
for dealing with Unicode strings. Why make a distinction between 
searching for a single character or multiple characters, when it's known 
that one character can require multiple bytes or words in UTF-8/16?


DoDi


Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Hans-Peter Diettrich


Aleksa Todorovic wrote:

> The problem here is that libraries floating around (including RTL and
> FCL) use different string types (UnicodeString, UTF8String,
> AnsiString), so the question is - is it possible to (re)write those
> libraries in a generic way (RawByteString?), so they can work with any
> string type?


RawByteStrings are not a solution, because the specific encoding of
every parameter has to be checked inside such a subroutine, and specific
handling of all encodings has to be implemented there. It's easier to
expect strings of a specific encoding, and let the compiler insert
any necessary conversions automatically. Then overloaded subroutines can be
provided, for the 3 common encodings (CP_ACP, CP_UTF8 and UTF-16).


What I'm missing is a true binary string encoding, which is never
subject to automatic conversions. Why implement TBytes or similar
classes for data buffers, with zero comfort compared to standard
string handling operators/functions?


DoDi


Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-21 Thread Hans-Peter Diettrich


Ivanko B wrote:

>> For that reason there is no
>> speed difference between using a UTF-16 or UTF-8 encoded string. Both
>> can be coded equally efficient.
>
> Not in general, since UTF-8 needs error handling, replacement of
> unconvertible bytes and similar operations, which may affect the initial
> data and make per-byte comparison unreliable.


When dealing with floating point values you don't bother with their 
encoding of sign, exponent and mantissa. Why do you want to do such 
low-level bitfiddling with strings?


DoDi


Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-21 Thread Hans-Peter Diettrich


Graeme Geldenhuys wrote:

> On 20 August 2012 23:18, Hans-Peter Diettrich wrote:
>
>> The Delphi developers wanted to implement what you suggest, but dropped that
>> approach later again.
>
> When Embarcadero implemented Unicode support, Delphi was a pure
> Windows application. They had no need to think of anything other than
> what Windows supports.


So what? The poor performance of a variable char-size string type is
not related to any platform.




>> A character type is somewhat useless, unless all strings are UTF-32 (what's
>> quite unlikely now). Instead substrings should be used, which can contain
>> any number of bytes or characters.
>
> I guess that depends on how you define the Char type. Is it meant to
> hold a single Unicode codepoint, or a single printable character. If
> the latter, then probably a bigger Char type is required.


A string can contain any number of characters, including zero. Why make
a distinction between handling a single character and handling multiple
characters? A UTF-32 Char type will require implicit conversion into a
string before it can be used with strings of any other encoding. Not
very efficient, indeed :-(
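To make the sizes concrete, here is a small Python sketch (not FPC code; the helper `sizes` is a hypothetical name). It reports how much room a single code point needs in each encoding, which is why a fixed 4-byte Char has to be re-encoded before it can be combined with a UTF-8 or UTF-16 string:

```python
def sizes(ch: str) -> tuple:
    """Return (UTF-8 bytes, UTF-16 code units, UTF-32 bytes) for one code point."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")) // 2,
        len(ch.encode("utf-32-le")),
    )

for ch in ["A", "ä", "€", "😀"]:
    utf8_bytes, utf16_units, utf32_bytes = sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_bytes} byte(s), "
          f"UTF-16 {utf16_units} unit(s), UTF-32 {utf32_bytes} bytes")
```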




>> You also have to explain what String[4] means in a Unicode environment.
>
> The String[] syntax in Object Pascal means you are defining a
> shortstring type (irrespective of compiler mode), thus an array of
> bytes. In this case 4-bytes are used to hold any Unicode codepoint.


Why abuse a ShortString type, when any ordinal 4-byte value will do the 
same? Did you consider that ShortStrings deserve special handling, WRT 
e.g. their Length field? The 5 bytes in memory also don't fit nicely 
into an aligned memory layout, and the compiler may insert range 
checking and other useless code. When ordinary ShortStrings have their 
own fixed encoding (CP_ACP?), you'll have to tell the compiler to ignore 
all that when dealing with your Char=String[4] type :-(



>> Q: Did you ever read about the new string implementation of FPC?
>
> I have read some of the message threads that went around in fpc-devel,
> I also worked on the cp branch before it was merged with Trunk. If you
> have any other "documentation" in mind, please post the URL and I'll
> happily take a look.


Then read it again, you seem to have missed essential points.

DoDi


Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Mattias Gaertner

On Tue, 21 Aug 2012 12:09:52 +0100
Graeme Geldenhuys  wrote:

>[...]
> >> This is a simple example, but look at all the conversions already. Now
> >> if UnicodeString uses the correct encoding on each platform, the
> >> conversions would be zero!
> >
> > No. On Windows you have to open UTF-8 files too.
> 
> OK, so zero is maybe incorrect. Let change it to 1 conversion (the
> file contents only, seeing that just about nobody stores files in
> UTF-16 encoding). Now compare 1 conversion to the multiple conversions
> under Linux if the RTL is only UTF-16 based.

True. But let's be realistic. Some conversions are not measurable
and are ok. Some can be avoided by simple changes to the application.

> And as you so clearly stated in a prior message, it depends on what
> your application does. Some programs will be heavily penalised by so
> many conversions.

Yes. But maybe these applications can be adapted easily.
This discussion should be about the issues where the conversions
matter and there is no simple workaround.
It would be good if everyone who knows of such a problem comes up with it
now, so the FPC team can give advice and/or consider it in the
Unicode RTL.

Mattias

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Michael Schnell


On 08/21/2012 11:09 AM, Graeme Geldenhuys wrote:
> Can't we just introduce UTF8String and UTF16String types. By the name
> they clearly state what encoding they hold.


It does make sense to (optionally) provide dynamically encoded strings,
so that it is possible to write library functions that work with any
encoding (and can be called with any encoding) without the need to
provide overloaded functions or do conversions under the hood.


-Michael

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Jonas Maebe



Mattias Gaertner wrote on Tue, 21 Aug 2012:


But let's be realistic. Some conversions are not measurable
and are ok.


Case in point: the FPC Win32 RTL until now. It always uses the  
ansistring versions of OS interface functions, while NT-based Windows  
OSes internally all work with UTF-16. This obviously has caused  
problems due to unrepresentable file names, but in terms of performance?
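The "unrepresentable file names" problem can be reproduced in a few lines of Python. This is only an analogy: cp1252 is assumed as the ANSI code page, and Python's `errors="replace"` stands in for whatever substitution the OS conversion performs.

```python
name = "报告_2012.txt"

# Squeezing the name through an 8-bit "ANSI" code page loses the CJK characters.
ansi = name.encode("cp1252", errors="replace")
round_trip = ansi.decode("cp1252")

# The UTF-16 form, which NT-based Windows uses internally, is lossless.
utf16 = name.encode("utf-16-le")

print(ansi)                                # the two CJK characters became '?'
print(round_trip == name)                  # the ANSI round trip is lossy
print(utf16.decode("utf-16-le") == name)   # the UTF-16 round trip is not
```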



Jonas

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Michael Schnell


On 08/21/2012 01:09 PM, Graeme Geldenhuys wrote:


> Maybe so, but it does debunk the statement "does not happen too often".

With "not so often" I meant program runtime: it is usually not called
in a tight, long-running loop.


-Michael

Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-21 Thread Martin Schreiber


On 21.08.2012 12:52, Hans-Peter Diettrich wrote:

> The good ole Pos() can do that, why search for more complicated
> implementations?
>
> You still try to use old coding patterns which are simply inappropriate
> for dealing with Unicode strings. Why make a distinction between
> searching for a single character or multiple characters, when it's known
> that one character can require multiple bytes or words in UTF-8/16?

I wrote "known German characters" and "known Chinese symbols of the BMP" 
for example character constants. If you want to read some examples of 
problems with utf-8 especially for pupils and Pascal beginners read the 
German Lazarus Forum or freepascal.ru. Why should we design programming 
so that it complicates the work for them? Anyway, I don't care, do what 
you want but please implement Unicode resource strings in FPC compiler.


Thanks, Martin

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Ludo Brands


> 
> Yes. But maybe these applications can be adapted easily.
> This discussion should be about the issues where the 
> conversions matter and there is no simple workaround. It 
> would be good if everyone who knows such a problem comes up 
> with it now, so the FPC team can give an advice and/or 
> consider it in the Unicode RTL.
> 

There is the large category of network apps. Most protocols are utf8 or have
a clear preference for utf8 (json for example). Databases are an extension
of that and have the additional complication that they can mix codepages at
any level. These apps can be quite sensitive to conversion overhead. 

Ludo


Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Michael Schnell


On 08/21/2012 12:02 PM, Aleksa Todorovic wrote:
> Yes, they will most probably be scattered all around, but then - it's
> a developer-related organizational challenge, not a compiler one.

The compiler should not, across large areas of code, produce programs that
no longer work the way they did with a former (non-Unicode) version. So
maybe it should not compile myString[i] at all, since that will not work as
expected in most cases (at least with UTF-8 and Unicode-agnostic users).


-Michael

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Michael Schnell


On 08/21/2012 02:11 PM, Michael Schnell wrote:

> So maybe it should not compile myString[i] at all


... and provide a decent enumerator syntax instead.
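What such an enumerator could look like is sketched below in Python (illustration only; `utf8_codepoints` is a hypothetical name, and the input is assumed to be well-formed UTF-8). Indexing the raw bytes yields code units, while the enumerator yields whole code points together with their byte offsets:

```python
def utf8_codepoints(data: bytes):
    """Yield (byte_offset, character) for each code point in well-formed UTF-8."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            size = 1                      # 0xxxxxxx: ASCII
        elif lead >> 5 == 0b110:
            size = 2                      # 110xxxxx
        elif lead >> 4 == 0b1110:
            size = 3                      # 1110xxxx
        elif lead >> 3 == 0b11110:
            size = 4                      # 11110xxx
        else:
            raise ValueError(f"unexpected byte 0x{lead:02X} at offset {i}")
        yield i, data[i:i + size].decode("utf-8")
        i += size

s = "aß€".encode("utf-8")
items = list(utf8_codepoints(s))
print(items)   # whole characters with their byte offsets
print(s[1])    # plain indexing returns one byte of 'ß', not a character
```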

-Michael


Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Michael Schnell


On 08/21/2012 11:22 AM, Mattias Gaertner wrote:


> Lazarus does not force "unicodestring" to anything for the simple
> reason, that it does not use it. It only provides some functions for
> converting UTF-8 to/from unicodestring.
>
> At the moment Lazarus does not even use UTF8String, because the RTL
> does not use it.

AFAIK, all the GUI related LCL calls use UTF8 in the parameters and 
results. OTOH all Delphi VCL GUI calls use 16 bit string encoding in the 
parameters and results.


-Michael

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Mattias Gaertner

On Tue, 21 Aug 2012 14:22:17 +0200
Michael Schnell  wrote:

> On 08/21/2012 11:22 AM, Mattias Gaertner wrote:
> >
> > Lazarus does not force "unicodestring" to anything for the simple
> > reason, that it does not use it. It only provides some functions for
> > converting UTF-8 to/from unicodestring.
> >
> > At the moment Lazarus does not even use UTF8String, because the RTL
> > does not use it.
> >
> AFAIK, all the GUI related LCL calls use UTF8 in the parameters and 
> results. OTOH all Delphi VCL GUI calls use 16 bit string encoding in the 
> parameters and results.

The VCL uses the same string as the Delphi RTL classes. 
Formerly the encoding was MS Windows codepage, now it is UTF-16.

The LCL uses the same string as the FCL classes. 
The FCL uses 8-bit strings without forcing an encoding, except for file
names. The LCL expects UTF-8 and provides UTF-8 file functions.
If the FCL moves to another string or starts enforcing an encoding the
LCL has to be adapted.

Mattias

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Graeme Geldenhuys

On 21 August 2012 13:03, Michael Schnell  wrote:
> With "not so often" I meant program runtime:  it is usually not called in a
> close long running loop.

I have a program that does exactly that... Loads files to do CRC
checking to see what changed. It's a recursive find-all that goes
through 100k or more files. It's already a slow process on non-SSD
drives processing 12GB or more of data, so adding the multiple
unnecessary string conversions which will be enforced on Linux users
would make that even worse. For me, every optimisation counts.

With some simple trials in various projects I can clearly see a
Unicode RTL with one string type, and native encoding on each platform
as very plausible. So why the resistance to implementing something as
efficient as that??? Such resistance by the FPC project is what
baffles my mind, and why I sometimes agree with Martin Schreiber's
idea to fork everything and just implement your own damn RTL.

[just finished reading a new Embarcadero blog post about post-XE3 development]
Embarcadero is pretty much dropping the desktop scene (no future there
according to them) and will primarily cater only for mobile
development (iOS and Android). A new compiler, new linker, new
toolkit, new RTL, new debugger. They know full well it will be a
"complete code breaking change again". Oooh, I wonder how FPC and
Lazarus are going to "clone" that? Does everybody still want to keep
following Delphi?

 http://blogs.embarcadero.com/jtembarcadero/2012/08/20/xe3-and-beyond/

-- 
Regards,
  - Graeme -

___
fpGUI - a cross-platform Free Pascal GUI toolkit
http://fpgui.sourceforge.net

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Mattias Gaertner

On Tue, 21 Aug 2012 14:05:33 +0200
"Ludo Brands"  wrote:

> 
> > 
> > Yes. But maybe these applications can be adapted easily.
> > This discussion should be about the issues where the 
> > conversions matter and there is no simple workaround. It 
> > would be good if everyone who knows such a problem comes up 
> > with it now, so the FPC team can give an advice and/or 
> > consider it in the Unicode RTL.
> > 
> 
> There is the large category of network apps. Most protocols are utf8 or have
> a clear preference for utf8 (json for example). Databases are an extension
> of that and have the additional complication that they can mix codepages at
> any level. These apps can be quite sensitive to conversion overhead. 

Well, without more details the advice is probably to use UTF8String.

Mattias

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Mattias Gaertner

On Tue, 21 Aug 2012 13:53:14 +0100
Graeme Geldenhuys  wrote:

> On 21 August 2012 13:03, Michael Schnell  wrote:
> > With "not so often" I meant program runtime:  it is usually not called in a
> > close long running loop.
> 
> I have a program that does exactly that... Loads files to do CRC
> checking to see what changed. It's a recursive find-all that goes
> through 100k or more files. It's already a slow process on non-SSD
> drives processing 12GB or more of data, so adding the multiple
> unnecessary string conversions which will be enforced on Linux users
> would make that even worse. For me, every optimisation counts.

Then you would not use TStrings in the first place.

 
> With some simple trials in various projects I can clearly see a
> Unicode RTL with one string type, and native encoding on each platform
> as very plausible. 

One string type and native encoding. Do you mean the current AnsiString?

I guess you mean UTF-16/UTF-8 depending on platform. That would be
different character sizes, which means lots of IFDEFs in users' code.

Mattias

Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-21 Thread Ivanko B

For example?
==
Sometimes when reading directory/file names. Sometimes PostgreSQL produces
UTF-8 dumps with errors that cause problems when converting to a single-byte
encoding (KOI8-R); I have to use the "-c" switch of iconv for such
conversions. Such errors are really not rare, but you (users of Latin
scripts) are just unaware of them.
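A small Python reproduction of this kind of failure, under the assumption that the damage is a truncated multi-byte sequence. Python's `errors="ignore"` is used here as a rough analogue of what iconv's "-c" switch is used for above; it is not the same tool.

```python
good = "Привет".encode("utf-8")
dump = good[:-1] + b" ok"        # the last Cyrillic letter lost its second byte

# Strict decoding refuses the damaged input.
try:
    dump.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Dropping the invalid byte lets the conversion to KOI8-R go through,
# at the price of silently changing the data.
cleaned = dump.decode("utf-8", errors="ignore")
koi8 = cleaned.encode("koi8_r")

print(strict_ok, cleaned, len(koi8))
```

The cleaned text is one letter shorter than the original, which is the "may affect the initial data" concern raised earlier in the thread.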

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Marcos Douglas

On Tue, Aug 21, 2012 at 6:09 AM, Graeme Geldenhuys
 wrote:
> Hi,
>
> On 21 August 2012 09:32, Mattias Gaertner  wrote:
>>
>> IMO unicodestring should be the same on all platforms, because
>> otherwise the character size switches per platform,
>
>
> Please define "character" in your sentence above. Are you referring to
> a Unicode codepoint, or a "printable character"? If the first, then 4
> bytes is always sufficient on all platforms.
>
>> The compiler already supports an UTF8String, right?
>> If yes, then some functions can use UTF8String, some UnicodeString
>> (=UTF-16) and the compiler magic will convert automatically.
>
> How I would wish for FPC to stop this ridiculous ambiguity that Delphi
> enforces. Can't we just introduce UTF8String and UTF16String types. By
> the name they clearly state what encoding the hold.  A UnicodeString
> type should mean any Unicode encoding, and defaults to UTF-8 under
> *nix type systems and UTF-16 under Windows. Thus no performance loss
> on any platform. After all the name "Unicode String" does not imply
> UTF-16 only - as per the Unicode Standards.
>
>
>> The difficult decision is what functions and types should use UTF-8
>> and what UTF-16. This may depend on the platform.
>
> As I said, if you use the correct default encoding on each platform
> for the UnicodeString type, the problem you mention will not be a
> problem any more. Linux will use UTF-8 by default, so file handling
> and API was will work without any conversion.
>
> The whole RTL should use UnicodeString type, where the encoding is as
> I described above.
>
>
>> I guess there is no good solution for TStrings. Whatever string type is
>> chosen, some programs will suffer.
>
> Why will some suffer? Simply default UnicodeString to the correct
> encoding on each platform, and no performance issues and no
> unnecessary conversions will occur.

Makes a lot of sense, and AFAIK so far no one has said why this approach
would not work.

Marcos Douglas

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Ludo Brands

 
> > There is the large category of network apps. Most protocols 
> are utf8 
> > or have a clear preference for utf8 (json for example). 
> Databases are 
> > an extension of that and have the additional complication that they 
> > can mix codepages at any level. These apps can be quite 
> sensitive to 
> > conversion overhead.
> 
> Well, without more details the advice is probably to use UTF8String.
> 

A more detailed example then. A web application that fills in HTML templates
with variable data coming from e.g. a database or whatever. HTML markup is
all ASCII. So parsing an iso-8859-1 or UTF-8 template and making ASCII tag
substitutions is exactly the same in both code pages. ASCII uppercasing works
nicely in both, so tags are case insensitive at virtually no cost. The problem
starts when a string is supposed to have a code page and conversions are made
before functions like concatenating strings, uppercase, pos, etc. See
http://bugs.freepascal.org/view.php?id=22501. Detecting the code page of the
template and setting the string cp accordingly? Detecting code pages can be
quite expensive.
Even in the UTF-8-only case, converting everything to UTF-16 to do some basic
string manipulations, as suggested, can quickly lead to bottlenecks in
high-volume web servers. I understand one cannot make an RTL for every code
page, but the question was to list application areas where string conversions
could be important or critical. I'm not pushing one or the other solution ;)
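The template case can be sketched in Python as follows (illustration only; the `{{NAME}}` placeholder syntax and the `fill` helper are assumptions, not taken from any real template engine). The substitution works on raw bytes and never needs to know whether the template is iso-8859-1 or UTF-8, because ASCII bytes never occur inside a multi-byte sequence in either encoding:

```python
def fill(template: bytes, values: dict) -> bytes:
    """Replace ASCII {{TAG}} placeholders on the raw bytes, without decoding."""
    out = template
    for tag, value in values.items():
        out = out.replace(b"{{" + tag + b"}}", value)
    return out

page = "<p>Grüße, {{NAME}}!</p>"
results = {}
for enc in ("iso-8859-1", "utf-8"):
    template = page.encode(enc)             # template as stored on disk
    value = "Jörg".encode(enc)              # data delivered in the same encoding
    results[enc] = fill(template, {b"NAME": value})

for enc, data in results.items():
    print(enc, len(data), data.decode(enc))
```

The two outputs differ in byte length but decode to the same text, and no conversion was performed along the way.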

Ludo  


Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Graeme Geldenhuys

On 21 August 2012 14:13, Mattias Gaertner  wrote:
> One string type and native encoding. Do you mean the current AnsiString?

I meant a string type that changes its encoding based on the platform
it is compiled for. UTF-16 under Windows, UTF-8 under others. The RTL
then uses that single string type throughout.

The Char type would be defined as String[4]  (max size in bytes of a
unicode codepoint)


> I guess you mean UTF-16/UTF-8 depending on platform. That would be
> different character sizes, which means lots of IFDEFs in users code.

Why? Can you give an example where IFDEF's would be required?


-- 
Regards,
  - Graeme -



Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-21 Thread Mattias Gaertner

On Tue, 21 Aug 2012 18:15:07 +0500
Ivanko B  wrote:

> For example?
> ==
> Sometime reading directory/file names. Sometime PostgreSQL produces
> UTF-8 dumps with errors causing problems to converting to single byte
> encoding (KOI8-R) - me have to use the "-c" switch of ICONV for such
> conversions. Really not seldom errors, but You (latins) are just
> unaware of them.

Ivanko, your mailer replied this mail to another mail of this thread,
which is somewhat confusing. Then you kept only the last
line of the text you replied to and now it is totally confusing.

I guess the "For example?" is from my mail:

> > For that reason there is no
> >  speed difference between using a UTF-16 or UTF-8 encoded string. Both
> >  can be coded equally efficient.
> > ==
> > No in common, since UTF-8 needs error handling, replacing for
> > unconvertable bytes etc operations which may effect initial data which
> > makes per-byte comparision unreliable.  
>
> For example?

If you replied to this mail then you lost me.
I don't understand what problem of UTF-8 for the RTL you want to point
out. Can you explain again?

Mattias

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Marco van de Voort

In our previous episode, Graeme Geldenhuys said:
> The Char type would be defined as String[4]  (max size in bytes of a
> unicode codepoint)

Doesn't sound wise.  length(stringtype)=n should mean that the string takes
sizeof(char)*n bytes. (give or take the #0#0)



Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Graeme Geldenhuys

On 21 August 2012 14:54, Marco van de Voort  wrote:
>
> Doesn't sound wise.  length(stringtype)=n should mean that the string takes
> sizeof(char)*n bytes. (give or take the #0#0)

I'm not sure what you are trying to accomplish? Give me sample code
that will cause a problem.

In fpGUI I have UTF8Length(mystring) which returns the actual number
of code points used - not bytes used. If you want the number of bytes
used, simply use Length(mystring).  Use each of those at appropriate
times based on what you want to accomplish.
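The difference between the two counts can be shown in a few lines of Python (an illustration, not the fpGUI implementation): the byte count is the length of the encoded data, and the code point count is the number of bytes that are not UTF-8 continuation bytes (bit pattern 10xxxxxx).

```python
s = "Ünïcödé ✓"
utf8 = s.encode("utf-8")

byte_count = len(utf8)   # what a byte-based Length() reports

# Every code point starts with exactly one byte that is not a continuation byte.
codepoint_count = sum(1 for b in utf8 if (b & 0xC0) != 0x80)

print(byte_count, codepoint_count)
```

The counting trick assumes well-formed UTF-8; damaged input needs validation first.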

The RTL Length() function has been the source of lots of confusion to
Delphi and FPC developers. So without an actual use-case I don't know
what you are trying to do.

-- 
Regards,
  - Graeme -


Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Mattias Gaertner

On Tue, 21 Aug 2012 10:23:10 -0300
Marcos Douglas  wrote:

>[...]
> >> I guess there is no good solution for TStrings. Whatever string type is
> >> chosen, some programs will suffer.
> >
> > Why will some suffer? Simply default UnicodeString to the correct
> > encoding on each platform, and no performance issues and no
> > unnecessary conversions will occur.
> 
> Make much sense and AFAIK so far no one has said why this approach
> would not work.

First of all: How do you define the "native encoding"?
The encoding of the file system? Ext4 does not have one, HFS+ knows
only normalized UTF-8.
The encoding of readln/writeln? That depends on environment options
rather than OS.
The Unicode functions of the system libraries? Most applications
rarely use them, but rather use frameworks like fpGui, MSEGui or the
LCL. Linux does not have such functions.
Make a poll? Then you should not call it "native encoding".

Second: Why is it bad to have a platform dependent string type?
- More test work. It's not sufficient any more to test a string
  function on one platform, you have to test it on all platforms. That's
  especially hard for projects where some developers have only access
  to a few OS.
- Harder to optimize. It's easy to write a few optimized functions for
  one encoding. With multiple encodings you have to write multiple
  versions.
- Loss of simplification. Some things are easy in one encoding, some
  are not. Because you have to support all encodings, you have to
  always implement the difficult encoding too.


Mattias

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Mattias Gaertner

On Tue, 21 Aug 2012 15:11:56 +0100
Graeme Geldenhuys  wrote:

> On 21 August 2012 14:54, Marco van de Voort  wrote:
> >
> > Doesn't sound wise.  length(stringtype)=n should mean that the string takes
> > sizeof(char)*n bytes. (give or take the #0#0)
> 
> 
> I'm not sure what you are trying to accomplish? Give me sample code
> that will cause a problem.
> 
> In fpGUI I have UTF8Length(mystring) which returns the actual number
> of code points used - not bytes used. If you want the number of bytes
> used, simply use Length(mystring).  Use each of those at appropriate
> times based on what you want to accomplish.

length returns the number of characters.
UTF8Length the number of codepoints.
There must also be a function to return the number of bytes.
Does someone know the name?

 
> The RTL Length() function has been the source of lots of confusion to
> Delphi and FPC developers. So without an actual use-case I don't know
> what you are trying to do.


Mattias

Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-21 Thread Ivanko B

If you replied to this mail then you lost me.
 I don't understand what problem of UTF-8 for the RTL you want to point
 out. Can you explain again?
==
Substringing and similar manipulation works only via normalizing to a
fixed-width character type, which may be inefficient (especially because it
is performed for each input argument and also for the output, so the
overhead is multiplied by 3).
The ideal might be an optimized string RTL (without pre/post-normalization)
with the same set of procedures, functions and string-related classes
for UTF-8, UCS-2 and possibly UCS-4 or UTF-16, with working assignments
between them.

Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-21 Thread Mattias Gaertner

On Tue, 21 Aug 2012 19:48:12 +0500
Ivanko B  wrote:

> If you replied to this mail then you lost me.
>  I don't understand what problem of UTF-8 for the RTL you want to point
>  out. Can you explain again?
> ==
> Substringing etc manipulation only via normalizing to fixed-char type
> which may be inefficient (especially because it performs for each
> input argument & also for output - overhead multiplied by 3).
> The ideal might be optimized (without pre/post-normalization) string
> RTL with same set of procedures & functions & string related classes
> for UTF-8, USC-2 & possibly UCS-4 or UTF-16 with working assignments
> between them.

Do you mean replacing a character in a UCS-2/UCS-4 string can be
implemented more efficiently than in a UTF-8/UTF-16 string?
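For readers who want to see what is being compared, here is a Python sketch of both operations (hypothetical helper names; valid indexes and a single replacement code point are assumed, and no claim about real-world speed is made). With fixed-width UTF-32 the slot is overwritten in place; with UTF-8 the position must be found by scanning, and the buffer may grow or shrink:

```python
def replace_at_utf32(buf: bytearray, index: int, ch: str) -> None:
    """Overwrite code point number `index` in place; the size never changes."""
    buf[4 * index:4 * index + 4] = ch.encode("utf-32-le")

def replace_at_utf8(buf: bytearray, index: int, ch: str) -> None:
    """Scan to code point `index`, then splice; the tail may have to move."""
    pos = 0
    for _ in range(index):
        pos += 1
        while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
            pos += 1                      # skip continuation bytes
    end = pos + 1
    while end < len(buf) and (buf[end] & 0xC0) == 0x80:
        end += 1
    buf[pos:end] = ch.encode("utf-8")

u32 = bytearray("Kanal".encode("utf-32-le"))
u8 = bytearray("Kanal".encode("utf-8"))
replace_at_utf32(u32, 1, "ä")
replace_at_utf8(u8, 1, "ä")
print(u32.decode("utf-32-le"), len(u32))
print(u8.decode("utf-8"), len(u8))
```

Note that UCS-2 is fixed-width only because it cannot represent code points outside the BMP; UTF-16 needs the same kind of scan as UTF-8 once surrogate pairs are involved.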


Mattias

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Mattias Gaertner

On Tue, 21 Aug 2012 15:38:31 +0200
"Ludo Brands"  wrote:

>  
> > > There is the large category of network apps. Most protocols 
> > are utf8 
> > > or have a clear preference for utf8 (json for example). 
> > Databases are 
> > > an extension of that and have the additional complication that they 
> > > can mix codepages at any level. These apps can be quite 
> > sensitive to 
> > > conversion overhead.
> > 
> > Well, without more details the advice is probably to use UTF8String.
> > 
> 
> A more detailed example then. A web application that fills in HTML templates
> with variable data coming from fe. a database or whatever. HTML is all
> ASCII. So parsing an iso-8859-1 or UTF8 template and making ASCII tag
> substitutions in both CP is exactly the same. The ascii uppercase works nice
> in both and tags are case insensitive at virtually no cost. The problem
> starts when a string is supposed to have a codepage and conversions are made
> before functions like concatinating strings, uppercase, pos, etc. See
> http://bugs.freepascal.org/view.php?id=22501.

Bug 22501 is about string constants and mismatch of CPs.
Note that this is a different beast than dynamic data coming
from files, sockets or db.


> Detecting code page of the
> template and setting the string cp accordingly? Detecting code pages can be
> quite expensive.

And it is often impossible.

> Even in the utf8 only case, converting all to utf16 to do some basic string
> manipulations as suggested can lead quickly to bottlenecks for such basic
> string manipulations in high volume web servers. I understand one can not
> make an rtl for every code page but the question was to list application
> areas where string conversions could be important or critical. I'm not
> pushing one or the other solution;)

It's about string conversions that are critical and hard to fix.
I have no doubt that changing the string type means that some
functions will become slow. But it does not mean they are hard to fix.
For example you could change the string type of the time critical
strings to UTF8String to make sure that the big strings are never
converted.


Mattias

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Sven Barth


On 21.08.2012 16:44, Mattias Gaertner wrote:
> On Tue, 21 Aug 2012 15:11:56 +0100
> Graeme Geldenhuys wrote:
>
>> On 21 August 2012 14:54, Marco van de Voort wrote:
>>>
>>> Doesn't sound wise.  length(stringtype)=n should mean that the string takes
>>> sizeof(char)*n bytes. (give or take the #0#0)
>>
>> I'm not sure what you are trying to accomplish? Give me sample code
>> that will cause a problem.
>>
>> In fpGUI I have UTF8Length(mystring) which returns the actual number
>> of code points used - not bytes used. If you want the number of bytes
>> used, simply use Length(mystring).  Use each of those at appropriate
>> times based on what you want to accomplish.
>
> length returns the number of characters.
> UTF8Length the number of codepoints.
> There must also be a function to return the number of bytes.
> Does someone know the name?


Length(s) * SizeOf(s[1])
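The same arithmetic, element count times element size, can be tried out in Python with the `array` module (an analogy only; `as_units` is a hypothetical helper, and typecode "H" is assumed to be 16 bits wide, as it is on common platforms):

```python
import sys
from array import array

def as_units(s: str, typecode: str) -> array:
    """Store s as an array of 8-bit ('B') or 16-bit ('H') code units."""
    if typecode == "B":
        encoding = "utf-8"
    else:
        # Match the machine's byte order so each 16-bit unit is read correctly.
        encoding = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
    units = array(typecode)
    units.frombytes(s.encode(encoding))
    return units

s = "Grüße 😀"
for typecode in ("B", "H"):
    units = as_units(s, typecode)
    # len(units) plays the role of Length(s), itemsize that of SizeOf(s[1]).
    print(typecode, len(units), units.itemsize, len(units) * units.itemsize)
```

Neither element count equals the number of code points in `s`, which is 7.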

Regards,
Sven


Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Paul Ishenin


On 21.08.12, 23:21, Sven Barth wrote:
>> There must also be a function to return the number of bytes.
>> Does someone know the name?
>
> Length(s) * SizeOf(s[1])


It has the name ByteLength()

Best regards,
Paul Ishenin

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Mattias Gaertner

On Tue, 21 Aug 2012 17:21:27 +0200
Sven Barth  wrote:

>[...]
> > length returns the number of characters.
> > UTF8Length the number of codepoints.
> > There must also be a function to return the number of bytes.
> > Does someone know the name?
> 
> Length(s) * SizeOf(s[1])

Cheater. ;)

Mattias

Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Sven Barth


On 21.08.2012 17:27, Paul Ishenin wrote:
> On 21.08.12, 23:21, Sven Barth wrote:
>>> There must also be a function to return the number of bytes.
>>> Does someone know the name?
>>
>> Length(s) * SizeOf(s[1])
>
> It has the name ByteLength()


O.o

Again what learned...

Regards,
Sven


Re: [fpc-devel] FPC -Rintel and -alr options

2012-08-21 Thread ABorka


Yes, you are right.
I just commented out the options in fpc.cfg after I got the asm files 
(*.s) for study.



On 8/21/2012 00:53, Sven Barth wrote:

On 21.08.2012 09:35, ABorka wrote:

This is exactly what I needed.
"-alr -sr -Amasm" does it. I just put them into my "fpc.cfg" .


Why did you put this into your fpc.cfg? You are aware that with the "-s"
switch no binary code is generated? Or are you protecting that with an
IFDEF?

Regards,
Sven


Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Hans-Peter Diettrich


Marco van de Voort wrote:


utf8/16 -> ansi are a bit more involved. (since mapping many chars to few,
naive implementation requiring large lookup sets)


A single 256 element array can be used for both directions. In Ansi to 
Unicode the char value is used to index the array of Unicode values, 
otherwise the given Unicode value is searched in the array.
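
To make the idea concrete, here is a minimal sketch of such a table, assuming a single-byte code page whose characters all fit into one 16-bit value. The names TCodePageMap, AnsiToUnicode and UnicodeToAnsi are hypothetical and not part of the RTL, and the table content is illustrative only (identity mapping with one slot changed to the Euro sign, as in Windows-1252).

```pascal
program cpmap;
{$mode objfpc}{$H+}

type
  // Hypothetical table type: one 16-bit Unicode value per single-byte character.
  TCodePageMap = array[Byte] of Word;

// Ansi -> Unicode: the byte value indexes the table directly.
function AnsiToUnicode(const M: TCodePageMap; c: Byte): Word;
begin
  Result := M[c];
end;

// Unicode -> Ansi: linear search of the same table for the given value.
function UnicodeToAnsi(const M: TCodePageMap; u: Word; out c: Byte): Boolean;
var
  k: Integer;
begin
  for k := Low(M) to High(M) do
    if M[k] = u then
    begin
      c := Byte(k);
      Exit(True);
    end;
  c := Ord('?');  // no mapping found: substitute a placeholder
  Result := False;
end;

var
  Map: TCodePageMap;
  i: Integer;
  b: Byte;
begin
  // Illustrative content only: identity mapping with one slot overridden.
  for i := Low(Map) to High(Map) do
    Map[i] := i;
  Map[$80] := $20AC;  // Euro sign, as in Windows-1252

  WriteLn(HexStr(AnsiToUnicode(Map, $80), 4));
  if UnicodeToAnsi(Map, $20AC, b) then
    WriteLn(HexStr(b, 2));
end.
```

The reverse direction costs up to 256 comparisons per character in this form, which is the price of not keeping a second, larger lookup structure.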


DoDi


Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Hans-Peter Diettrich


Mattias Gaertner wrote:


length returns the number of characters.

the number of elements, which can be of any size (in arrays in general).


UTF8Length the number of codepoints.
There must also be a function to return the number of bytes.
Does someone know the name?


Length(s)*sizeof(s[1])
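
A small sketch of what that expression yields for two string types, assuming an FPC version that has UnicodeString. The program and variable names are made up for illustration. SizeOf(s[1]) is evaluated at compile time, so it is safe even for an empty string.

```pascal
program bytelen;
{$mode objfpc}{$H+}

var
  a: AnsiString;     // elements are 1 byte each
  u: UnicodeString;  // elements are 2 bytes each

begin
  a := 'abc';
  u := 'abc';

  // Length() counts elements; multiplying by the element size gives bytes.
  WriteLn(Length(a) * SizeOf(a[1]));  // 3 elements * 1 byte
  WriteLn(Length(u) * SizeOf(u[1]));  // 3 elements * 2 bytes
end.
```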

DoDi


Re: [fpc-devel] Unicode resource strings

2012-08-21 Thread Hans-Peter Diettrich


Graeme Geldenhuys wrote:

On 21 August 2012 13:03, Michael Schnell wrote:

By "not so often" I meant program runtime: it is usually not called in a
tight, long-running loop.


I have a program that does exactly that... Loads files to do CRC
checking to see what changed. It's a recursive find-all that goes
through 100k or more files. It's already a slow process on non-SSD
drives processing 12GB or more of data, so adding the multiple
unnecessary string conversions which will be enforced on Linux users
would make that even worse.


IMO string conversion and CRC are mutually exclusive.

DoDi


Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-21 Thread Hans-Peter Diettrich


Martin Schreiber wrote:

On 21.08.2012 12:52, Hans-Peter Diettrich wrote:


The good ole Pos() can do that, why search for more complicated
implementations?

You still try to use old coding patterns which are simply inappropriate
for dealing with Unicode strings. Why make a distinction between
searching for a single character or multiple characters, when it's known
that one character can require multiple bytes or words in UTF-8/16?

I wrote "known German characters" and "known Chinese symbols of the BMP" 
for example character constants. If you want to read some examples of 
problems with utf-8 especially for pupils and Pascal beginners read the 
German Lazarus Forum or freepascal.ru. Why should we design programming 
so that it complicates the work for them? Anyway, I don't care, do what 
you want but please implement Unicode resource strings in FPC compiler.


You still miss the point. Why deal with single characters, by index, 
when working with substrings also covers the single-character use?


DoDi


Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-21 Thread Ivanko B

Why deal with single characters, by index, when working with
substrings also covers the single-character use?

Possibly because it is tens of times slower when multiple chars are processed.

Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-21 Thread Ivanko B

> Do you mean replacing a character in a UCS-2/UCS-4 string can be
> implemented more efficiently than in a UTF-8/UTF-16 string?
>

Sure, just scan the string char by char as array elements and replace
matches as they are encountered. Like working with integer arrays.
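
A minimal sketch of that scan, assuming every character occupies exactly one WideChar element (UCS-2 style, no surrogate pairs). ReplaceChar is a hypothetical helper, not an RTL routine.

```pascal
program replchar;
{$mode objfpc}{$H+}

// Replace every occurrence of OldCh with NewCh in place.
// Assumes one element per character, so indexing is enough.
procedure ReplaceChar(var s: UnicodeString; OldCh, NewCh: WideChar);
var
  i: Integer;
begin
  for i := 1 to Length(s) do
    if s[i] = OldCh then
      s[i] := NewCh;
end;

var
  s: UnicodeString;
begin
  s := 'a-b-c';
  ReplaceChar(s, '-', '+');
  WriteLn(s = 'a+b+c');  // TRUE if every '-' was replaced
end.
```

The same loop on UTF-8 or UTF-16 data is only correct when both characters are known to take a single element; otherwise the string length can change and a substring replace is needed.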
