Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-23 Thread Vincent Snijders
2012/8/23 Hans-Peter Diettrich drdiettri...@aol.com:
 Daniël Mantione schrieb:

 Op Wed, 22 Aug 2012, schreef Felipe Monteiro de Carvalho:

 On Wed, Aug 22, 2012 at 9:36 PM, Martin Schreiber mse00...@gmail.com
 wrote:

 I am not talking about Unicode. I am talking about day by day
 programming of
 an average programmer where life is easier with utf-16 than with
 utf-8.
 Unicode is not done by using pos() instead of character indexes.
 I think everybody knows my opinion, I stop now.


 Please be clear in the terminology. Don't say life is easier with
 utf-16 than with utf-8 if you don't mean utf-16 as it is. Just say
 life is easier with ucs-2 than with utf-8, then everything is clear
 that you are talking about ucs2 and not true utf-16.


 That is nonsense.

 * There are no whitespace characters beyond widechar range. This means you
   can write a routine to split a string into words without bothering about
   surrogate pairs and remain fully UTF-16 compliant.


 How is this different for UTF-8?


There are white space characters beyond the char range, for example
U+00A0 no-break space.

So in UTF-8 a white space character can be larger than 1 byte; in
UTF-16 they are all 2 bytes. That is the difference.
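
The size difference can be checked directly. The following is only a minimal sketch, assuming Free Pascal's UnicodeString, UTF8String and UTF8Encode; the program name and the choice of three sample white space characters are made up for illustration.

```pascal
program SpaceSizes;
{$mode objfpc}{$H+}

const
  // Three white space code points from the BMP:
  // space, no-break space, ideographic space.
  Spaces: array[0..2] of Word = ($0020, $00A0, $3000);

var
  I: Integer;
  U16: UnicodeString;
  U8: UTF8String;
begin
  for I := Low(Spaces) to High(Spaces) do
  begin
    // Build a one-character UTF-16 string from the numeric value, so the
    // example does not depend on the encoding of the source file.
    U16 := WideChar(Spaces[I]);
    U8 := UTF8Encode(U16);
    WriteLn('U+', HexStr(Spaces[I], 4), ': ',
      Length(U8), ' byte(s) in UTF-8, ',
      Length(U16) * SizeOf(WideChar), ' byte(s) in UTF-16');
  end;
end.
```

The three characters take 1, 2 and 3 bytes in UTF-8, while each is a single 2-byte code unit in UTF-16.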

Vincent
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel


Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-23 Thread Daniël Mantione



Op Thu, 23 Aug 2012, schreef Hans-Peter Diettrich:


Daniël Mantione schrieb:

Op Wed, 22 Aug 2012, schreef Felipe Monteiro de Carvalho:

On Wed, Aug 22, 2012 at 9:36 PM, Martin Schreiber mse00...@gmail.com 
wrote:
I am not talking about Unicode. I am talking about day by day programming
of
an average programmer where life is easier with utf-16 than with
utf-8.

Unicode is not done by using pos() instead of character indexes.
I think everybody knows my opinion, I stop now.


Please be clear in the terminology. Don't say life is easier with
utf-16 than with utf-8 if you don't mean utf-16 as it is. Just say
life is easier with ucs-2 than with utf-8, then everything is clear
that you are talking about ucs2 and not true utf-16.


That is nonsense.

* There are no whitespace characters beyond widechar range. This means you
  can write a routine to split a string into words without bothering about
  surrogate pairs and remain fully UTF-16 compliant.


How is this different for UTF-8?


Your answer exactly demonstrates how UTF-16 can result in better Unicode 
support: You probably consider the space the only white-space character 
and would have written code that only handles the space. In Unicode you 
have the space, the non-breaking space, the half-space and probably a few 
more that I am missing.
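
A word counter that works on UTF-16 code units might look like the sketch below. It is not RTL code: IsSpace16 and CountWords are hypothetical helpers, and the white space list is deliberately incomplete. It relies on the claim made above that no white space character lies outside the widechar range, so the two halves of a surrogate pair are never mistaken for a separator.

```pascal
program SplitWords;
{$mode objfpc}{$H+}

const
  // 'a', NO-BREAK SPACE, surrogate pair for U+1D11E, SPACE, 'b'
  Units: array[0..5] of Word = ($0061, $00A0, $D834, $DD1E, $0020, $0062);

// Hypothetical helper: an incomplete list of BMP white space characters.
function IsSpace16(C: WideChar): Boolean;
begin
  case Ord(C) of
    $0009..$000D,   // tab .. carriage return
    $0020,          // space
    $00A0,          // no-break space
    $2000..$200A,   // en quad .. hair space
    $202F,          // narrow no-break space
    $3000:          // ideographic space
      Result := True;
  else
    Result := False;
  end;
end;

// Counts words by scanning code units. Surrogates are never white space,
// so a surrogate pair simply stays inside the word it belongs to.
function CountWords(const S: UnicodeString): Integer;
var
  I: Integer;
  InWord: Boolean;
begin
  Result := 0;
  InWord := False;
  for I := 1 to Length(S) do
    if IsSpace16(S[I]) then
      InWord := False
    else if not InWord then
    begin
      InWord := True;
      Inc(Result);
    end;
end;

var
  S: UnicodeString;
  I: Integer;
begin
  SetLength(S, Length(Units));
  for I := 0 to High(Units) do
    S[I + 1] := WideChar(Units[I]);
  WriteLn(CountWords(S));  // three words: 'a', U+1D11E, 'b'
end.
```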



* There are no characters with upper/lowercase beyond widechar range.
  That means if you write code that deals with character case you don't
  need to bother with surrogate pairs and still remain fully UTF-16
  compliant.


How expensive is a Unicode Upper/LowerCase conversion per se?


I'd expect a conversion would be quite a bit faster in UTF-16, as it can be a
table lookup per character rather than a decode/re-encode per character.
But it's not about conversion per se, everyday code deals with
character case in a lot more situations.



* You can group Korean letters into Korean syllables, again without
  bothering about surrogate pairs, as Korean is one of the many languages
  that is entirely in widechar range.


The same applies to English and UTF-8 ;-)
Selected languages can be handled in special ways, but not all.


I'd disagree, because there are quite a few codepoints that can be used
for English texts beyond #128, e.g. currency symbols or ligatures.
But suppose I followed your reasoning; then the list of languages your
Unicode-aware software will handle properly is:


* English

If you are interested in proper multi-lingual support... you won't get very
far. In UTF-16 only a few of the 6000 languages in the world need
codepoints beyond the basic multi-lingual plane. In other words you get very far.


You mentioned Korean syllables splitting - is this a task occurring often in
Korean programs?


Yes, in Korean this is very important, because Korean letters are written 
in syllables, so it's a very common conversion. There are both Unicode 
points for letters and for syllables.


For example, when people type letters on the keyboard, you
receive the Unicode points for the letters. If you send those directly to the
screen you see the individual letters; that's not correct Korean writing.
You want to convert to syllables and send the Unicode points for syllables
to the screen.
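
As an illustration of that conversion, here is a sketch of the arithmetic the Unicode standard defines for composing Hangul syllables from conjoining jamo. ComposeSyllable is a hypothetical helper that does no range checking, and it assumes the input is conjoining jamo (U+1100 block); keyboard input that arrives as compatibility jamo (U+3130 block) would need a mapping step first. Everything involved is in widechar range.

```pascal
program ComposeHangul;
{$mode objfpc}{$H+}

const
  SBase  = $AC00;  // first precomposed syllable
  LBase  = $1100;  // first leading consonant
  VBase  = $1161;  // first vowel
  TBase  = $11A7;  // one before the first trailing consonant
  VCount = 21;     // number of vowels
  TCount = 28;     // number of trailing consonants, plus "none"

// Composes a leading consonant L, a vowel V and an optional trailing
// consonant T (pass 0 for "no trailing consonant") into one syllable.
// Sketch only: the arguments are not validated.
function ComposeSyllable(L, V, T: Cardinal): Cardinal;
var
  LIndex, VIndex, TIndex: Cardinal;
begin
  LIndex := L - LBase;
  VIndex := V - VBase;
  if T = 0 then
    TIndex := 0
  else
    TIndex := T - TBase;
  Result := SBase + (LIndex * VCount + VIndex) * TCount + TIndex;
end;

begin
  // U+1112, U+1161, U+11AB compose to the syllable U+D55C.
  WriteLn(HexStr(LongInt(ComposeSyllable($1112, $1161, $11AB)), 4));
end.
```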


At the beginning of computer-based publishing most German texts were hard to
read, due to many wordbreak errors.


In western-languages, syllables are only important for word-breaks and our 
publishing software contains advanced syllable splitting algorithms. You'd 
better not use that code for Korean texts, because there exists no need to 
break words in that script.


In general... different language, different text processing algorithms...

But another point becomes *really* important, when libraries with
the aforementioned Unicode functions are used: The application and libraries
should use the *same* string encoding, to prevent frequent conversions with
every function call. This suggests using the library(=platform) specific
string encoding, which can be different on e.g. Windows and Linux.


Consequently a cross-platform program should be as insensitive as possible to
encodings, and the whole UTF-8/16 discussion turns out to be purely academic.
This leads again to a different issue: should we declare a string type
dedicated to Unicode text processing, which can vary depending on the
platform/library encoding? Then everybody can decide whether to use one
string type (RTL/FCL/LCL compatible) for general tasks, and the library
compatible type for text processing?


No disagreement here, if all your libraries are UTF-8, you don't want to
convert everything. So if possible, write code to be as string-type
agnostic as possible.


Sometimes, however, you do need to look inside a string, and it does 
help to have an easy encoding then.


Or should we bite the bullet and support different flavors of the FPC 
libraries, for best performance on any platform? This would also leave it to 
the user to select his preferred encoding, 

Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-22 Thread Marco van de Voort
In our previous episode, Ivanko B said:
  Do you mean replacing a character in an UCS-2/UCS-4 string can be
  implemented more efficiently than in an UTF-8/UTF-16 string?
 
 
 Sure, just scan the string char by char as array elements and replace
 as matches encounter. Like working with integer arrays.

The scanning is not what is expensive. The change of the match is.

In both cases you need to reconstruct at least a 32-bit codepoint and match
that against other codepoints in some datastructure.


Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-22 Thread Mattias Gaertner
On Wed, 22 Aug 2012 09:34:33 +0500
Ivanko B ivankob4m...@gmail.com wrote:

  Do you mean replacing a character in an UCS-2/UCS-4 string can be
  implemented more efficiently than in an UTF-8/UTF-16 string?
 
 
 Sure, just scan the string char by char as array elements and replace
 as matches encounter. Like working with integer arrays.

Just some notes:
Often you need to replace ASCII characters like new lines, spaces or
semicolon. These can be replaced in UTF-8/UTF-16 as easily.
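
The ASCII case can be sketched as follows for UTF-8. ReplaceAsciiChar is a hypothetical helper; the string is kept in a plain AnsiString and filled byte by byte, so no codepage conversion is involved. The point is that every byte of a multi-byte UTF-8 sequence is $80 or above, so a byte below $80 is always a complete character.

```pascal
program ReplaceAscii;
{$mode objfpc}{$H+}

const
  // "a;" followed by U+00E4 (bytes C3 A4) and ";b", as UTF-8 bytes
  Bytes: array[0..5] of Byte = ($61, $3B, $C3, $A4, $3B, $62);

// Replaces one ASCII character by another, byte by byte and in place.
// Does nothing unless both characters are ASCII.
procedure ReplaceAsciiChar(var S: AnsiString; OldCh, NewCh: AnsiChar);
var
  I: Integer;
begin
  if (Ord(OldCh) >= $80) or (Ord(NewCh) >= $80) then
    Exit;
  for I := 1 to Length(S) do
    if S[I] = OldCh then
      S[I] := NewCh;
end;

var
  S: AnsiString;
  I: Integer;
begin
  SetLength(S, Length(Bytes));
  for I := 0 to High(Bytes) do
    S[I + 1] := AnsiChar(Bytes[I]);

  ReplaceAsciiChar(S, ';', ',');

  // Print the bytes in hex; the C3 A4 sequence is left untouched.
  for I := 1 to Length(S) do
    Write(HexStr(Ord(S[I]), 2), ' ');
  WriteLn;
end.
```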

If you want to replace non ASCII characters for example to normalize
diacritical characters then even in UCS-2/UCS-4 you have to replace
several codepoints with one.

UCS-2 does not matter for the RTL, which must work with the full
Unicode range. And UCS-4 is a waste of space for big texts.

How many functions have you written that replace
characters in a UTF-8/UTF-16 string with different size characters?

Mattias


Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-22 Thread Martin Schreiber
On Wednesday 22 August 2012 02:01:09 Hans-Peter Diettrich wrote:

 You still miss the point. Why deal with single characters, by index,
 when working with substrings also covers the single-character use?

Why not if it is faster, simpler and more intuitive for beginners?

Martin


Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-22 Thread Ivanko B
How many functions have you written that replace
 characters in a UTF-8/UTF-16 string with different size characters?
=
Me adore UTF-8 - a great way of storing Unicode text, using non-Latin
passwords, ...! But if we have the RTL string type UTF-8 then we should
also have the whole RTL with optimized functions, procedures & classes for
it. Same for UCS-2 (approx. 50% finished), UCS-4... That is, if we have a
type in the RTL then we should also have its FULL support.


Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-22 Thread Hans-Peter Diettrich

Ivanko B schrieb:

Do you mean replacing a character in an UCS-2/UCS-4 string can be
implemented more efficiently than in an UTF-8/UTF-16 string?



Sure, just scan the string char by char as array elements and replace
as matches encounter. Like working with integer arrays.


This applies only to UCS4/UTF-32. In all other cases the overall byte
size of both characters may vary, due to escape sequences/surrogate
pairs. Ligatures also should be considered, so that every simplified
approach risks being buggy. At least the size of both characters
should be compared, and a StringReplace should be used when both differ.
But the same applies to StringReplace as well, where substrings of the
same size can be replaced in-place :-)
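
The size comparison described here could be sketched like this for UTF-8 bytes held in an AnsiString. ReplaceEncoded and FromBytes are hypothetical helpers, not RTL functions; StringReplace is the SysUtils routine. The sketch assumes OldSeq and NewSeq are each the complete encoded form of one character.

```pascal
program ReplaceSeq;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: builds a string from raw bytes, without any
// codepage conversion.
function FromBytes(const B: array of Byte): AnsiString;
var
  I: Integer;
begin
  SetLength(Result, Length(B));
  for I := 0 to High(B) do
    Result[I + 1] := AnsiChar(B[I]);
end;

// Replaces every occurrence of OldSeq by NewSeq. Equal sizes are
// overwritten in place; different sizes are handed to StringReplace.
function ReplaceEncoded(const S, OldSeq, NewSeq: AnsiString): AnsiString;
var
  I, N: Integer;
begin
  N := Length(OldSeq);
  if N = 0 then
    Exit(S);
  if N <> Length(NewSeq) then
    Exit(StringReplace(S, OldSeq, NewSeq, [rfReplaceAll]));

  Result := S;
  UniqueString(Result);  // make sure we do not write into a shared string
  I := 1;
  while I <= Length(Result) - N + 1 do
    if CompareByte(Result[I], OldSeq[1], N) = 0 then
    begin
      Move(NewSeq[1], Result[I], N);
      Inc(I, N);
    end
    else
      Inc(I);
end;

var
  Src: AnsiString;
begin
  // "x", U+00E4 (C3 A4), "y"
  Src := FromBytes([$78, $C3, $A4, $79]);
  // Same size: U+00E4 -> U+00F6 (C3 B6), the length stays 4.
  WriteLn(Length(ReplaceEncoded(Src, FromBytes([$C3, $A4]), FromBytes([$C3, $B6]))));
  // Different size: U+00E4 -> "a", the length shrinks to 3.
  WriteLn(Length(ReplaceEncoded(Src, FromBytes([$C3, $A4]), FromBytes([$61]))));
end.
```

Ligatures and combining sequences, mentioned above, are not handled; the sketch only shows the size test.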


DoDi



Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-22 Thread Hans-Peter Diettrich

Ivanko B schrieb:

Why deal with single characters, by index, when working with
substrings also covers the single-character use?

Possibly because it is ten times slower when multiple chars are processed.


Not really. Replacing the same amount of bytes can *always* be done 
in-place.


DoDi



Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-22 Thread Hans-Peter Diettrich

Martin Schreiber schrieb:

On Wednesday 22 August 2012 02:01:09 Hans-Peter Diettrich wrote:

You still miss the point. Why deal with single characters, by index,
when working with substrings also covers the single-character use?


Why not if it is faster, simpler and more intuitive for beginners?


Because they will find out soon that such a simplified approach is
inappropriate in working with Unicode. English people had a hard time
accepting the existence of larger character sets (than ASCII), and
considered it other people's problem. But when talking Unicode it's
*your* problem if your procedures fail on foreign languages or
codepages. Ignoring ligatures or other foreign languages' constructs and
habits will bite you, sooner or later.


DoDi



Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-22 Thread Ivanko B
Ignoring ligatures or other foreign languages' constructs and habits
will bite you, sooner or later.
==
To handle this, fixed-char encodings of constantly growing size exist.


Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-22 Thread Felipe Monteiro de Carvalho
On Wed, Aug 22, 2012 at 9:36 PM, Martin Schreiber mse00...@gmail.com wrote:
 I am not talking about Unicode. I am talking about day by day programming of
 an average programmer where life is easier with utf-16 than with utf-8.
 Unicode is not done by using pos() instead of character indexes.
 I think everybody knows my opinion, I stop now.

Please be clear in the terminology. Don't say life is easier with
utf-16 than with utf-8 if you don't mean utf-16 as it is. Just say
life is easier with ucs-2 than with utf-8, then everything is clear
that you are talking about ucs2 and not true utf-16.

-- 
Felipe Monteiro de Carvalho


Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-22 Thread Hans-Peter Diettrich

Daniël Mantione schrieb:

Op Wed, 22 Aug 2012, schreef Felipe Monteiro de Carvalho:

On Wed, Aug 22, 2012 at 9:36 PM, Martin Schreiber mse00...@gmail.com 
wrote:
I am not talking about Unicode. I am talking about day by day
programming of
an average programmer where life is easier with utf-16 than with
utf-8.

Unicode is not done by using pos() instead of character indexes.
I think everybody knows my opinion, I stop now.


Please be clear in the terminology. Don't say life is easier with
utf-16 than with utf-8 if you don't mean utf-16 as it is. Just say
life is easier with ucs-2 than with utf-8, then everything is clear
that you are talking about ucs2 and not true utf-16.


That is nonsense.

* There are no whitespace characters beyond widechar range. This means you
  can write a routine to split a string into words without bothering about
  surrogate pairs and remain fully UTF-16 compliant.


How is this different for UTF-8?


* There are no characters with upper/lowercase beyond widechar range.
  That means if you write code that deals with character case you don't
  need to bother with surrogate pairs and still remain fully UTF-16
  compliant.


How expensive is a Unicode Upper/LowerCase conversion per se?


* You can group Korean letters into Korean syllables, again without
  bothering about surrogate pairs, as Korean is one of the many languages
  that is entirely in widechar range.


The same applies to English and UTF-8 ;-)
Selected languages can be handled in special ways, but not all.

Many more examples exist. It's true there exist also many examples where 
surrogates do need to be handled.


But... even if a certain piece of code doesn't handle e.g. Egyptian
hieroglyphs correctly, there is no guarantee that UTF-8 code would,
since these scripts have many properties that are not compatible with
text processing code designed for western languages; they need a lot of
custom code.


That's it!

In everyday coding I'm happy with AnsiStrings, covering English and
German. But when I want to deal with Unicode, except for display-only 
purposes, I want to do it right and in the most simple way. This means 
that I'd call the functions existing (in FPC?) for detecting 
non-breakable character ranges, upper/lower case conversion etc., and 
use (sub)strings all over to get rid of any byte/wordcount issues.



You mentioned Korean syllables splitting - is this a task occurring often
in Korean programs? I don't remember when I *ever* wanted to break
German or English words into syllables. At the beginning of computer-based
publishing most German texts were hard to read, due to many wordbreak
errors. Finding syllables (as possible breakpoints), in particular in
foreign languages, still requires using the appropriate library functions,
which do (hopefully) proper disambiguation. In my code I'd call the
GetSyllable function, and then split the string at the given points -
regardless of any encoding. Or, as I really did, break strings only at
word boundaries, again insensitive to any encoding.


Also breaking strings for display purposes, at a given pixel count, is 
expensive. It's not sufficient to find possible breakpoints, it's also 
required to narrow down the right breakpoint by repetitive tries. It's 
not a good idea to simply add the width of individual characters, 
instead the pixel width of every possible substring must be determined 
individually. This means that the efficiency does not depend much on the 
string encoding.



But another point becomes *really* important, when libraries with
the aforementioned Unicode functions are used: The application and
libraries should use the *same* string encoding, to prevent frequent
conversions with every function call. This suggests using the
library(=platform) specific string encoding, which can be different on
e.g. Windows and Linux.


Consequently a cross-platform program should be as insensitive as
possible to encodings, and the whole UTF-8/16 discussion turns out to be
purely academic. This leads again to a different issue: should we
declare a string type dedicated to Unicode text processing, which can
vary depending on the platform/library encoding? Then everybody can
decide whether to use one string type (RTL/FCL/LCL compatible) for
general tasks, and the library compatible type for text processing?


Or should we bite the bullet and support different flavors of the FPC 
libraries, for best performance on any platform? This would also leave 
it to the user to select his preferred encoding, stopping any UTF 
discussion immediately :-]


DoDi



Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-22 Thread Martin Schreiber
On Wednesday 22 August 2012 21:47:53 Felipe Monteiro de Carvalho wrote:
 On Wed, Aug 22, 2012 at 9:36 PM, Martin Schreiber mse00...@gmail.com 
wrote:
  I am not talking about Unicode. I am talking about day by day programming
  of an average programmer where life is easier with utf-16 than with
  utf-8. Unicode is not done by using pos() instead of character indexes. I
  think everybody knows my opinion, I stop now.

 Please be clear in the terminology. Don't say life is easier with
 utf-16 than with utf-8 if you don't mean utf-16 as it is. Just say
 life is easier with ucs-2 than with utf-8, then everything is clear
 that you are talking about ucs2 and not true utf-16.

It is with utf-16 and known character constants of the BMP. Please try it.

Martin


Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-21 Thread Graeme Geldenhuys
Hi,

On 20 August 2012 23:18, Hans-Peter Diettrich drdiettri...@aol.com wrote:
 The Delphi developers wanted to implement what you suggest, but dropped that
 approach later again.

When Embarcadero implemented Unicode support, Delphi was a pure
Windows application. They had no need to think of anything other than
what Windows supports. Not to mention that they were on a tight budget
and time constraint, because with every minute they wasted, they lost
clients moving to more up to date compilers and languages. So it was
all about getting something out as quickly as possible, and probably
cutting corners where possible.


 A character type is somewhat useless, unless all strings are UTF-32 (what's
 quite unlikely now). Instead substrings should be used, which can contain
 any number of bytes or characters.

I guess that depends on how you define the Char type. Is it meant to
hold a single Unicode codepoint, or a single printable character. If
the latter, then probably a bigger Char type is required.


 You also have to explain what String[4] means in an Unicode environment.

The String[] syntax in Object Pascal means you are defining a
shortstring type (irrespective of compiler mode), thus an array of
bytes. In this case 4 bytes are used to hold any Unicode codepoint.


 Q: Did you ever read about the new string implementation of FPC?

I have read some of the message threads that went around in fpc-devel,
I also worked on the cp branch before it was merged with Trunk. If you
have any other documentation in mind, please post the URL and I'll
happily take a look.



-- 
Regards,
  - Graeme -


___
fpGUI - a cross-platform Free Pascal GUI toolkit
http://fpgui.sourceforge.net


Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-21 Thread Graeme Geldenhuys
On 21 August 2012 07:10, Ivanko B ivankob4m...@gmail.com wrote:
 How about supporting in the RTL all versions of UCS-2 & UTF-16 (for
 fast per-char access etc optimizations) and UTF-8 (for unlimited
 number of alphabets) ?

All the "access a char by index into a string" code I have seen works,
99.99% of the time, in a sequential manner. For that reason there is no
speed difference between using a UTF-16 or UTF-8 encoded string. Both
can be coded equally efficiently.

-- 
Regards,
  - Graeme -




Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-21 Thread Martin Schreiber

Am 21.08.2012 09:55, schrieb Graeme Geldenhuys:

On 21 August 2012 07:10, Ivanko B ivankob4m...@gmail.com wrote:

How about supporting in the RTL all versions of UCS-2 & UTF-16 (for
fast per-char access etc optimizations) and UTF-8 (for unlimited
number of alphabets) ?


All the "access a char by index into a string" code I have seen works,
99.99% of the time, in a sequential manner. For that reason there is no
speed difference between using a UTF-16 or UTF-8 encoded string. Both
can be coded equally efficiently.

Graeme, this is simply not true. When searching for known German characters
in a UnicodeString the program can use the simple approach by character
(code unit) index. It is even possible for known Chinese symbols of the
BMP. And a simple if for surrogate pairs is more efficient than a 4-stage
case for utf-8.
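
To make the comparison concrete, here is a sketch of both decoding steps. NextCodePoint16 and NextCodePoint8 are hypothetical helpers; neither validates its input, so they assume well-formed UTF-16 and UTF-8. The sketch only shows the shape of the two branches and says nothing about measured speed.

```pascal
program DecodeStep;
{$mode objfpc}{$H+}

// Reads the code point starting at S[I] and advances I past it.
// One "if" is enough: either a single code unit or a surrogate pair.
function NextCodePoint16(const S: UnicodeString; var I: Integer): Cardinal;
begin
  Result := Ord(S[I]);
  Inc(I);
  if (Result >= $D800) and (Result <= $DBFF) and (I <= Length(S)) then
  begin
    Result := $10000 + ((Result - $D800) shl 10) + (Cardinal(Ord(S[I])) - $DC00);
    Inc(I);
  end;
end;

// The same for UTF-8 bytes: four cases, selected by the lead byte.
function NextCodePoint8(const S: AnsiString; var I: Integer): Cardinal;
var
  B: Cardinal;
  Extra: Integer;
begin
  B := Ord(S[I]);
  Inc(I);
  case B of
    $00..$7F: begin Result := B;         Extra := 0; end;
    $C0..$DF: begin Result := B and $1F; Extra := 1; end;
    $E0..$EF: begin Result := B and $0F; Extra := 2; end;
  else
    begin Result := B and $07; Extra := 3; end;
  end;
  // Append six payload bits from each continuation byte.
  while (Extra > 0) and (I <= Length(S)) do
  begin
    Result := (Result shl 6) or (Cardinal(Ord(S[I])) and $3F);
    Inc(I);
    Dec(Extra);
  end;
end;

var
  U16: UnicodeString;
  U8: AnsiString;
  I: Integer;
begin
  // U+1D11E as a UTF-16 surrogate pair ...
  SetLength(U16, 2);
  U16[1] := WideChar($D834);
  U16[2] := WideChar($DD1E);
  // ... and as four UTF-8 bytes.
  SetLength(U8, 4);
  U8[1] := AnsiChar($F0);
  U8[2] := AnsiChar($9D);
  U8[3] := AnsiChar($84);
  U8[4] := AnsiChar($9E);

  I := 1;
  WriteLn(HexStr(LongInt(NextCodePoint16(U16, I)), 5));  // 1D11E
  I := 1;
  WriteLn(HexStr(LongInt(NextCodePoint8(U8, I)), 5));    // 1D11E
end.
```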


Martin


Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-21 Thread Ivanko B
For that reason there is no
 speed difference between using a UTF-16 or UTF-8 encoded string. Both
 can be coded equally efficiently.
==
Not in general, since UTF-8 needs error handling, replacement of
unconvertible bytes and similar operations, which may affect the initial
data and make per-byte comparison unreliable.


Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-21 Thread Ivanko B
Me always get excited how Graeme defends the solutions of his choice :)


Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-21 Thread Mattias Gaertner
On Tue, 21 Aug 2012 14:59:57 +0500
Ivanko B ivankob4m...@gmail.com wrote:

 For that reason there is no
  speed difference between using a UTF-16 or UTF-8 encoded string. Both
  can be coded equally efficiently.
 ==
 Not in general, since UTF-8 needs error handling, replacement of
 unconvertible bytes and similar operations, which may affect the initial
 data and make per-byte comparison unreliable.

For example?

Mattias


Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-21 Thread Hans-Peter Diettrich

Martin Schreiber schrieb:


All the "access a char by index into a string" code I have seen works,
99.99% of the time, in a sequential manner. For that reason there is no
speed difference between using a UTF-16 or UTF-8 encoded string. Both
can be coded equally efficiently.

Graeme, this is simply not true. When searching for known German characters
in a UnicodeString the program can use the simple approach by character
(code unit) index. It is even possible for known Chinese symbols of the
BMP. And a simple if for surrogate pairs is more efficient than a 4-stage
case for utf-8.


The good ole Pos() can do that, why search for more complicated 
implementations?


You still try to use old coding patterns which are simply inappropriate 
for dealing with Unicode strings. Why make a distinction between 
searching for a single character or multiple characters, when it's known 
that one character can require multiple bytes or words in UTF-8/16?


DoDi



Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-21 Thread Hans-Peter Diettrich

Graeme Geldenhuys schrieb:

On 20 August 2012 23:18, Hans-Peter Diettrich drdiettri...@aol.com wrote:

The Delphi developers wanted to implement what you suggest, but dropped that
approach later again.


When Embarcadero implemented Unicode support, Delphi was a pure
Windows application. They had no need to think of anything other than
what Windows supports.


So what? The poor performance of a variable char-size string type is
not related to any platform.




A character type is somewhat useless, unless all strings are UTF-32 (what's
quite unlikely now). Instead substrings should be used, which can contain
any number of bytes or characters.


I guess that depends on how you define the Char type. Is it meant to
hold a single Unicode codepoint, or a single printable character. If
the latter, then probably a bigger Char type is required.


A string can contain any number of characters, including zero. Why make
a distinction between handling a single character and handling multiple
characters? A UTF-32 Char type will require implicit conversion into a
string, before it can be used with strings of any other encoding. Not
very efficient, indeed :-(




You also have to explain what String[4] means in an Unicode environment.


The String[] syntax in Object Pascal means you are defining a
shortstring type (irrespective of compiler mode), thus an array of
bytes. In this case 4 bytes are used to hold any Unicode codepoint.


Why abuse a ShortString type, when any ordinal 4-byte value will do the
same? Did you consider that ShortStrings deserve special handling, WRT
e.g. their Length field? The 5 bytes in memory also don't fit nicely
into an aligned memory layout, and the compiler may insert range
checking and other useless code. When ordinary ShortStrings have their
own fixed encoding (CP_ACP?), you'll have to tell the compiler to ignore
all that when dealing with your Char=String[4] type :-(



Q: Did you ever read about the new string implementation of FPC?


I have read some of the message threads that went around in fpc-devel,
I also worked on the cp branch before it was merged with Trunk. If you
have any other documentation in mind, please post the URL and I'll
happily take a look.


Then read it again, you seem to have missed essential points.

DoDi



Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-21 Thread Martin Schreiber

Am 21.08.2012 12:52, schrieb Hans-Peter Diettrich:


The good ole Pos() can do that, why search for more complicated
implementations?

You still try to use old coding patterns which are simply inappropriate
for dealing with Unicode strings. Why make a distinction between
searching for a single character or multiple characters, when it's known
that one character can require multiple bytes or words in UTF-8/16?

I wrote "known German characters" and "known Chinese symbols of the BMP",
for example character constants. If you want to read some examples of
problems with utf-8, especially for pupils and Pascal beginners, read the
German Lazarus Forum or freepascal.ru. Why should we design programming
so that it complicates the work for them? Anyway, I don't care, do what
you want but please implement Unicode resource strings in the FPC compiler.


Thanks, Martin


Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-21 Thread Ivanko B
If you replied to this mail then you lost me.
 I don't understand what problem of UTF-8 for the RTL you want to point
 out. Can you explain again?
==
Substringing and similar manipulation are possible only via normalizing to
a fixed-char type, which may be inefficient (especially because it is
performed for each input argument & also for the output - overhead
multiplied by 3).
The ideal might be an optimized (without pre/post-normalization) string
RTL with the same set of procedures & functions & string related classes
for UTF-8, UCS-2 & possibly UCS-4 or UTF-16, with working assignments
between them.


Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-21 Thread Hans-Peter Diettrich

Martin Schreiber schrieb:

Am 21.08.2012 12:52, schrieb Hans-Peter Diettrich:


The good ole Pos() can do that, why search for more complicated
implementations?

You still try to use old coding patterns which are simply inappropriate
for dealing with Unicode strings. Why make a distinction between
searching for a single character or multiple characters, when it's known
that one character can require multiple bytes or words in UTF-8/16?

I wrote "known German characters" and "known Chinese symbols of the BMP",
for example character constants. If you want to read some examples of
problems with utf-8, especially for pupils and Pascal beginners, read the
German Lazarus Forum or freepascal.ru. Why should we design programming
so that it complicates the work for them? Anyway, I don't care, do what
you want but please implement Unicode resource strings in the FPC compiler.


You still miss the point. Why deal with single characters, by index, 
when working with substrings also covers the single-character use?


DoDi



[fpc-devel] Unicode in the RTL (my ideas)

2012-08-20 Thread Graeme Geldenhuys
...Continuing the discussion of a Unicode RTL in a new thread as promised...


I obviously have a lot of issues with the RTL suggestions that have been
thrown around in the past. E.g. I have heard lots about the RTL most likely
being UTF-16 only, or being split into 3 versions AnsiString, UTF-16
and UTF-8 (a maintenance nightmare). Why? Why can't you have code as
follows:


   {$IFDEF WINDOWS}
  UnicodeString = type AnsiString(CP_UTF16);
   {$ELSE}
  // probably not strictly correct, but assuming *nix here. But you get the idea
  UnicodeString = type AnsiString(CP_UTF8);
   {$ENDIF}

   String = type UnicodeString;
   Char = type String[4];   // the maximum size of a Unicode codepoint is 4 bytes


Now the RTL can have something like


 Exception = class
 public
 property Message: string read
 end;


 TStrings = class(...)
 public
 
 function Add(const AText: String): Integer;
 
 // I'm not 100% sure about the actual signature, but UTF-8 is probably a
 // very safe bet for the default, because 99.% of unicode text is stored
 // in UTF-8, and ANSI text could safely load too. If the developer knows
 // otherwise, they can always pass a different encoding constant to the
 // function.
 procedure LoadFromFile(const AFilename: String; AEncoding: TEncoding = cp_UTF8);
 end;


This should be pretty Delphi compatible, meaning Delphi code could
probably compile under FPC Windows without much need for change. As
far as I know Delphi compatibility is only meant for the Windows
platform, and Delphi code moving to FPC (not the other way round).

Also, now the locale variables can have things like the Russian
thousand separator (U+00A0) character stored in a Char too. For those
that didn't know, the Russian locale uses the non-breaking space as a
thousand separator, which in UTF-8 is 'C2 A0' (bytes) and takes up 2
bytes of memory. There might be other similar locale variables in
other languages that might take up more bytes per character.



In general encoding conversions will be reduced on each platform, or
no conversion is needed at all, because the native encoding is always
used.


-- 
Regards,
  - Graeme -




Re: [fpc-devel] Unicode in the RTL (my ideas)

2012-08-20 Thread Hans-Peter Diettrich

Graeme Geldenhuys schrieb:


   {$IFDEF WINDOWS}
  UnicodeString = type AnsiString(CP_UTF16);


AnsiStrings consist of bytes only, for good reasons (mostly 
performance). The Delphi developers wanted to implement what you 
suggest, but dropped that approach later again.


String classes have the same performance problems, so that e.g. in .NET 
it's suggested to use functions instead of string operators. In Delphi 
and FPC compiler magic is used instead of classes.



   {$ELSE}
  // probably not strictly correct, but assuming *nix here. But you get the idea
  UnicodeString = type AnsiString(CP_UTF8);
   {$ENDIF}

   String = type UnicodeString;
   Char = type String[4];   // the maximum size of a Unicode codepoint is 4 bytes


A character type is somewhat useless, unless all strings are UTF-32 
(what's quite unlikely now). Instead substrings should be used, which 
can contain any number of bytes or characters.


You also have to explain what String[4] means in an Unicode environment. 
The ShortString type does not have an encoding, and thus is deprecated 
in a Unicode environment.


Q: Did you ever read about the new string implementation of FPC?
Do you really want to reinvent the wheel, in another incompatible way?

DoDi
