Re: [fpc-devel] Unicode in the RTL (my ideas)
2012/8/23 Hans-Peter Diettrich drdiettri...@aol.com:
> Daniël Mantione schrieb:
>> * There are no whitespace characters beyond widechar range. This means you can write a routine to split a string into words without bothering about surrogate pairs and remain fully UTF-16 compliant.
> How is this different for UTF-8?

There are white-space characters beyond the char range, for example U+00A0 NO-BREAK SPACE. So in UTF-8 a white-space character can be larger than 1 byte, while in UTF-16 they are all 2 bytes. That is the difference.

Vincent
___ fpc-devel maillist - fpc-devel@lists.freepascal.org http://lists.freepascal.org/mailman/listinfo/fpc-devel
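Vincent's U+00A0 example is easy to check. A quick sketch (in Python for brevity; the byte values are language-independent):

```python
# Vincent's point: U+00A0 (no-break space) is white space beyond the
# ASCII range. In UTF-8 it occupies two bytes; in UTF-16 it is a single
# 2-byte code unit, like every other BMP character.
nbsp = "\u00a0"

utf8 = nbsp.encode("utf-8")
utf16 = nbsp.encode("utf-16-le")

print(utf8.hex())         # c2a0 -> two UTF-8 code units (bytes)
print(len(utf16) // 2)    # 1    -> one UTF-16 code unit

# An ASCII space, by contrast, is one byte in UTF-8:
print(len(" ".encode("utf-8")))  # 1
```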
Re: [fpc-devel] Unicode in the RTL (my ideas)
Op Thu, 23 Aug 2012, schreef Hans-Peter Diettrich:

> Daniël Mantione schrieb:
>> * There are no whitespace characters beyond widechar range. This means you can write a routine to split a string into words without bothering about surrogate pairs and remain fully UTF-16 compliant.
> How is this different for UTF-8?

Your answer exactly demonstrates how UTF-16 can result in better Unicode support: you probably consider the space the only white-space character, and would have written code that only handles the space. In Unicode you have the space, the non-breaking space, the half-space and probably a few more that I am missing.

>> * There are no characters with upper/lowercase beyond widechar range. That means if you write code that deals with character case, you don't need to bother with surrogate pairs and still remain fully UTF-16 compliant.
> How expensive is a Unicode upper/lowercase conversion per se?

I'd expect such a conversion to be quite a bit faster in UTF-16, as it can be a table lookup per character rather than a decode/re-encode per character. But it's not about conversion per se; everyday code deals with character case in many more situations.

>> * You can group Korean letters into Korean syllables, again without bothering about surrogate pairs, as Korean is one of the many languages that is entirely in widechar range.
> The same applies to English and UTF-8 ;-) Selected languages can be handled in special ways, but not all.

I'd disagree, because there are quite a few codepoints beyond #128 that can be used in English texts, e.g. currency symbols or ligatures. But suppose I'd follow your reasoning: then the list of languages your Unicode-aware software will handle properly is: * English. If you are interested in proper multi-lingual support... you won't get very far. In UTF-16, only a few of the 6000 languages in the world need codepoints beyond the basic multilingual plane. In other words, you get very far.

> You mentioned Korean syllable splitting - is this a task occurring often in Korean programs?

Yes, in Korean this is very important, because Korean letters are written in syllables, so it's a very common conversion. There are Unicode points both for letters and for syllables. For example, when people type letters on the keyboard, you receive the letter codepoints. If you send those directly to the screen you see the individual letters; that's not correct Korean writing. You want to convert to syllables and send the codepoints for syllables to the screen.

> At the beginning of computer-based publishing most German texts were hard to read, due to many word-break errors.

In western languages syllables are only important for word breaks, and our publishing software contains advanced syllable-splitting algorithms. You'd better not use that code for Korean texts, because there is no need to break words in that script. In general... different language, different text-processing algorithms...

> But another point becomes *really* important when libraries with the aforementioned Unicode functions are used: the application and libraries should use the *same* string encoding, to prevent frequent conversions with every function call. This suggests using the library(=platform)-specific string encoding, which can be different on e.g. Windows and Linux. Consequently a cross-platform program should be as insensitive as possible to encodings, and the whole UTF-8/16 discussion turns out to be purely academic. This leads again to a different issue: should we declare a string type dedicated to Unicode text processing, which can vary depending on the platform/library encoding? Then everybody can decide whether to use one string type (RTL/FCL/LCL compatible) for general tasks, and the library-compatible type for text processing?

No disagreement here; if all your libraries are UTF-8, you don't want to convert everything. So if possible, write code to be as string-type agnostic as possible. Sometimes, however, you do need to look inside a string, and it does help to have an easy encoding then.

> Or should we bite the bullet and support different flavors of the FPC libraries, for best performance on any platform? This would also leave it to the user to select his preferred encoding,
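The Korean letter-to-syllable conversion described above follows a fixed arithmetic rule in the Unicode standard. A minimal sketch (Python rather than Pascal only so the snippet is self-contained and runnable; the `compose` helper is illustrative, not an FPC API), covering a single leading/vowel/trailing jamo triple and ignoring complications such as conjoining clusters:

```python
# Compose one precomposed Hangul syllable from leading (L), vowel (V)
# and optional trailing (T) jamo codepoints, per the standard Hangul
# composition formula. All inputs and the result lie in the BMP, so no
# surrogate pairs arise in UTF-16 - which is Daniël's point.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28

def compose(l, v, t=None):
    l_idx = ord(l) - L_BASE
    v_idx = ord(v) - V_BASE
    t_idx = ord(t) - T_BASE if t else 0
    return chr(S_BASE + (l_idx * V_COUNT + v_idx) * T_COUNT + t_idx)

# Jamo for H, A, N typed on the keyboard -> the syllable HAN (U+D55C)
print(compose("\u1112", "\u1161", "\u11AB"))  # 한
```

This is exactly the conversion described: keyboard input arrives as letter codepoints (U+1100-U+11FF), and the display wants syllable codepoints (U+AC00-U+D7A3).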
Re: [fpc-devel] Unicode in the RTL (my ideas)
In our previous episode, Ivanko B said:
>> Do you mean replacing a character in an UCS-2/UCS-4 string can be implemented more efficiently than in an UTF-8/UTF-16 string?
> Sure, just scan the string char by char as array elements and replace as matches are encountered. Like working with integer arrays.

The scanning is not what is expensive. The change on a match is. In both cases you need to reconstruct at least a 32-bit codepoint and match that against other codepoints in some data structure.
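The point that a full codepoint must be reconstructed in either encoding can be made concrete. A sketch in Python (the helper names are invented for illustration): UTF-8 needs a multi-way branch on the lead byte, UTF-16 a single surrogate check, but both end up assembling a 32-bit value before any matching can happen.

```python
def decode_utf8_at(b, i):
    """Reconstruct the codepoint starting at byte i; return (cp, width)."""
    lead = b[i]
    if lead < 0x80:               # 1-byte sequence (ASCII)
        return lead, 1
    if lead < 0xE0:               # 2-byte sequence
        n, cp = 2, lead & 0x1F
    elif lead < 0xF0:             # 3-byte sequence
        n, cp = 3, lead & 0x0F
    else:                         # 4-byte sequence
        n, cp = 4, lead & 0x07
    for k in range(1, n):
        cp = (cp << 6) | (b[i + k] & 0x3F)
    return cp, n

def decode_utf16_at(units, i):
    """Same reconstruction for UTF-16: one branch for surrogate pairs."""
    u = units[i]
    if 0xD800 <= u < 0xDC00:      # high surrogate -> combine with low one
        return 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00), 2
    return u, 1

text = "a\u00e9\u4e2d\U0001F600"  # 1-, 2-, 3- and 4-byte UTF-8 sequences
b = text.encode("utf-8")
i, cps = 0, []
while i < len(b):
    cp, w = decode_utf8_at(b, i)
    cps.append(cp)
    i += w
print([hex(c) for c in cps])  # ['0x61', '0xe9', '0x4e2d', '0x1f600']
```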
Re: [fpc-devel] Unicode in the RTL (my ideas)
On Wed, 22 Aug 2012 09:34:33 +0500, Ivanko B ivankob4m...@gmail.com wrote:
>> Do you mean replacing a character in an UCS-2/UCS-4 string can be implemented more efficiently than in an UTF-8/UTF-16 string?
> Sure, just scan the string char by char as array elements and replace as matches are encountered. Like working with integer arrays.

Just some notes: Often you need to replace ASCII characters like newlines, spaces or semicolons. These can be replaced in UTF-8/UTF-16 just as easily. If you want to replace non-ASCII characters, for example to normalize diacritical characters, then even in UCS-2/UCS-4 you may have to replace several codepoints with one. UCS-2 does not matter for the RTL, which must work with the full Unicode range. And UCS-4 is a waste of space for big texts. How many functions have you written that replace characters in a UTF-8/UTF-16 string with different-size characters?

Mattias
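Mattias's first note holds because of how UTF-8 is constructed: every lead and continuation byte of a multi-byte sequence is >= 0x80, so an ASCII byte found in the buffer is always a whole character. A quick sketch in Python:

```python
# Replacing an ASCII character (here ';') byte-by-byte in a UTF-8
# buffer is safe: bytes below 0x80 never occur inside the multi-byte
# sequences that encode 'ï', the CJK characters or the emoji.
text = "naïve; 中文; \U0001F600;"
buf = bytearray(text.encode("utf-8"))

for i, byte in enumerate(buf):
    if byte == ord(";"):
        buf[i] = ord(",")

print(buf.decode("utf-8"))  # naïve, 中文, 😀,
```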
Re: [fpc-devel] Unicode in the RTL (my ideas)
On Wednesday 22 August 2012 02:01:09 Hans-Peter Diettrich wrote:
> You still miss the point. Why deal with single characters, by index, when working with substrings also covers the single-character use?

Why not, if it is faster, simpler and more intuitive for beginners?

Martin
Re: [fpc-devel] Unicode in the RTL (my ideas)
> How many functions have you written that replace characters in a UTF-8/UTF-16 string with different-size characters?

I adore UTF-8 - a great way of storing Unicode text, using non-Latin passwords, etc.! But if we make the RTL string type UTF-8, then we should also have a whole RTL with optimized functions, procedures and classes for it. Same for UCS-2 (approx. 50% finished), UCS-4... That is: if we have a type in the RTL, then we should also have its FULL support.
Re: [fpc-devel] Unicode in the RTL (my ideas)
Ivanko B schrieb:
>> Do you mean replacing a character in an UCS-2/UCS-4 string can be implemented more efficiently than in an UTF-8/UTF-16 string?
> Sure, just scan the string char by char as array elements and replace as matches are encountered. Like working with integer arrays.

This applies only to UCS-4/UTF-32. In all other cases the overall byte sizes of the two characters may differ, due to multi-byte sequences/surrogate pairs. Ligatures should also be considered, so every simplified approach risks being buggy. At least the sizes of both characters should be compared, and a StringReplace should be used when they differ. But the same applies to StringReplace as well, where substrings of the same size can be replaced in-place :-)

DoDi
Re: [fpc-devel] Unicode in the RTL (my ideas)
Ivanko B schrieb:
>> Why deal with single characters, by index, when working with substrings also covers the single-character use?
> Possibly because it is ten times slower when multiple chars are processed.

Not really. Replacing the same number of bytes can *always* be done in-place.

DoDi
Re: [fpc-devel] Unicode in the RTL (my ideas)
Martin Schreiber schrieb:
> On Wednesday 22 August 2012 02:01:09 Hans-Peter Diettrich wrote:
>> You still miss the point. Why deal with single characters, by index, when working with substrings also covers the single-character use?
> Why not, if it is faster, simpler and more intuitive for beginners?

Because they will soon find out that such a simplified approach is inappropriate for working with Unicode. English people had a hard time accepting the existence of character sets larger than ASCII, and considered them other people's problem. But when talking Unicode it's *your* problem if your procedures fail on foreign languages or codepages. Ignoring ligatures or other foreign languages' constructs and habits will bite you, sooner or later.

DoDi
Re: [fpc-devel] Unicode in the RTL (my ideas)
> Ignoring ligatures or other foreign languages' constructs and habits will bite you, sooner or later.

To handle this, fixed-width character encodings of ever-growing size exist.
Re: [fpc-devel] Unicode in the RTL (my ideas)
On Wed, Aug 22, 2012 at 9:36 PM, Martin Schreiber mse00...@gmail.com wrote:
> I am not talking about Unicode. I am talking about the day-by-day programming of an average programmer, where life is easier with utf-16 than with utf-8. Unicode is not done by using pos() instead of character indexes. I think everybody knows my opinion, I stop now.

Please be clear in the terminology. Don't say life is easier with utf-16 than with utf-8 if you don't mean utf-16 as it is. Just say life is easier with ucs-2 than with utf-8; then it is clear that you are talking about ucs-2 and not true utf-16.

-- Felipe Monteiro de Carvalho
Re: [fpc-devel] Unicode in the RTL (my ideas)
Daniël Mantione schrieb:
> Op Wed, 22 Aug 2012, schreef Felipe Monteiro de Carvalho:
>> Please be clear in the terminology. Don't say life is easier with utf-16 than with utf-8 if you don't mean utf-16 as it is. Just say life is easier with ucs-2 than with utf-8; then it is clear that you are talking about ucs-2 and not true utf-16.
> That is nonsense.
> * There are no whitespace characters beyond widechar range. This means you can write a routine to split a string into words without bothering about surrogate pairs and remain fully UTF-16 compliant.

How is this different for UTF-8?

> * There are no characters with upper/lowercase beyond widechar range. That means if you write code that deals with character case, you don't need to bother with surrogate pairs and still remain fully UTF-16 compliant.

How expensive is a Unicode upper/lowercase conversion per se?

> * You can group Korean letters into Korean syllables, again without bothering about surrogate pairs, as Korean is one of the many languages that is entirely in widechar range.

The same applies to English and UTF-8 ;-) Selected languages can be handled in special ways, but not all.

> Many more examples exist. It's true there also exist many examples where surrogates do need to be handled. But... even if a certain piece of code doesn't handle e.g. Egyptian hieroglyphs correctly, there is no guarantee that a UTF-8 version would, since these scripts have many properties that are not compatible with text-processing code designed for western languages; they need a lot of custom code.

That's it! In everyday coding I'm happy with AnsiStrings, covering English and German. But when I want to deal with Unicode, except for display-only purposes, I want to do it right and in the most simple way. This means that I'd call the existing functions (in FPC?) for detecting non-breakable character ranges, upper/lowercase conversion etc., and use (sub)strings all over to get rid of any byte/word-count issues.

You mentioned Korean syllable splitting - is this a task occurring often in Korean programs? I don't remember when I *ever* wanted to break German or English words into syllables. At the beginning of computer-based publishing most German texts were hard to read, due to many word-break errors. Finding syllables (as possible breakpoints), especially in foreign languages, still requires using the according library functions, which (hopefully) do proper disambiguation. In my code I'd call the GetSyllable function, and then split the string at the given points - regardless of any encoding. Or, as I really did, break strings only at word boundaries, again insensitive to any encoding.

Also breaking strings for display purposes, at a given pixel count, is expensive. It's not sufficient to find possible breakpoints; it's also required to narrow down the right breakpoint by repeated tries. It's not a good idea to simply add up the widths of individual characters; instead the pixel width of every possible substring must be determined individually. This means that the efficiency does not depend much on the string encoding.

But another point becomes *really* important when libraries with the aforementioned Unicode functions are used: the application and libraries should use the *same* string encoding, to prevent frequent conversions with every function call. This suggests using the library(=platform)-specific string encoding, which can be different on e.g. Windows and Linux. Consequently a cross-platform program should be as insensitive as possible to encodings, and the whole UTF-8/16 discussion turns out to be purely academic.

This leads again to a different issue: should we declare a string type dedicated to Unicode text processing, which can vary depending on the platform/library encoding? Then everybody can decide whether to use one string type (RTL/FCL/LCL compatible) for general tasks, and the library-compatible type for text processing? Or should we bite the bullet and support different flavors of the FPC libraries, for best performance on any platform? This would also leave it to the user to select his preferred encoding, stopping any UTF discussion immediately :-]

DoDi
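Daniël's "fully UTF-16 compliant without bothering about surrogate pairs" rests on a structural property: surrogate code units occupy a reserved range (0xD800-0xDFFF) that no BMP character uses, so a scan for a BMP code unit can never match inside a surrogate pair. A sketch in Python (`utf16_units` is an ad-hoc helper, not an FPC routine):

```python
# UTF-16 code units of an astral character always fall in the reserved
# surrogate range 0xD800-0xDFFF, which no BMP character occupies.
import struct

def utf16_units(s):
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

units = utf16_units("\U00010348")  # GOTHIC LETTER HWAIR, beyond the BMP
print([hex(u) for u in units])     # ['0xd800', '0xdf48']
assert all(0xD800 <= u <= 0xDFFF for u in units)

# Hence a word splitter that scans for the BMP code unit of a space
# (0x0020, 0x00A0, ...) can never hit the middle of a surrogate pair:
space = 0x0020
mixed = utf16_units("ab \U00010348 cd")
print([i for i, u in enumerate(mixed) if u == space])  # [2, 5]
```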
Re: [fpc-devel] Unicode in the RTL (my ideas)
On Wednesday 22 August 2012 21:47:53 Felipe Monteiro de Carvalho wrote:
> Please be clear in the terminology. Don't say life is easier with utf-16 than with utf-8 if you don't mean utf-16 as it is. Just say life is easier with ucs-2 than with utf-8; then it is clear that you are talking about ucs-2 and not true utf-16.

It is with utf-16 and known character constants of the BMP. Please try it.

Martin
Re: [fpc-devel] Unicode in the RTL (my ideas)
Hi,

On 20 August 2012 23:18, Hans-Peter Diettrich drdiettri...@aol.com wrote:
> The Delphi developers wanted to implement what you suggest, but dropped that approach later again.

When Embarcadero implemented Unicode support, Delphi was a pure Windows application. They had no need to think of anything other than what Windows supports. Not to mention that they were on a tight budget and time constraint, because every minute they wasted, they lost clients moving to more up-to-date compilers and languages. So it was all about getting something out as quickly as possible, and probably cutting corners where possible.

> A character type is somewhat useless, unless all strings are UTF-32 (which is quite unlikely now). Instead substrings should be used, which can contain any number of bytes or characters.

I guess that depends on how you define the Char type. Is it meant to hold a single Unicode codepoint, or a single printable character? If the latter, then probably a bigger Char type is required.

> You also have to explain what String[4] means in a Unicode environment.

The String[] syntax in Object Pascal means you are defining a shortstring type (irrespective of compiler mode), thus an array of bytes. In this case 4 bytes are used to hold any Unicode codepoint.

> Q: Did you ever read about the new string implementation of FPC?

I have read some of the message threads that went around on fpc-devel, and I also worked on the cp branch before it was merged with Trunk. If you have any other documentation in mind, please post the URL and I'll happily take a look.

-- Regards, - Graeme -
___ fpGUI - a cross-platform Free Pascal GUI toolkit http://fpgui.sourceforge.net
Re: [fpc-devel] Unicode in the RTL (my ideas)
On 21 August 2012 07:10, Ivanko B ivankob4m...@gmail.com wrote:
> How about supporting in the RTL all versions: UCS-2 UTF-16 (for fast per-char access etc. optimizations) and UTF-8 (for an unlimited number of alphabets)?

All access-a-char-by-index-into-a-string code I have seen works in a sequential manner 99.99% of the time. For that reason there is no speed difference between using a UTF-16 or UTF-8 encoded string. Both can be coded equally efficiently.

-- Regards, - Graeme -
Re: [fpc-devel] Unicode in the RTL (my ideas)
Am 21.08.2012 09:55, schrieb Graeme Geldenhuys:
> On 21 August 2012 07:10, Ivanko B ivankob4m...@gmail.com wrote:
>> How about supporting in the RTL all versions: UCS-2 UTF-16 (for fast per-char access etc. optimizations) and UTF-8 (for an unlimited number of alphabets)?
> All access-a-char-by-index-into-a-string code I have seen works in a sequential manner 99.99% of the time. For that reason there is no speed difference between using a UTF-16 or UTF-8 encoded string. Both can be coded equally efficiently.

Graeme, this is simply not true. Searching for known German characters in a UnicodeString, the program can use the simple approach by character (code unit) index. It is even possible for known Chinese symbols of the BMP. And a simple if for surrogate pairs is more efficient than a 4-stage case for utf-8.

Martin
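Martin's approach - comparing a known BMP character constant directly against UTF-16 code units, with no surrogate handling at all - can be sketched as follows (Python used for a runnable illustration; the helper names are invented). The same sketch also shows why a byte-wise search stays valid in UTF-8, which is the core of Graeme's counterpoint: both encodings are self-synchronizing for this kind of search.

```python
import struct

def find_char_utf16(units, ch):
    """Search for a known BMP character as a direct code-unit compare.
    No surrogate check needed: BMP units never collide with pairs."""
    target = ord(ch)
    for i, u in enumerate(units):
        if u == target:
            return i
    return -1

def find_char_utf8(b, ch):
    """The same search in UTF-8 compares a 1..3-byte subsequence; a
    byte-wise search is still valid, since no UTF-8 sequence can occur
    as part of another."""
    return b.find(ch.encode("utf-8"))

text = "Grüße, 中国"
units = list(struct.unpack("<%dH" % len(text), text.encode("utf-16-le")))
print(find_char_utf16(units, "中"))                    # 7
print(find_char_utf8(text.encode("utf-8"), "中"))      # 9 (byte offset)
```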
Re: [fpc-devel] Unicode in the RTL (my ideas)
> For that reason there is no speed difference between using a UTF-16 or UTF-8 encoded string. Both can be coded equally efficiently.

Not in general, since UTF-8 needs error handling, replacement of unconvertible bytes, etc. - operations which may affect the initial data, which makes per-byte comparison unreliable.
Re: [fpc-devel] Unicode in the RTL (my ideas)
I always get excited by how Graeme defends the solutions of his choice :)
Re: [fpc-devel] Unicode in the RTL (my ideas)
On Tue, 21 Aug 2012 14:59:57 +0500, Ivanko B ivankob4m...@gmail.com wrote:
>> For that reason there is no speed difference between using a UTF-16 or UTF-8 encoded string. Both can be coded equally efficiently.
> Not in general, since UTF-8 needs error handling, replacement of unconvertible bytes, etc. - operations which may affect the initial data, which makes per-byte comparison unreliable.

For example?

Mattias
Re: [fpc-devel] Unicode in the RTL (my ideas)
Martin Schreiber schrieb:
>> All access-a-char-by-index-into-a-string code I have seen works in a sequential manner 99.99% of the time. For that reason there is no speed difference between using a UTF-16 or UTF-8 encoded string. Both can be coded equally efficiently.
> Graeme, this is simply not true. Searching for known German characters in a UnicodeString, the program can use the simple approach by character (code unit) index. It is even possible for known Chinese symbols of the BMP. And a simple if for surrogate pairs is more efficient than a 4-stage case for utf-8.

The good ole Pos() can do that - why search for more complicated implementations? You still try to use old coding patterns which are simply inappropriate for dealing with Unicode strings. Why make a distinction between searching for a single character or multiple characters, when it's known that one character can require multiple bytes or words in UTF-8/16?

DoDi
Re: [fpc-devel] Unicode in the RTL (my ideas)
Graeme Geldenhuys schrieb:
> On 20 August 2012 23:18, Hans-Peter Diettrich drdiettri...@aol.com wrote:
>> The Delphi developers wanted to implement what you suggest, but dropped that approach later again.
> When Embarcadero implemented Unicode support, Delphi was a pure Windows application. They had no need to think of anything other than what Windows supports.

So what? The poor performance of a variable char-size string type is not related to any platform.

>> A character type is somewhat useless, unless all strings are UTF-32 (which is quite unlikely now). Instead substrings should be used, which can contain any number of bytes or characters.
> I guess that depends on how you define the Char type. Is it meant to hold a single Unicode codepoint, or a single printable character? If the latter, then probably a bigger Char type is required.

A string can contain any number of characters, including zero. Why make a distinction between handling a single character and handling multiple characters? A UTF-32 Char type will require implicit conversion into a string before it can be used with strings of any other encoding. Not very efficient, indeed :-(

>> You also have to explain what String[4] means in a Unicode environment.
> The String[] syntax in Object Pascal means you are defining a shortstring type (irrespective of compiler mode), thus an array of bytes. In this case 4 bytes are used to hold any Unicode codepoint.

Why abuse a ShortString type, when any ordinal 4-byte value will do the same? Did you consider that ShortStrings deserve special handling, w.r.t. e.g. their Length field? The 5 bytes in memory also don't fit nicely into an aligned memory layout, and the compiler may insert range checking and other useless code. When ordinary ShortStrings have their own fixed encoding (CP_ACP?), you'll have to tell the compiler to ignore all that when dealing with your Char = String[4] type :-(

>> Q: Did you ever read about the new string implementation of FPC?
> I have read some of the message threads that went around on fpc-devel, and I also worked on the cp branch before it was merged with Trunk. If you have any other documentation in mind, please post the URL and I'll happily take a look.

Then read it again, you seem to have missed essential points.

DoDi
Re: [fpc-devel] Unicode in the RTL (my ideas)
Am 21.08.2012 12:52, schrieb Hans-Peter Diettrich:
> The good ole Pos() can do that - why search for more complicated implementations? You still try to use old coding patterns which are simply inappropriate for dealing with Unicode strings. Why make a distinction between searching for a single character or multiple characters, when it's known that one character can require multiple bytes or words in UTF-8/16?

I wrote known German characters and known Chinese symbols of the BMP - for example, character constants. If you want to read some examples of the problems utf-8 causes, especially for pupils and Pascal beginners, read the German Lazarus forum or freepascal.ru. Why should we design programming so that it complicates the work for them? Anyway, I don't care, do what you want - but please implement Unicode resource strings in the FPC compiler.

Thanks, Martin
Re: [fpc-devel] Unicode in the RTL (my ideas)
> If you replied to this mail then you lost me. I don't understand what problem of UTF-8 for the RTL you want to point out. Can you explain again?

Substring manipulation etc. works only via normalization to a fixed-char type, which may be inefficient (especially because it is performed for each input argument and also for the output - overhead multiplied by 3). The ideal might be an optimized string RTL (without pre/post-normalization) with the same set of procedures, functions and string-related classes for UTF-8, UCS-2 and possibly UCS-4 or UTF-16, with working assignments between them.
Re: [fpc-devel] Unicode in the RTL (my ideas)
Martin Schreiber schrieb:
> I wrote known German characters and known Chinese symbols of the BMP - for example, character constants. If you want to read some examples of the problems utf-8 causes, especially for pupils and Pascal beginners, read the German Lazarus forum or freepascal.ru. Why should we design programming so that it complicates the work for them? Anyway, I don't care, do what you want - but please implement Unicode resource strings in the FPC compiler.

You still miss the point. Why deal with single characters, by index, when working with substrings also covers the single-character use?

DoDi
[fpc-devel] Unicode in the RTL (my ideas)
...Continuing the discussion of a Unicode RTL in a new thread as promised...

I obviously have a lot of issues with the RTL suggestions being thrown around in the past. E.g. I have heard lots about the RTL most likely being UTF-16 only, or being split into 3 versions - AnsiString, UTF-16 and UTF-8 (a maintenance nightmare). Why? Why can't you have code as follows:

{$IFDEF WINDOWS}
  UnicodeString = type AnsiString(CP_UTF16);
{$ELSE}
  // probably not strictly correct, but assuming *nix here. But you get the idea
  UnicodeString = type AnsiString(CP_UTF8);
{$ENDIF}
  String = type UnicodeString;
  Char = type String[4]; // the maximum size of a Unicode codepoint is 4 bytes

Now the RTL can have something like:

  Exception = class
  public
    property Message: string read
  end;

  TStrings = class(...)
  public
    function Add(const AText: String): integer;
    // I'm not 100% sure about the actual signature, but UTF-8 is probably a very safe bet
    // for the default, because 99.% of unicode text is stored in UTF-8, and
    // ANSI text could safely load too. If the developer knows otherwise, they can always
    // pass a different encoding constant to the function.
    procedure LoadFromFile(const AFilename: String; AEncoding: TEncoding = cp_UTF8);
  end;

This should be pretty Delphi compatible, meaning Delphi code could probably compile under FPC on Windows without much need for change. As far as I know, Delphi compatibility is only meant for the Windows platform, and for Delphi code moving to FPC (not the other way round).

Also, the locale variables can now store things like the Russian thousand separator (U+00A0) in a Char too. For those that didn't know: the Russian locale uses the non-breaking space as thousand separator, which in UTF-8 is 'C2 A0' (bytes) and takes up 2 bytes of memory. There might be similar locale variables in other languages that take up more bytes per character.
In general, encoding conversions will be reduced on each platform, or no conversion is needed at all, because the native encoding is always used.

-- Regards, - Graeme -
Re: [fpc-devel] Unicode in the RTL (my ideas)
Graeme Geldenhuys schrieb:
> {$IFDEF WINDOWS}
>   UnicodeString = type AnsiString(CP_UTF16);

AnsiStrings consist of bytes only, for good reasons (mostly performance). The Delphi developers wanted to implement what you suggest, but dropped that approach later again. String classes have the same performance problems, so that e.g. in .NET it's suggested to use functions instead of string operators. In Delphi and FPC compiler magic is used instead of classes.

> {$ELSE}
>   // probably not strictly correct, but assuming *nix here. But you get the idea
>   UnicodeString = type AnsiString(CP_UTF8);
> {$ENDIF}
>   String = type UnicodeString;
>   Char = type String[4]; // the maximum size of a Unicode codepoint is 4 bytes

A character type is somewhat useless, unless all strings are UTF-32 (which is quite unlikely now). Instead substrings should be used, which can contain any number of bytes or characters. You also have to explain what String[4] means in a Unicode environment. The ShortString type does not have an encoding, and thus is deprecated in a Unicode environment.

Q: Did you ever read about the new string implementation of FPC? Do you really want to reinvent the wheel, in another incompatible way?

DoDi