Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
On Mon, Jan 7, 2013 at 6:05 PM, Mark Morgan Lloyd <markmll.fpc-de...@telemetry.co.uk> wrote:

> Tomas Hajny wrote:
>> On Mon, January 7, 2013 13:28, Ewald wrote:
>>> Once upon a time, on 01/07/2013 12:39 PM to be precise, Michael Schnell said:
>>>> On 01/05/2013 12:28 PM, Jonas Maebe wrote:
>>>>> Using whatever #xx#xx or #xx#xx#xx sequence represents the UTF-8
>>>>> encoding of that character.
>>>> Sorry, I can't follow. Does #xx not just define a numerical
>>>> representation of an 8 bit entity ?
>>>>
>>>> The interpretation in any code might be done later by any code that
>>>> digests the string.
>>>>
>>>> Am I wrong ?
>>> I *think* Jonas is trying to say that if you want the character `Ǿ` in a
>>> string you would either type
>>> - 'Ǿ' or
>>> - #$C7#$BE if you want to keep the source free of encoding specific
>>>   characters
>> ...or
>> - #$01FE and then the whole string becomes a Unicode string which is
>>   either kept that way (if it is assigned to a UnicodeString constant), or
>>   it is converted to some 8-bit encoding at compile time (if it is assigned
>>   to an 8-bit constant/variable like ansistring)
>>
>> (also just my understanding of what Jonas wrote)
>
> That's how I read it as well. In which case, is #$A3 16-bit Unicode
> (representing the UK £ Sterling) or malformed UTF-8 (should be #$C2#$A3)?

The way I understand it, #$A3 will be affected by the $codepage directive of the source file. So, if the programmer correctly sets $codepage to match the encoding used in the editor (be it UTF-8 or some other encoding), the compiler will also 'understand' that string correctly. If the programmer never uses UnicodeString, and always uses the codepage which was used to write the source code, everything will work fine: #$A3 will stay whatever it is in that specific encoding.

On the other hand, if a situation arises in which a string containing #$A3 needs to be converted to UnicodeString, the compiler will either:
a) convert it correctly to UnicodeString if the encoding used is UTF-8, or
b) call a system-specific function to convert the string to an array of WideChars (in which case the correctness of the program depends on support for the specific encoding on the target system).

___
fpc-devel maillist - fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel
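To make the question about #$A3 concrete, here is a minimal sketch (not taken from the thread) that prints how the pound sign, U+00A3, looks as a UTF-16 code unit and as UTF-8 bytes. It assumes a reasonably recent FPC in objfpc mode and uses only UTF8Encode and HexStr from the System unit; the program and variable names are made up for illustration.

```pascal
program PoundSign;
{$mode objfpc}{$H+}

var
  U: UnicodeString;
  S: UTF8String;
  i: Integer;
begin
  // U+00A3 given as a numeric value and stored as one UTF-16 code unit.
  U := WideChar($00A3);
  WriteLn('UTF-16 code units: ', Length(U));

  // UTF8Encode (System unit) turns the UTF-16 data into UTF-8 bytes.
  S := UTF8Encode(U);
  Write('UTF-8 bytes:');
  for i := 1 to Length(S) do
    Write(' $', HexStr(Ord(S[i]), 2));
  WriteLn;
end.
```

Under those assumptions this should report one UTF-16 code unit and the two UTF-8 bytes $C2 $A3. A lone byte $A3 is therefore only a pound sign under a single-byte codepage such as ISO 8859-1 or CP1252.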
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
Tomas Hajny wrote:
> On Mon, January 7, 2013 13:28, Ewald wrote:
>> Once upon a time, on 01/07/2013 12:39 PM to be precise, Michael Schnell said:
>>> On 01/05/2013 12:28 PM, Jonas Maebe wrote:
>>>> Using whatever #xx#xx or #xx#xx#xx sequence represents the UTF-8
>>>> encoding of that character.
>>> Sorry, I can't follow. Does #xx not just define a numerical
>>> representation of an 8 bit entity ?
>>>
>>> The interpretation in any code might be done later by any code that
>>> digests the string.
>>>
>>> Am I wrong ?
>> I *think* Jonas is trying to say that if you want the character `Ǿ` in a
>> string you would either type
>> - 'Ǿ' or
>> - #$C7#$BE if you want to keep the source free of encoding specific
>>   characters
> ...or
> - #$01FE and then the whole string becomes a Unicode string which is
>   either kept that way (if it is assigned to a UnicodeString constant), or
>   it is converted to some 8-bit encoding at compile time (if it is assigned
>   to an 8-bit constant/variable like ansistring)
>
> (also just my understanding of what Jonas wrote)

That's how I read it as well. In which case, is #$A3 16-bit Unicode (representing the UK £ Sterling) or malformed UTF-8 (should be #$C2#$A3)?

--
Mark Morgan Lloyd
markMLl .AT. telemetry.co .DOT. uk

[Opinions above are the author's, not those of his employers or colleagues]
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
Once upon a time, on 01/07/2013 05:05 PM to be precise, Tomas Hajny said:
> On Mon, January 7, 2013 14:19, Michael Schnell wrote:
>> On 01/07/2013 02:01 PM, Tomas Hajny wrote:
>>> (also just my understanding of what Jonas wrote)
>> I feel you are wrong. The string does not know about the code its
>> content is to be interpreted in (other than with Delphi XE).
> Sorry, your way of quoting makes it difficult for others to react.
>
> I freely admit that I may be wrong, but I don't understand what you meant
> with your comment and thus I don't understand in what way I am wrong
> in your view. The compiler obviously knows how the constant is used within
> the source code and thus it may proceed accordingly (i.e. either convert
> it to some 8-bit encoding at compile time if UTF-16 code constant appears
> in the source, or keep it in UTF-16 if assigned to a UnicodeString
> constant).

Yep, the compiler does know how the constant is used and how it is defined (how else could it generate working code?), but I don't see how it could do something with it if it is assigned to another type of string (by type I mean `one-byte versus two-byte`). The compiler can't know for sure what you mean; it can do at least these things:

- Copy data without translating, so a one-char two-byte string becomes a two-char one-byte string; a three-char one-byte string would become a three-char two-byte string; and then there is a paradox: should a three-char two-byte string become a six-char one-byte string?
  ==> this is probably not how it is done
- Translate the meanings of the characters of the string, but here the compiler needs to know in what encoding they are and in what encoding the string is wanted (which it doesn't, I believe; the $codepage directive is only used for the encoding of the characters in the unit itself)
  ==> I think this also isn't a possibility
- Copy the data byte per byte, but then a one-byte string containing an odd number of chars needs padding, and there are issues with endianness here
  ==> Not really an option, no?
- Truncate every value of a two-byte string to convert it to a one-byte string; the other way around would put each character of the one-byte string as one in the two-byte string
  ==> Solves the first paradox, but introduces loss of data

==> All the above options (except the translation, that is) ignore the escape character(s) of the string, so you won't get the data you want.

IMO I don't think it (typecasting a one-byte string to a two-byte string) can be done without human intervention. Look at it this way: typecasting a thread handle to an integer makes no sense either:

- They are both related (a thread handle is definitely a number, even if it is a pointer)
- But putting one in the other makes no sense at all: what does `comparing whether a thread id is less than zero` mean? On the other hand, `comparing whether an integer is less than zero` has a distinct meaning.
- The sizes may be different (say an integer 16 bits long and a thread handle 64 bits long); how do you put one in the other? Sum the bytes together? Multiply them? Take the 16-bit CRC of the handle?

This is IMO the same with a one-byte char and a two-byte char:

- They both represent letters/words/...
- But they are not the same and cannot be typecast without extra knowledge.

This last point is also valid for my example above: you could put all thread ids you know of in a lookup table and put the index in that lookup table in the 16-bit integer. Fixed. The same goes for our strings: if you know one is UTF-8 and you want to convert it to UTF-16 it can be done without error, but without this extra knowledge it can't give you decisive results.

Just a few points I think bear some potential to contemplate over a cup of $c0ffee ;-)

--
Ewald
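The difference between widening bytes blindly and translating them with knowledge of the encoding can be sketched as follows. This is only an illustration under stated assumptions: it uses UTF8Decode and HexStr from the System unit, and the Dump helper and all variable names are hypothetical, not something from the thread.

```pascal
program WidenVersusDecode;
{$mode objfpc}{$H+}

// Print every UTF-16 code unit of W in hexadecimal.
procedure Dump(const Title: string; const W: UnicodeString);
var
  i: Integer;
begin
  Write(Title, ':');
  for i := 1 to Length(W) do
    Write(' $', HexStr(Ord(W[i]), 4));
  WriteLn;
end;

var
  Raw: AnsiString;
  Naive, Decoded: UnicodeString;
  i: Integer;
begin
  // Two bytes that happen to be the UTF-8 encoding of U+01FE.
  Raw := #$C7#$BE;

  // Option 1: widen each byte to 16 bits without knowing the encoding.
  SetLength(Naive, Length(Raw));
  for i := 1 to Length(Raw) do
    Naive[i] := WideChar(Ord(Raw[i]));

  // Option 2: translate, telling the RTL that the bytes are UTF-8.
  Decoded := UTF8Decode(Raw);

  Dump('widened', Naive);
  Dump('decoded', Decoded);
end.
```

The widened string should come out as $00C7 $00BE (two unrelated characters), while the decoded one should be the single code unit $01FE. Only the second result is what the author of the bytes meant, and it needed the extra knowledge that the input was UTF-8.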
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
On Mon, January 7, 2013 14:19, Michael Schnell wrote:
> On 01/07/2013 02:01 PM, Tomas Hajny wrote:
>> (also just my understanding of what Jonas wrote)
>
> I feel you are wrong. The string does not know about the code its
> content is to be interpreted in (other than with Delphi XE).

Sorry, your way of quoting makes it difficult for others to react.

I freely admit that I may be wrong, but I don't understand what you meant with your comment and thus I don't understand in what way I am wrong in your view. The compiler obviously knows how the constant is used within the source code and thus it may proceed accordingly (i.e. either convert it to some 8-bit encoding at compile time if a UTF-16 code constant appears in the source, or keep it in UTF-16 if assigned to a UnicodeString constant).

Tomas
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
Once upon a time, on 01/07/2013 02:17 PM to be precise, Michael Schnell said:
> So the ambiguity with _filling_ a string with data in fact arises
> when _not_ using the #nn notation :-) . With #nn the effect (i.e. the
> resulting binary) is obvious.

Well, if there is literally the sequence $C7, $BE in your source code (that is, open up a hex editor and actually see the values there, as one byte each), that would also do the same, as the compiler will default to one-byte strings, I think. The only issue with this is that you also need to set your code editor to the encoding you want, because otherwise it will screw up the display and possibly the binary value of the character. So, yes, I would say the #nn notation is probably the safest to use. It is also handy if your character contains (or is) something that `cannot be there`, like a newline: #10 (or #13#10 under Windows).

Also, if you use a literal UTF-16 char in the code (so no #, but the actual character) I think the {$codepage utf16} directive might come in handy, as otherwise the compiler will interpret this series of bytes as separate single-byte characters. This is however not an issue with the # notation, as there is no ambiguity with this interpretation.

--
Ewald
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
On 01/07/2013 02:01 PM, Tomas Hajny wrote:
> (also just my understanding of what Jonas wrote)

I feel you are wrong. The string does not know about the code its content is to be interpreted in (other than with Delphi XE).

-Michael
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
So the ambiguity with _filling_ a string with data in fact arises when _not_ using the #nn notation :-) . With #nn the effect (i.e. the resulting binary) is obvious.

-Michael
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
On Mon, January 7, 2013 13:28, Ewald wrote:
> Once upon a time, on 01/07/2013 12:39 PM to be precise, Michael Schnell said:
>> On 01/05/2013 12:28 PM, Jonas Maebe wrote:
>>> Using whatever #xx#xx or #xx#xx#xx sequence represents the UTF-8
>>> encoding of that character.
>> Sorry, I can't follow. Does #xx not just define a numerical
>> representation of an 8 bit entity ?
>>
>> The interpretation in any code might be done later by any code that
>> digests the string.
>>
>> Am I wrong ?
> I *think* Jonas is trying to say that if you want the character `Ǿ` in a
> string you would either type
> - 'Ǿ' or
> - #$C7#$BE if you want to keep the source free of encoding specific
>   characters

...or
- #$01FE and then the whole string becomes a Unicode string which is either kept that way (if it is assigned to a UnicodeString constant), or it is converted to some 8-bit encoding at compile time (if it is assigned to an 8-bit constant/variable like ansistring)

(also just my understanding of what Jonas wrote)

Tomas
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
Once upon a time, on 01/07/2013 12:39 PM to be precise, Michael Schnell said:
> On 01/05/2013 12:28 PM, Jonas Maebe wrote:
>> Using whatever #xx#xx or #xx#xx#xx sequence represents the UTF-8
>> encoding of that character.
> Sorry, I can't follow. Does #xx not just define a numerical
> representation of an 8 bit entity ?
>
> The interpretation in any code might be done later by any code that
> digests the string.
>
> Am I wrong ?

I *think* Jonas is trying to say that if you want the character `Ǿ` in a string you would either type
- 'Ǿ' or
- #$C7#$BE if you want to keep the source free of encoding specific characters

You as a programmer decide what you do with it afterwards: if you write it to a UTF-8 terminal, you would get `Ǿ`, and if you write it to some other terminal you might see a character that matches $C7, followed by a character that matches $BE, in the lookup table of that terminal's encoding.

Look at it this way: the byte sequence ($C7, $BE) has no meaning to the compiler whatsoever; it is a byte sequence. That's what matters to the compiler; what is in this sequence is for you to decide.

Correct me if I'm wrong.

--
Ewald
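As a rough illustration of the two #-notations discussed here (a sketch, not code from the thread), the program below stores #$C7#$BE in an 8-bit string and #$01FE in a UnicodeString, then prints their lengths and the UTF-8 bytes of the latter. The variable names are invented, and the only RTL routines assumed are UTF8Encode and HexStr from the System unit.

```pascal
program TwoSpellings;
{$mode objfpc}{$H+}

var
  Narrow: AnsiString;
  Wide: UnicodeString;
  Encoded: UTF8String;
  i: Integer;
begin
  Narrow := #$C7#$BE;  // two 8-bit values; no encoding is implied
  Wide := #$01FE;      // one 16-bit value, U+01FE

  WriteLn('Narrow: ', Length(Narrow), ' element(s)');
  WriteLn('Wide: ', Length(Wide), ' element(s)');

  // Encoding the 16-bit form as UTF-8 yields the same two bytes again.
  Encoded := UTF8Encode(Wide);
  Write('UTF-8 of Wide:');
  for i := 1 to Length(Encoded) do
    Write(' $', HexStr(Ord(Encoded[i]), 2));
  WriteLn;
end.
```

The expected result is a length of 2 for the 8-bit string, a length of 1 for the UnicodeString, and the bytes $C7 $BE for the encoded form. The program never prints the strings themselves, since what a terminal shows for those bytes depends on the terminal's encoding, as described above.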
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
On 01/05/2013 01:35 PM, Jy V wrote:
> I do vote for UTF-8

-1

Given that conversions in the RTL (or LCL) are a rather rare runtime task, GUI performance issues do not really need to be considered.

The viable issues seem to be Delphi compatibility, backward compatibility, usability, and runtime performance with time-consuming complex string tasks (these seem to vote against UTF-8, but for either static UTF-16 or (quasi-)dynamic (CE-alike) encoding); and memory usage and runtime performance with time-consuming simple string tasks (which vote for locale-based ANSI or UTF-8).

-Michael
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
On 01/05/2013 01:10 PM, Michael Van Canneyt wrote:
> String = Ansistring.

Which is the mother of all confusion, IMHO :-[ .

-Michael
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
On 01/05/2013 12:28 PM, Jonas Maebe wrote:
> Using whatever #xx#xx or #xx#xx#xx sequence represents the UTF-8
> encoding of that character.

Sorry, I can't follow. Does #xx not just define a numerical representation of an 8 bit entity ?

The interpretation in any code might be done later by any code that digests the string.

Am I wrong ?

-Michael
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
On Sun, 6 Jan 2013, Hans-Peter Diettrich wrote:

> Michael Van Canneyt wrote:
>>> IMO resource strings are for display purposes, so that UTF-8/16
>>> encoding is expected by an OS API. AFAIR Win32 string resources are
>>> stored in UTF-16,
>> You are very much wrong.
> Not really. I was talking about Win32 resources, not about what FPC
> makes from resourcestring.

The discussion is about unnecessary conversions in *FPC resourcestrings*, not about win32 resources. Why you brought up the Windows resourcestrings was (and is) a mystery to me. From your statement, I assumed that you probably thought FPC stores its resourcestrings as win32 resources. It does not.

>> To start with, resource strings are not stored as Win32 resources.
>> Secondly, they are stored in the code as an ansistring. The resource
>> string of the above example is stored as:
>>
>>   .globl _$PROGRAM$_Ld2
>>   _$PROGRAM$_Ld2:
>>           .ascii "Something\000"
>>   .balign 8
>>           .short 0,1
>>           .long 0
>>           .quad -1,15
>>   .globl _$PROGRAM$_Ld3
>>
>> Thirdly: in my apps, no UTF-8/16 encoding is expected by the OS. If it
>> were, I would have used widestrings instead of ansistring to begin
>> with, and in that case I would not have made any remark...
> I don't know which OS you're using, but the WinAPI uses UTF-16 throughout.

I use both Windows and Linux. You are mistakenly assuming that I am using Windows GUI calls or so. There is no GUI.

Probably the only call that cares about codepage is FileCreate(), and that is not done using resource strings. For the rest, all is done using FileWrite() and sendto()/recvfrom(). Neither cares about encoding. They transfer bytes, that's it.

So I use ansistrings throughout. And hence, resourcestrings being stored in unicode format would cause totally unnecessary conversions.

Michael.
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
Paul Ishenin wrote:
> 05.01.13, 23:54, Michael Van Canneyt wrote:
>> You are very much wrong.
>>
>> To start with, resource strings are not stored as Win32 resources.
> I personally think that resources should be stored in their native
> formats where possible. This will allow changing them using software
> designed for that environment. For example, for Windows there are many
> resource editors which can replace icons, bitmaps and string resources
> too. It would be nice to have this ability also for binaries which FPC
> produces.
>
> On OS X resources are also stored differently from what FPC does
> currently - they are stored in application bundles as far as I know, so
> they can be edited by external programs too.

Point taken :-)

But I'm not sure about the present-day use of native resources. Even on Windows most programs nowadays don't use Windows resources for their menus, dialog boxes etc. any more. I've used the Delphi ResourceWorkshop for some time, to tweak some third-party programs and even Windows itself. This will be almost impossible with current software. Try e.g. to set the Windows menu color to yellow, which I did for a long time, and you'll find out that the Explorer and many other Windows tools don't honor that setting. Or you'll find that these system settings have been removed altogether, replaced perhaps by themes?

So I'm not sure about the use of native resources nowadays. How should a multi-platform application handle a string or graphical (icon...) resource, so that it can be designed on one platform and be shown on all other platforms without modifications?

With graphical resources I'd use a single internal (FPC) format, which is converted by the widgetset for use on the target platform.

String resources may require more adjustments than only a translation, to match the different semantics of other languages, independently of the target platform. That's why I'd suggest UTF-8 encoding for resource strings, which doesn't affect program logic because AnsiString can still be used.

The *encoded* AnsiStrings require that the coder knows the best encoding of every string when he wants to reduce the number of implicit string conversions. Using AnsiString(CP_ACP) may be a reasonable decision for a program with *very* limited usage (one country, one language, one target platform...), but FPC should support programs with a broader audience as well.

DoDi
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
Michael Van Canneyt wrote:
>> IMO resource strings are for display purposes, so that UTF-8/16
>> encoding is expected by an OS API. AFAIR Win32 string resources are
>> stored in UTF-16,
> You are very much wrong.

Not really. I was talking about Win32 resources, not about what FPC makes from resourcestring.

> To start with, resource strings are not stored as Win32 resources.
> Secondly, they are stored in the code as an ansistring. The resource
> string of the above example is stored as:
>
>   .globl _$PROGRAM$_Ld2
>   _$PROGRAM$_Ld2:
>           .ascii "Something\000"
>   .balign 8
>           .short 0,1
>           .long 0
>           .quad -1,15
>   .globl _$PROGRAM$_Ld3
>
> Thirdly: in my apps, no UTF-8/16 encoding is expected by the OS. If it
> were, I would have used widestrings instead of ansistring to begin
> with, and in that case I would not have made any remark...

I don't know which OS you're using, but the WinAPI uses UTF-16 throughout. I suppose that other OSes also use some Unicode string representation, for lossless representation of texts in all languages. The dual W/A interface of Win32 is due to the stripped-down Win9x versions, which require Unicode extensions for supporting more than CP_ACP. But now we are in 2013, with Unicode being present everywhere.

> So the conversion really would be 100% totally redundant.

It may look so to you... Why then do you use resourcestring instead of ordinary string constants?

Another note and question, about multi-lingual resources. Windows resource scripts (.RC) allow for multi-lingual stringtables. In my recent research I learned that the resource compiler extracts the requested language from the script, and stores only these strings in the resource file (.RES) and application (.EXE, .DLL). That's why resourcestring was added to Delphi. How does FPC support the same? (.PO files?)

DoDi
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
On Sat, 5 Jan 2013, Paul Ishenin wrote:

> 05.01.13, 23:54, Michael Van Canneyt wrote:
>> You are very much wrong.
>>
>> To start with, resource strings are not stored as Win32 resources.
> I personally think that resources should be stored in their native
> formats where possible. This will allow changing them using software
> designed for that environment. For example, for Windows there are many
> resource editors which can replace icons, bitmaps and string resources
> too. It would be nice to have this ability also for binaries which FPC
> produces.
>
> On OS X resources are also stored differently from what FPC does
> currently - they are stored in application bundles as far as I know, so
> they can be edited by external programs too.

And Jonas is worried about the overhead in the compiler caused by a simple 1/2-byte format? I doubt this will relieve his worries ;-)

The idea of FPC's resourcestrings implementation has always been to be independent of any OS features, so it is a cross-platform solution. That's how it is implemented, and as far as I am concerned that's how it should stay as the default.

If we do what you suggest by default, then people making a cross-platform app will need to start using different technologies to translate their strings. I doubt that is a good solution.

Currently, people who want to use native win32/OSX resource strings always have the option of doing so, but no special language support for it exists.

Michael.
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
05.01.13, 23:54, Michael Van Canneyt wrote:
> You are very much wrong.
>
> To start with, resource strings are not stored as Win32 resources.

I personally think that resources should be stored in their native formats where possible. This will allow changing them using software designed for that environment. For example, for Windows there are many resource editors which can replace icons, bitmaps and string resources too. It would be nice to have this ability also for binaries which FPC produces.

On OS X resources are also stored differently from what FPC does currently - they are stored in application bundles as far as I know, so they can be edited by external programs too.

Best regards,
Paul Ishenin
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
On Sat, 5 Jan 2013, Hans-Peter Diettrich wrote:

> Michael Van Canneyt wrote:
>> On Sat, 5 Jan 2013, Jonas Maebe wrote:
>>> On 05 Jan 2013, at 12:53, Paul Ishenin wrote:
>>>> ResourceStrings are stored as AnsiString type with 0 codepage (as I
>>>> remember). Delphi now stores ResourceStrings as UnicodeString type.
>>>> I think FPC will follow this in m_default_unicodestring modeswitch.
>>> It would probably even be better to always do that. At least I don't
>>> see a downside, other than slightly larger binaries (and that's not
>>> an issue in this case as far as I'm concerned; maintaining two
>>> separate resourcestring systems/handlers is just not worth the
>>> trouble).
>> But it means that for
>>
>>   Resourcestring
>>     AString = 'Something';
>>
>>   Var
>>     S : Ansistring;
>>
>>   begin
>>     S:=AString;
>>   end.
>>
>> Always a conversion will happen. I do not think this is a good idea
>> given that currently, String = Ansistring.
> IMO resource strings are for display purposes, so that UTF-8/16
> encoding is expected by an OS API. AFAIR Win32 string resources are
> stored in UTF-16,

You are very much wrong.

To start with, resource strings are not stored as Win32 resources.

Secondly, they are stored in the code as an ansistring. The resource string of the above example is stored as:

  .globl _$PROGRAM$_Ld2
  _$PROGRAM$_Ld2:
          .ascii "Something\000"
  .balign 8
          .short 0,1
          .long 0
          .quad -1,15
  .globl _$PROGRAM$_Ld3

Thirdly: in my apps, no UTF-8/16 encoding is expected by the OS. If it were, I would have used widestrings instead of ansistring to begin with, and in that case I would not have made any remark...

So the conversion really would be 100% totally redundant.

Michael.
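For readers who want to try the case being argued about, here is a small runnable variant of the Resourcestring example, offered as a sketch rather than as anything posted in the thread. The program name and the extra UnicodeString variable are additions for illustration; whether a conversion happens on each assignment depends on how the compiler version in use stores resourcestrings, which is exactly the point under discussion.

```pascal
program ResStrDemo;
{$mode objfpc}{$H+}

resourcestring
  AString = 'Something';

var
  S: AnsiString;
  U: UnicodeString;
begin
  // With resourcestrings stored as AnsiString this is a plain assignment;
  // with UnicodeString storage it would need a conversion.
  S := AString;

  // The reverse holds here: AnsiString storage has to be widened to UTF-16.
  U := AString;

  WriteLn(S, ': ', Length(S), ' bytes');
  WriteLn(Length(U), ' UTF-16 code units');
end.
```

Because the text is pure ASCII, both lengths should be 9 regardless of the storage format. The difference between the two proposals lies only in which of the two assignments pays for a conversion.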
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
Michael Van Canneyt wrote:
> On Sat, 5 Jan 2013, Jonas Maebe wrote:
>> On 05 Jan 2013, at 12:53, Paul Ishenin wrote:
>>> ResourceStrings are stored as AnsiString type with 0 codepage (as I
>>> remember). Delphi now stores ResourceStrings as UnicodeString type.
>>> I think FPC will follow this in m_default_unicodestring modeswitch.
>> It would probably even be better to always do that. At least I don't
>> see a downside, other than slightly larger binaries (and that's not
>> an issue in this case as far as I'm concerned; maintaining two
>> separate resourcestring systems/handlers is just not worth the
>> trouble).
> But it means that for
>
>   Resourcestring
>     AString = 'Something';
>
>   Var
>     S : Ansistring;
>
>   begin
>     S:=AString;
>   end.
>
> Always a conversion will happen. I do not think this is a good idea
> given that currently, String = Ansistring.

IMO resource strings are for display purposes, so that UTF-8/16 encoding is expected by an OS API. AFAIR Win32 string resources are stored in UTF-16, so that assignments to an AnsiString already require a conversion. So IMO UTF-8 would be better, for now and in future.

DoDi
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
On Sat, 5 Jan 2013, Sven Barth wrote:

> On 05.01.2013 14:16, Michael Van Canneyt wrote:
>> On Sat, 5 Jan 2013, Jonas Maebe wrote:
>>> On 05 Jan 2013, at 13:10, Michael Van Canneyt wrote:
>>>> On Sat, 5 Jan 2013, Jonas Maebe wrote:
>>>>> On 05 Jan 2013, at 12:53, Paul Ishenin wrote:
>>>>>> ResourceStrings are stored as AnsiString type with 0 codepage
>>>>>> (as I remember). Delphi now stores ResourceStrings as
>>>>>> UnicodeString type. I think FPC will follow this in
>>>>>> m_default_unicodestring modeswitch.
>>>>> It would probably even be better to always do that. At least I
>>>>> don't see a downside, other than slightly larger binaries (and
>>>>> that's not an issue in this case as far as I'm concerned;
>>>>> maintaining two separate resourcestring systems/handlers is just
>>>>> not worth the trouble).
>>>> But it means that for
>>>>
>>>>   Resourcestring
>>>>     AString = 'Something';
>>>>
>>>>   Var
>>>>     S : Ansistring;
>>>>
>>>>   begin
>>>>     S:=AString;
>>>>   end.
>>>>
>>>> Always a conversion will happen. I do not think this is a good
>>>> idea given that currently, String = Ansistring.
>>> String will always be shortstring or ansistring in the syntax modes
>>> in which that is currently the case. And yes, it will involve a
>>> conversion in that case. Just like every single constant string
>>> assignment to an ansistring in 2.6.x in case the constant string
>>> contains non-ASCII characters and is part of a {$codepage xxx} file
>>> (because those strings are all stored as unicodestring in the
>>> program there).
>> Judging by all the code that I have written during 14 years, there
>> would never be a single conversion necessary. This system would force
>> them on me for every single use.
>>
>> I do not think that the support of both ansi/unicode string resources
>> is such a burden that it justifies that.
>>
>> I admittedly have limited knowledge of compiler internals, but I
>> cannot imagine that being able to store them in 2 formats (ansi and
>> some form of unicode) is more than a matter of maintaining 1 flag per
>> string, and writing a word instead of a byte. All the other code,
>> needed for conversions depending on codepage and whatnot settings, is
>> necessary anyway.
> You forget also the code necessary to translate resourcestrings (at
> runtime). Currently the ResourceString related code inside
> rtl/objpas/objpas.pp only handles AnsiString and then this would need
> to be adjusted so that UnicodeString can also be handled. For example
> there will be the need for a "SetResourceStrings" overload with a
> UnicodeString based TResourceIterator.

No, I had thought of that. It will need to be changed anyhow, and fell under "is necessary anyway", since we'll need some kind of backwards-compatibility mechanism.

Michael.
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
On 05.01.2013 14:16, Michael Van Canneyt wrote:
> On Sat, 5 Jan 2013, Jonas Maebe wrote:
>> On 05 Jan 2013, at 13:10, Michael Van Canneyt wrote:
>>> On Sat, 5 Jan 2013, Jonas Maebe wrote:
>>>> On 05 Jan 2013, at 12:53, Paul Ishenin wrote:
>>>>> ResourceStrings are stored as AnsiString type with 0 codepage (as
>>>>> I remember). Delphi now stores ResourceStrings as UnicodeString
>>>>> type. I think FPC will follow this in m_default_unicodestring
>>>>> modeswitch.
>>>> It would probably even be better to always do that. At least I
>>>> don't see a downside, other than slightly larger binaries (and
>>>> that's not an issue in this case as far as I'm concerned;
>>>> maintaining two separate resourcestring systems/handlers is just
>>>> not worth the trouble).
>>> But it means that for
>>>
>>>   Resourcestring
>>>     AString = 'Something';
>>>
>>>   Var
>>>     S : Ansistring;
>>>
>>>   begin
>>>     S:=AString;
>>>   end.
>>>
>>> Always a conversion will happen. I do not think this is a good idea
>>> given that currently, String = Ansistring.
>> String will always be shortstring or ansistring in the syntax modes
>> in which that is currently the case. And yes, it will involve a
>> conversion in that case. Just like every single constant string
>> assignment to an ansistring in 2.6.x in case the constant string
>> contains non-ASCII characters and is part of a {$codepage xxx} file
>> (because those strings are all stored as unicodestring in the program
>> there).
> Judging by all the code that I have written during 14 years, there
> would never be a single conversion necessary. This system would force
> them on me for every single use.
>
> I do not think that the support of both ansi/unicode string resources
> is such a burden that it justifies that.
>
> I admittedly have limited knowledge of compiler internals, but I
> cannot imagine that being able to store them in 2 formats (ansi and
> some form of unicode) is more than a matter of maintaining 1 flag per
> string, and writing a word instead of a byte. All the other code,
> needed for conversions depending on codepage and whatnot settings, is
> necessary anyway.

You forget also the code necessary to translate resourcestrings (at runtime). Currently the ResourceString related code inside rtl/objpas/objpas.pp only handles AnsiString and then this would need to be adjusted so that UnicodeString can also be handled. For example there will be the need for a "SetResourceStrings" overload with a UnicodeString based TResourceIterator.

Regards,
Sven
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
On Sat, 5 Jan 2013, Jonas Maebe wrote:
> On 05 Jan 2013, at 13:10, Michael Van Canneyt wrote:
>> On Sat, 5 Jan 2013, Jonas Maebe wrote:
>>> On 05 Jan 2013, at 12:53, Paul Ishenin wrote:
>>>> ResourceStrings are stored as AnsiString type with 0 codepage (as I remember). Delphi now stores ResourceStrings as UnicodeString type. I think FPC will follow this in m_default_unicodestring modeswitch.
>>>
>>> It would probably even be better to always do that. At least I don't see a downside, other than slightly larger binaries (and that's not an issue in this case as far as I'm concerned; maintaining two separate resourcestring systems/handlers is just not worth the trouble).
>>
>> But it means that for
>>
>> Resourcestring
>>   AString = 'Something';
>>
>> Var
>>   S : Ansistring;
>>
>> begin
>>   S:=AString;
>> end.
>>
>> Always a conversion will happen.
>>
>> I do not think this is a good idea given that currently, String = Ansistring.
>
> String will always be shortstring or ansistring in the syntax modes in which that is currently the case. And yes, it will involve a conversion in that case. Just like every single constant string assignment to an ansistring in 2.6.x in case the constant string contains non-ASCII characters and is part of a {$codepage xxx} file (because those strings are all stored as unicodestring in the program there).

Judging by all the code that I have written during 14 years, there would never be a single conversion necessary. This system would force them on me for every single use.

I do not think that the support of both ansi/unicode string resources is such a burden that it justifies that. I admittedly have limited knowledge of compiler internals, but I cannot imagine that being able to store them in 2 formats (ansi and some form of unicode) is more than a matter of maintaining 1 flag per string, and writing a word instead of a byte. All the other code, needed for conversions depending on codepage and whatnot settings, is necessary anyway.

Michael.
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
> Yes, the exception is probably UTF-8 on Unix systems, but is that really worth it to complicate the compiler and RTL? Resourcestrings are generally not used in performance-critical code, I'd assume. Always using UTF-8 is however also fine for me, btw.

I do vote for UTF-8.

> I just don't believe it is worth the trouble to support both unicodestring and ansistring resourcestrings.

I agree.
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
On 05 Jan 2013, at 13:33, Martin Schreiber wrote:
> On Saturday 05 January 2013 12:57:44 Jonas Maebe wrote:
>> On 05 Jan 2013, at 12:53, Martin Schreiber wrote:
>>> So compiled with -Fcutf8
>>> "
>>> unicodestringvar:= 'Best'#228'tigung';
>>> "
>>> produces a different result on fixes_2_6 and trunk? I assume in trunk there will be a compile error?
>>
>> No. In both cases it results in a widestring with this content:
>>
>> .short 66,101,115,116,228,116,105,103,117,110,103,0
>>
>> I guess invalid utf-8 values are just copied through by the compiler. As mentioned: absolutely nothing whatsoever changed in how character sequences are interpreted by the compiler in 2.7.x. The explanation you quoted above (and which I deleted) applies to both 2.6.x and 2.7.x. I really don't know how I can say this in another way, and repeating it clearly doesn't help.
>>
>> I think it's best if you compile trunk for yourself and test as many scenarios as you can, because I feel I cannot add anything further to the discussion, and I'm not interested in playing compile bot.
>
> Then it was a misunderstanding again

No, it was simply an omission in my explanation. As mentioned above: "I guess invalid utf-8 values are just copied through by the compiler". It's a special case, but the special case is the same in 2.6.x and 2.7.x (2.6.x converts the UTF-8 string to UTF-16 immediately in the scanner, while 2.7.x does it while processing the assignment; the actual conversion code that's used is however exactly the same).

The fact that everything remains 100% the same in all cases everywhere always between 2.6.x and 2.7.x has been mentioned at least 10 times in this thread, and that's what I keep trying to make clear. But I give up.

Jonas
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
On Saturday 05 January 2013 12:57:44 Jonas Maebe wrote:
> On 05 Jan 2013, at 12:53, Martin Schreiber wrote:
> > So compiled with -Fcutf8
> > "
> > unicodestringvar:= 'Best'#228'tigung';
> > "
> > produces a different result on fixes_2_6 and trunk? I assume in trunk there will be a compile error?
>
> No. In both cases it results in a widestring with this content:
>
> .short 66,101,115,116,228,116,105,103,117,110,103,0
>
> I guess invalid utf-8 values are just copied through by the compiler. As mentioned: absolutely nothing whatsoever changed in how character sequences are interpreted by the compiler in 2.7.x. The explanation you quoted above (and which I deleted) applies to both 2.6.x and 2.7.x. I really don't know how I can say this in another way, and repeating it clearly doesn't help.
>
> I think it's best if you compile trunk for yourself and test as many scenarios as you can, because I feel I cannot add anything further to the discussion, and I'm not interested in playing compile bot.

Then it was a misunderstanding again because I read
"
Alternatively, in both cases you can instead define a unicodestring/widestring constant instead of an ansistring/shortstring constant by embedding widechar constants in the character sequence. Such widechar constants are of the form # with a valid Pascal representation of an integer constant between 255 and 65535.
"
and
"
Whether or not they contain character literals whose value is >#127 in the source code's code page, or explicit "#xx", "#xxx" etc expressions has no influence, nothing changed in the compiler in that account.
"
and
"
I have no idea how anything I wrote suggests that it wouldn't. As mentioned, the only difference is that string constants containing characters >#127 are no longer always converted to unicodestring constants at compile time.
"
--> >#255 <> >#127
and the question arose how one can define "widechar constants" for strings without a character value >255.

Martin
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
On 05 Jan 2013, at 13:10, Michael Van Canneyt wrote:
> On Sat, 5 Jan 2013, Jonas Maebe wrote:
>
>> On 05 Jan 2013, at 12:53, Paul Ishenin wrote:
>>
>>> ResourceStrings are stored as AnsiString type with 0 codepage (as I remember). Delphi now stores ResourceStrings as UnicodeString type. I think FPC will follow this in m_default_unicodestring modeswitch.
>>
>> It would probably even be better to always do that. At least I don't see a downside, other than slightly larger binaries (and that's not an issue in this case as far as I'm concerned; maintaining two separate resourcestring systems/handlers is just not worth the trouble).
>
> But it means that for
>
> Resourcestring
>   AString = 'Something';
>
> Var
>   S : Ansistring;
>
> begin
>   S:=AString;
> end.
>
> Always a conversion will happen.
>
> I do not think this is a good idea given that currently, String = Ansistring.

String will always be shortstring or ansistring in the syntax modes in which that is currently the case. And yes, it will involve a conversion in that case. Just like every single constant string assignment to an ansistring in 2.6.x in case the constant string contains non-ASCII characters and is part of a {$codepage xxx} file (because those strings are all stored as unicodestring in the program there).

Then again, it will also involve a conversion if the implementation using ansistrings is fixed to support non-ASCII resourcestrings and the system codepage is different from the code page in which the resource string has been stored by the compiler. In fact, it will then cause two conversions on most systems (few systems can directly transcode from arbitrary code page X to arbitrary code page Y; most use UTF-16 as intermediate format, although some can probably also use UTF-8).

Yes, the exception is probably UTF-8 on Unix systems, but is that really worth it to complicate the compiler and RTL? Resourcestrings are generally not used in performance-critical code, I'd assume. Always using UTF-8 is however also fine for me, btw. I just don't believe it is worth the trouble to support both unicodestring and ansistring resourcestrings.

Jonas
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
On Sat, 5 Jan 2013, Jonas Maebe wrote:
> On 05 Jan 2013, at 12:53, Paul Ishenin wrote:
>> ResourceStrings are stored as AnsiString type with 0 codepage (as I remember). Delphi now stores ResourceStrings as UnicodeString type. I think FPC will follow this in m_default_unicodestring modeswitch.
>
> It would probably even be better to always do that. At least I don't see a downside, other than slightly larger binaries (and that's not an issue in this case as far as I'm concerned; maintaining two separate resourcestring systems/handlers is just not worth the trouble).

But it means that for

Resourcestring
  AString = 'Something';

Var
  S : Ansistring;

begin
  S:=AString;
end.

Always a conversion will happen.

I do not think this is a good idea given that currently, String = Ansistring.

Michael.
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
On 05 Jan 2013, at 12:53, Paul Ishenin wrote:
> ResourceStrings are stored as AnsiString type with 0 codepage (as I remember). Delphi now stores ResourceStrings as UnicodeString type. I think FPC will follow this in m_default_unicodestring modeswitch.

It would probably even be better to always do that. At least I don't see a downside, other than slightly larger binaries (and that's not an issue in this case as far as I'm concerned; maintaining two separate resourcestring systems/handlers is just not worth the trouble).

Jonas
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
On 05 Jan 2013, at 12:53, Martin Schreiber wrote:
> So compiled with -Fcutf8
> "
> unicodestringvar:= 'Best'#228'tigung';
> "
> produces a different result on fixes_2_6 and trunk? I assume in trunk there will be a compile error?

No. In both cases it results in a widestring with this content:

.short 66,101,115,116,228,116,105,103,117,110,103,0

I guess invalid utf-8 values are just copied through by the compiler. As mentioned: absolutely nothing whatsoever changed in how character sequences are interpreted by the compiler in 2.7.x. The explanation you quoted above (and which I deleted) applies to both 2.6.x and 2.7.x. I really don't know how I can say this in another way, and repeating it clearly doesn't help.

I think it's best if you compile trunk for yourself and test as many scenarios as you can, because I feel I cannot add anything further to the discussion, and I'm not interested in playing compile bot.

Jonas
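
For anyone who wants to check this case on their own compiler, a small test program along the following lines should do. It is only a sketch: the program name and the choice of `{$codepage utf8}` in the source (instead of passing -Fcutf8 on the command line) are assumptions, and the result has not been verified here against either fixes_2_6 or trunk.

```pascal
program codeunits;
{$mode objfpc}{$H+}
{$codepage utf8}   { assumed equivalent of compiling with -Fcutf8 }

var
  u: UnicodeString;
  i: Integer;
begin
  { The constant from the question: pure ASCII source, with #228
    standing in for a-umlaut. }
  u := 'Best'#228'tigung';

  { Print every UTF-16 code unit as a decimal number. }
  for i := 1 to Length(u) do
    Write(Ord(u[i]), ' ');
  WriteLn;
end.
```

If the compiler behaves as described in the reply above, the printed values should match the `.short` list (without the terminating 0), with 228 in the fifth position.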
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
On 05.01.13, 19:40, Jonas Maebe wrote:
> You can put anything in it and it may or may not work depending on the current system code page, but afaik the only thing that is guaranteed to work at this time is plain ASCII.

ResourceStrings are stored as AnsiString type with 0 codepage (as I remember). Delphi now stores ResourceStrings as UnicodeString type. I think FPC will follow this in m_default_unicodestring modeswitch.

Best regards,
Paul Ishenin
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
On Saturday 05 January 2013 12:28:03 Jonas Maebe wrote:
> Alternatively, in both cases you can instead define a unicodestring/widestring constant instead of an ansistring/shortstring constant by embedding widechar constants in the character sequence. Such widechar constants are of the form # with a valid Pascal representation of an integer constant between 255 and 65535. Then you can use those widechars to represent the desired characters as UTF-16 code points. In that case, the entire string will however be parsed as a sequence of UTF-16 code points (because a string is either a sequence of ansichars, or a sequence of widechars; it can never be a mixture of the two), and hence also #1 or #128 appearing in a widestring will be parsed as widechar(#1) and widechar(#128) as opposed to being interpreted according to the current codepage setting.

So compiled with -Fcutf8
"
unicodestringvar:= 'Best'#228'tigung';
"
produces a different result on fixes_2_6 and trunk? I assume in trunk there will be a compile error? We use this form of character constants in MSEgui to have the sources in pure ASCII.

Martin
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
On 05 Jan 2013, at 12:36, Sven Barth wrote:
> On 05.01.2013 12:28, Jonas Maebe wrote:
>>> And again, sorry for the impertinence, how do resource strings fit in the string handling scenario of Free Pascal trunk?
>>
>> Unicode support for resourcestrings is still not available in FPC trunk. They can currently still only be used safely for ASCII content.
>
> What about UTF8 content?

You can put anything in it and it may or may not work depending on the current system code page, but afaik the only thing that is guaranteed to work at this time is plain ASCII.

Jonas
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
On 05.01.2013 12:28, Jonas Maebe wrote:
>> And again, sorry for the impertinence, how do resource strings fit in the string handling scenario of Free Pascal trunk?
>
> Unicode support for resourcestrings is still not available in FPC trunk. They can currently still only be used safely for ASCII content.

What about UTF8 content?

Regards,
Sven
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
On 05 Jan 2013, at 12:12, Martin Schreiber wrote:
> Thank you very much for the detailed explanation. What I could not found in all the answers (probably it is my ignorance of the English language), is, does #n mean a utf16 code unit as in Delphi XE3 or does it denote something other?

It was not in the explanation, because it is something that did not change between 2.6.x and 2.7.x. Whatever you use in 2.6.x will still work in exactly the same way in 2.7.x. The Delphi XE3 behaviour may be added to the {$mode delphiunicode} syntax mode, but has not yet been implemented and will never be applied to existing syntax modes.

> Assume {$codepage utf-8} how should we enter Russian character constants in #n form?

Using whatever #xx#xx or #xx#xx#xx sequence represents the UTF-8 encoding of that character.

> How should we enter Russian character constants in #n form if {$codepage 8859-5} is defined?

Using whatever #xx represents that character in code page 8859-5.

Alternatively, in both cases you can instead define a unicodestring/widestring constant instead of an ansistring/shortstring constant by embedding widechar constants in the character sequence. Such widechar constants are of the form # with a valid Pascal representation of an integer constant between 255 and 65535. Then you can use those widechars to represent the desired characters as UTF-16 code points. In that case, the entire string will however be parsed as a sequence of UTF-16 code points (because a string is either a sequence of ansichars, or a sequence of widechars; it can never be a mixture of the two), and hence also #1 or #128 appearing in a widestring will be parsed as widechar(#1) and widechar(#128) as opposed to being interpreted according to the current codepage setting.

> And again, sorry for the impertinence, how do resource strings fit in the string handling scenario of Free Pascal trunk?

Unicode support for resourcestrings is still not available in FPC trunk. They can currently still only be used safely for ASCII content.

Jonas
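
As a worked instance of the two notations described above, take the Cyrillic capital letter Zhe, U+0416. Its UTF-8 encoding is the byte pair $D0 $96, its ISO 8859-5 code is $B6, and as a UTF-16 code unit it is $0416. The sketch below uses the first and the last of these under `{$codepage utf8}`; the program and variable names are made up for illustration, and what the ansistring holds at run time depends on the compiler version, so no output is claimed.

```pascal
program cyrconst;
{$mode objfpc}{$H+}
{$codepage utf8}
{ In a file declared as code page 8859-5, the same letter would be
  written as the single character constant #$B6 instead. }

var
  a: AnsiString;
  u: UnicodeString;
  i: Integer;
begin
  { U+0416 written as the two bytes of its UTF-8 encoding. }
  a := #$D0#$96;

  { The same letter written as one widechar constant (value > 255),
    which makes the whole constant a widestring constant. }
  u := #$0416;

  Write('ansistring bytes:');
  for i := 1 to Length(a) do
    Write(' ', Ord(a[i]));
  WriteLn;

  Write('unicodestring code units:');
  for i := 1 to Length(u) do
    Write(' ', Ord(u[i]));
  WriteLn;
end.
```

Both forms keep the source file in pure ASCII, which is the reason this notation was asked about in the first place.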
Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)
On Saturday 05 January 2013 11:30:42 Jonas Maebe wrote:
[...]
> For example, I said that basically nothing changed in 2.7.x compared to 2.6.x, except that certain string constants are no longer automatically converted to utf-16 at compile time, and then you ask "Or should we not touch the theme strings and FPC anymore?". Since basically nothing changed except for a few less blind auto-conversions at compile time, why should you no longer be able to touch anything anymore?
>
> Let me repeat: your string constants will be parsed by the compiler into character sequences with exactly the same content in both 2.6.x and 2.7.x (and with content I mean that if they would be converted to the same code page in 2.6.x and in 2.7.x, you would end up with exactly the same binary data). Whether or not they contain character literals whose value is >#127 in the source code's code page, or explicit "#xx", "#xxx" etc expressions has no influence, nothing changed in the compiler in that account.
>
> The *only* difference is that the compiler can now internally represent ansistrings with arbitrary code pages, and as a result the aforementioned character sequences may now be stored internally in the compiler in a different format, and also stored in the program in a different format if that can avoid conversions at run time. In particular, character sequences are no longer all converted immediately/by default/under all circumstances to UTF-16 in case characters >#127 need to be interpreted according to a particular code page (i.e., if a {$codepage xxx} directive is present).
>
> The compiler will now only convert such character sequences to UTF-16, still at compile time (just like it was able to do in 2.6.x), if it is actually assigned to an UTF-16-encoded string, passed to an UTF-16 parameter etc. And the compiler will also convert it to another ansistring code page in case the character sequence appeared in e.g. a file with {$codepage utf-8} and is then assigned to a variable whose type is declared as "type ansistring(850)".

Thank you very much for the detailed explanation. What I could not find in all the answers (probably it is my ignorance of the English language) is: does #n mean a utf16 code unit as in Delphi XE3, or does it denote something else? You write:

> Whether or not they contain character literals whose value is >#127 in the source code's code page, or explicit "#xx", "#xxx" etc expressions has no influence, nothing changed in the compiler in that account.

Assume {$codepage utf-8}: how should we enter Russian character constants in #n form? How should we enter Russian character constants in #n form if {$codepage 8859-5} is defined?

And again, sorry for the impertinence, how do resource strings fit in the string handling scenario of Free Pascal trunk?

Martin
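
The "type ansistring(850)" case from the quoted explanation could be tried with a sketch like the one below. It assumes a compiler that supports code-page-aware ansistrings (trunk at the time of this thread); `TCP850String` and the program name are made-up names, and the behaviour is not verified here. The constant is the UTF-8 encoding of U+00E4 (a-umlaut), whose code in code page 850 is $84 (132).

```pascal
program cp850demo;
{$mode objfpc}{$H+}
{$codepage utf8}

type
  { An ansistring type with a fixed, declared code page. }
  TCP850String = type AnsiString(850);

var
  s: TCP850String;
  i: Integer;
begin
  { #$C3#$A4 is the UTF-8 encoding of U+00E4. }
  s := #$C3#$A4;

  { Report the string's code page and its raw bytes. }
  WriteLn('code page: ', StringCodePage(s));
  Write('bytes:');
  for i := 1 to Length(s) do
    Write(' ', Ord(s[i]));
  WriteLn;
end.
```

If the conversion described in the quoted text takes place, one would expect a single byte with value 132; if the bytes come out as 195 164, the constant was stored without conversion.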