Hi,

2013/7/9 Michael Schnell <mschn...@lumino.de>
> On 07/09/2013 11:02 AM, Noah Silva wrote:
>
>>
>> I convert it to UTF8 before displaying it....
>>
>> Not a good idea.
>
> Well if the console is UTF8....
> The FPC developers are right now busy implementing the new Delphi Strings.
> This _could_ mean that the application programmer can use any encoding
> (such as multiple different ANSI byte-codes, UTF-8, UTF-16, ...), but in
> fact to be 100% Delphi compatible ("nothing less, nothing more"), it seems
> that only UTF-16 will gain full decent support (e.g. class inheritance, in
> TStringList, the Lazarus user API etc.)
>

Using UTF16 for internal string handling is a sensible option. That's what
the Windows API does, and what e.g. SAP's ABAP does. On the other hand, UTF8
is very common in files and when transferring string data via things like
HTTP/XML, so it has to be fully supported one way or another. OS X uses UTF8
as the "local" encoding (so you never have to worry there, except in Java),
and apparently so does GTK2.

To make it simple, UTF8 can be used everywhere that ANSI encodings were
used, because it is ASCII compatible when only ASCII is used. UTF16 and
UTF32 can't easily be substituted because they contain "padding" (zero)
bytes for normal ASCII characters. For things like WideString, this is fine.

The reason UTF16/UTF32 are popular for in-memory variables is that it's easy
to achieve higher performance. For example, with UTF32, there is no need to
"decode" the string to find out what character you are on, how many bytes a
certain character takes up, etc. If you want the 4th character of a string,
you simply go to the 4th 4-byte array element and retrieve the value. UTF16
works the same way if you are dealing with the 99% of characters in use that
take only two bytes (but this leads to bugs, because people usually don't
properly handle the remaining 1%, which take a four-byte surrogate pair).

So you gain processing speed and code simplicity by using UTF16 or UTF32.
You lose out on memory if you are dealing with ASCII data only - which is no
big deal in most cases.
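
A minimal sketch of the points above, written in Python rather than Pascal purely for brevity. The sample strings and the helper name `utf32_char_at` are illustrative assumptions, not anything from FPC or the RTL:

```python
import struct


def utf32_char_at(data: bytes, index: int) -> str:
    """Fetch the index-th code point (0-based) from UTF-32LE bytes by offset alone."""
    (code_point,) = struct.unpack_from("<I", data, index * 4)
    return chr(code_point)


ascii_text = "hello"
mixed_text = "a\u00e9\u65e5\U0001F600z"  # a, e-acute, a CJK character, an emoji, z

# ASCII-only text is byte-for-byte identical in ASCII and UTF-8.
utf8_matches_ascii = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Bytes needed for the same ASCII text in each encoding (no BOM).
sizes = {enc: len(ascii_text.encode(enc))
         for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# UTF-16 stores each ASCII character with a zero "padding" byte.
utf16_ascii = ascii_text.encode("utf-16-le")

# UTF-32: the 4th code point sits at byte offset 3 * 4, no decoding needed.
fourth = utf32_char_at(mixed_text.encode("utf-32-le"), 3)

# UTF-16: the emoji needs two 16-bit units (a surrogate pair), so unit
# indexes and character indexes drift apart after it.
utf16_mixed = mixed_text.encode("utf-16-le")
utf16_units = len(utf16_mixed) // 2
# Naively treating unit 4 as "the 5th character" lands inside the pair.
(naive_unit,) = struct.unpack_from("<H", utf16_mixed, 4 * 2)

print(utf8_matches_ascii, sizes)
print(hex(ord(fourth)), utf16_units, hex(naive_unit))
```

Here the ASCII text takes 5, 10 and 20 bytes in UTF8, UTF16 and UTF32 respectively, and the naive UTF16 lookup of the "5th character" returns 0xDE00, the second half of a surrogate pair, rather than `z`.
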
UTF8 saves memory and is more ASCII compatible, but requires more
decoding/encoding. Since they all represent the same character set, it
doesn't really matter in the end - there is a trade-off either way. If you
are doing mainly I/O, UTF8 is convenient; if you are doing heavy-duty string
processing, UTF32 is convenient. Either way, not supporting one or the other
is simply not an option if you want to be able to write Unicode-compliant
programs. One can be the "main" way used by internal routines, and this is
UTF16 in more operating systems than not.

If FPC used UTF8 for everything and automatically converted it, then calls
to the Win32 API would be slowed down by the conversion, so it makes sense
to use UTF16 on Windows, but... then again, if GTK2 requires UTF8, then you
have the same (but opposite) problem there. Lazarus also doesn't support
only Windows, so we have to think a little wider than Delphi.

Another interesting point is that I have heard no end of complaints about
Delphi's Unicode strategy, so while we want to be compatible, perhaps we
should consider how to do that while possibly avoiding some of the same
pitfalls.

To address your specific points:

1. The Lazarus user API already supports UTF8, so far as I know.
2. TStringList could easily support both, but as long as the conversion
   to/from other code pages (especially UTF8) is automatic, I wouldn't mind.
3. Not sure what class inheritance has to do with this...

-Michael

Thank you,
    Noah Silva

p.s.: Unicode is an area that I know a lot about, so if anyone working on
the RTL needs help testing, let me know...

_______________________________________________
fpc-pascal maillist - fpc-pascal@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-pascal