Re: [lazarus] UTF-8 vs UTF-16 support

2007-10-08 Thread Luca Olivetti

Marco Ciampa wrote:

> On Fri, Oct 05, 2007 at 01:14:23PM +0200, Luca Olivetti wrote:
>
>> [EMAIL PROTECTED] wrote:
>>
>>>> * WideString allows indexed [] accessing individual chars.
>>> This does not seem to be correct. I read that utf16 can be 4 byte long..
>>> Then calculation is needed sometimes...
>> Unless you're dealing with klingon and ancient languages,
>
> Like Chinese? Just a billion people use it...not a real problem at all...
> :-\

I (wrongly) thought that Chinese was in the BMP :-(

>> I think you can assume that for 99.99% of currently spoken languages every
>> character will be exactly 2 bytes long.
>
> Wrong as I said before.
>
>> There's a risk of having some character with more than 2 bytes but it is
>> a small risk.
>> With utf-8 the risk is bigger, so you have always to traverse
>> the string if you need access to a specific character index.
>
> You have to go through the string for UTF-8 and UTF-16 encodings
> so the advantages are at least questionable...

Yes, but my (wrong) premise is that you could assume all characters are
2 bytes wide, so the Nth character would be at N*2 byte.
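
To make the failure of the N*2 premise concrete, here is a minimal Free Pascal sketch. It assumes a WideString holding UTF-16 code units; the helper name Utf16CodePointCount is invented for this illustration and is not an RTL or LCL routine.

```pascal
program Utf16CodePoints;
{$mode objfpc}{$H+}

// Hypothetical helper: counts Unicode code points in a UTF-16 string.
// A high surrogate followed by a low surrogate counts as one code point.
function Utf16CodePointCount(const S: WideString): Integer;
var
  I: Integer;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    // High surrogate D800..DBFF followed by low surrogate DC00..DFFF
    if (Ord(S[I]) >= $D800) and (Ord(S[I]) <= $DBFF) and (I < Length(S)) and
       (Ord(S[I + 1]) >= $DC00) and (Ord(S[I + 1]) <= $DFFF) then
      Inc(I, 2)
    else
      Inc(I);
    Inc(Result);
  end;
end;

var
  S: WideString;
begin
  // 'A' followed by U+20000 (a CJK Extension B ideograph),
  // which UTF-16 stores as the surrogate pair D840 DC00.
  SetLength(S, 3);
  S[1] := WideChar($0041);
  S[2] := WideChar($D840);
  S[3] := WideChar($DC00);
  WriteLn('UTF-16 code units: ', Length(S));
  WriteLn('Code points:       ', Utf16CodePointCount(S));
end.
```

The string has three 16-bit units but only two characters, so S[2] and S[3] are each half of a character and "character N at byte N*2" no longer holds.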


Bye
--
Luca Olivetti
Wetron Automatización S.A. http://www.wetron.es/
Tel. +34 93 5883004  Fax +34 93 5883007

_
To unsubscribe: mail [EMAIL PROTECTED] with
   unsubscribe as the Subject
  archives at http://www.lazarus.freepascal.org/mailarchives


Re: [lazarus] UTF-8 vs UTF-16 support

2007-10-08 Thread Luca Olivetti

Luca Olivetti wrote:

>> You have to go through the string for UTF-8 and UTF-16 encodings so
>> the advantages are at least questionable...
>
> Yes, but my (wrong) premise is that you could assume all characters are
> 2 bytes wide, so the Nth character would be at N*2 byte.


BTW, using strings as arrays of char to get at individual characters is
risky business with UTF-8. Or will string indexing be converted to (pseudo)
properties that (slowly) do the right thing?
I also suppose that the functions in strutils are not UTF-8 aware, so
what should we be using in their place?


Bye
--
Luca Olivetti
Wetron Automatización S.A. http://www.wetron.es/
Tel. +34 93 5883004  Fax +34 93 5883007



Re: [lazarus] UTF-8 vs UTF-16 support

2007-10-08 Thread Razvan Adrian Bogdan
On 10/8/07, Luca Olivetti [EMAIL PROTECTED] wrote:
 Luca Olivetti wrote:

  You have to go through the string for UTF-8 and UTF-16 encodings so
  the advantages are at least questionable...
 
  Yes, but my (wrong) premise is that you could assume all characters are
  2 bytes wide, so the Nth character would be at N*2 byte.

 BTW, using strings as arrays of char to get at individual characters is
 risky business with utf-8. Or will be they converted to (pseudo)
 properties and (slowly) do the (slow) right thing?
 I also suppose that the functions in strutils are not utf-8 aware, so
 what should we be using in its place?

For single character processing UTF-32 (4 bytes) would be nice :). I
think functions to count the UTF-8 chars inside a string and to get each
char would be nice too, maybe even implemented in FPC for UTF8String,
such as Length(utf8string), or indexing utf8string[1] to return the
char (as UTF-32) rather than the byte.

Since FPC uses ANSI strings a lot, and most text is in Latin-1 without
any diacritics, using UTF-8 in Lazarus is a good choice. If the right
functions are provided it can be a great choice, unless apps become too
slow.

Since the web mostly uses UTF-8 to minimize transferred data, and most
databases use it for minimal storage size, it becomes clear that UTF-8 is
a better choice if helper functions exist to assist with its
management.

Razvan



Re: [lazarus] UTF-8 vs UTF-16 support

2007-10-08 Thread Mattias Gärtner
Quoting Razvan Adrian Bogdan [EMAIL PROTECTED]:

 On 10/8/07, Luca Olivetti [EMAIL PROTECTED] wrote:
  Luca Olivetti wrote:
 
   You have to go through the string for UTF-8 and UTF-16 encodings so
   the advantages are at least questionable...
  
   Yes, but my (wrong) premise is that you could assume all characters are
   2 bytes wide, so the Nth character would be at N*2 byte.
 
  BTW, using strings as arrays of char to get at individual characters is
  risky business with utf-8.

It's the same with UTF-16, and with treating UTF-16 as UCS-2. UTF-32 comes
close, but some languages combine characters. (I don't know how relevant that is.)

For most string operations, like computing the byte length or comparing strings
ASCII case-insensitively, UTF-8 is 100% compatible.
Because of how UTF-8 is encoded, you can even start in the middle of a string and
find out whether a byte is the first, second, third or fourth byte of a character.
So existing algorithms don't need to change at all to work with UTF-8. The same
is true for UCS-2 code and UTF-16.
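
The "start in the middle" property can be sketched in a few lines of Free Pascal. This assumes valid UTF-8 stored in an AnsiString; IsUtf8Continuation and Utf8CharStartIndex are names made up for this example, not the LCL routines.

```pascal
program Utf8ByteKind;
{$mode objfpc}{$H+}

// True if B is a UTF-8 continuation byte (bit pattern 10xxxxxx).
function IsUtf8Continuation(B: Byte): Boolean;
begin
  Result := (B and $C0) = $80;
end;

// Hypothetical helper: returns the 1-based byte index at which the
// character containing byte position P starts, by stepping back over
// continuation bytes. Assumes valid UTF-8 and 1 <= P <= Length(S).
function Utf8CharStartIndex(const S: AnsiString; P: Integer): Integer;
begin
  Result := P;
  while (Result > 1) and IsUtf8Continuation(Ord(S[Result])) do
    Dec(Result);
end;

var
  S: AnsiString;
begin
  // 'a', U+20AC EURO SIGN (bytes E2 82 AC), 'b', written as explicit bytes
  S := 'a'#$E2#$82#$AC'b';
  WriteLn(Utf8CharStartIndex(S, 4)); // byte 4 is the last byte of the euro sign: prints 2
  WriteLn(Utf8CharStartIndex(S, 5)); // byte 5 is 'b' itself: prints 5
end.
```

Because lead bytes and continuation bytes have disjoint bit patterns, at most three steps back are needed to resynchronise in valid input.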


  Or will be they converted to (pseudo)
  properties and (slowly) do the (slow) right thing?
  I also suppose that the functions in strutils are not utf-8 aware, so
  what should we be using in its place?

 For single character processing UTF32 (4bytes) would be nice :), i
 think functions to count UTF8 chars inside a string and getting each
 char would be nice too, maybe even implemented in FPC for UTF8string
 such as Length(utf8string) or indexing utf8string[1] to return the
 char not the byte as UTF32.

See lcl/lclproc.pas and search for UTF8.
Some of these functions already exist in the RTL. The others may be moved
there eventually.


 Since FPC uses ANSI strings, a lot and most text is in latin1 without
 any diacritics using UTF8 in Lazarus is a good choice, if the right
 functions are provided it can be a great choice unless apps become too
 slow.

In Lazarus most of the UTF-8 code is in SynEdit. The SynEdit slowdown from ASCII to
UTF-8 was hardly measurable, even when ignoring the fact that 90%-98% of the time
is spent in the widgetset.


 Since the web uses mostly UTF8 for minimizing transferred data and also
 most databases for minimal storage size it becomes clear that UTF8 is
 a better choice if helper functions exist to assist with its
 management.


Mattias



Re: [lazarus] UTF-8 vs UTF-16 support

2007-10-08 Thread Graeme Geldenhuys
On 08/10/2007, Razvan Adrian Bogdan [EMAIL PROTECTED] wrote:
 char would be nice too, maybe even implemented in FPC for UTF8string
 such as Length(utf8string) or indexing utf8string[1] to return the
 char not the byte as UTF32.

In fpGUI I have a few helper functions for UTF-8 strings (Length,
Copy, Delete, Insert, Pos etc). Some of the code I got from LCLProc
unit and some written myself.

Does anybody know how I can access UTF-8 characters via an index? E.g.
MyString[2] returns the string containing the 2nd character. I say
returning a string, because a UTF-8 character can be between 1 and 4
bytes long, so a Char type will not do.


Regards,
  - Graeme -


___
fpGUI - a cross-platform Free Pascal GUI toolkit
http://opensoft.homeip.net/fpgui/



Re: [lazarus] UTF-8 vs UTF-16 support

2007-10-08 Thread Mattias Gärtner
Quoting Graeme Geldenhuys [EMAIL PROTECTED]:

 On 08/10/2007, Razvan Adrian Bogdan [EMAIL PROTECTED] wrote:
  char would be nice too, maybe even implemented in FPC for UTF8string
  such as Length(utf8string) or indexing utf8string[1] to return the
  char not the byte as UTF32.

 In fpGUI I have a few helper functions for UTF-8 strings (Length,
 Copy, Delete, Insert, Pos etc). Some of the code I got from LCLProc
 unit and some written myself.

 Anybody know how I can access UTF-8 characters via an index?   eg;
 MyString[2] returns the string containing the 2nd character. I say
 returning a string, because a UTF-8 character can be between 1-4
 bytes so a Char type will not do.

If you want an array, then you can convert the string to UTF-32 or create an
array of PChar pointing to each character.
If you just want the n-th UTF-8 character, then you can use UTF8CharStart.
If you need the n-th visible character (taking BIDI and combining characters
into account) then you must use functions from the iconv lib or the WinAPI. The
same goes for UTF-16 and UTF-32.
No encoding gives you random access to a Unicode string.
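
As a rough illustration of the second option (the n-th code point, not the n-th visible character), here is a self-contained sketch that does not depend on LCLProc. Utf8SeqLen and Utf8NthCodePoint are hypothetical names, and the code assumes the input is valid UTF-8.

```pascal
program Utf8NthChar;
{$mode objfpc}{$H+}

// Byte length of the UTF-8 sequence that starts with lead byte B.
// Assumes valid UTF-8; a stray continuation byte is treated as length 1.
function Utf8SeqLen(B: Byte): Integer;
begin
  if B < $80 then Result := 1                  // 0xxxxxxx
  else if (B and $E0) = $C0 then Result := 2   // 110xxxxx
  else if (B and $F0) = $E0 then Result := 3   // 1110xxxx
  else if (B and $F8) = $F0 then Result := 4   // 11110xxx
  else Result := 1;
end;

// Hypothetical helper: returns the N-th (1-based) code point of S as a
// string of 1..4 bytes, or '' if S has fewer than N code points.
// This is a linear scan from the start of the string.
function Utf8NthCodePoint(const S: AnsiString; N: Integer): AnsiString;
var
  BytePos, Len: Integer;
begin
  Result := '';
  if N < 1 then
    Exit;
  BytePos := 1;
  while BytePos <= Length(S) do
  begin
    Len := Utf8SeqLen(Ord(S[BytePos]));
    Dec(N);
    if N = 0 then
    begin
      Result := Copy(S, BytePos, Len);
      Exit;
    end;
    Inc(BytePos, Len);
  end;
end;

var
  S: AnsiString;
begin
  // 'a', U+00E9 (bytes C3 A9), 'z'
  S := 'a'#$C3#$A9'z';
  WriteLn('Bytes in 2nd character: ', Length(Utf8NthCodePoint(S, 2)));
  WriteLn('3rd character: ', Utf8NthCodePoint(S, 3));
end.
```

The result is a string rather than a Char, which matches the point made in the question: one UTF-8 character may need up to four bytes.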


Mattias



Re: [lazarus] UTF-8 vs UTF-16 support

2007-10-08 Thread Luca Olivetti

Mattias Gärtner wrote:


For most string operations, like computing the byte length or comparing strings
ASCII case insensitive, UTF-8 is 100% compatible.


But not if you need the length in characters, say when limiting a text to
40 characters and indicating with '..' that the text has been truncated:



if length(s)>40 then s:=copy(s,1,38)+'..';

or maybe faster

if length(s)>40 then
begin
  s[39]:='.';
  s[40]:='.';
  setlength(s,40);
end;

would break with UTF-8 (and with UTF-16 too if you use characters
outside the BMP). There are probably UTF-8 equivalents of the above, but
old habits die hard.
Maybe UTF-32 is better for internal processing, with UTF-8 used only for
input/output and/or interfacing with other systems?


Bye
--
Luca Olivetti
Wetron Automatización S.A. http://www.wetron.es/
Tel. +34 93 5883004  Fax +34 93 5883007



Re: [lazarus] UTF-8 vs UTF-16 support

2007-10-08 Thread Mattias Gärtner
Quoting Luca Olivetti [EMAIL PROTECTED]:

 Mattias Gärtner wrote:

  For most string operations, like computing the byte length or comparing
  strings ASCII case insensitive, UTF-8 is 100% compatible.

 but not if you need char length, say limiting a text to 40 characters
 and indicating there that the text has been truncated with '..':


 if length(s)>40 then s:=copy(s,1,38)+'..';

 or maybe faster

 if length(s)>40 then
 begin
   s[39]:='.';
   s[40]:='.';
   setlength(s,40);
 end;

 would break with utf-8 (and with utf-16 too if you use characters
 outside the bmp). There are probably utf-8 equivalents of the above, but
 old habits die hard

if UTF8Length(s)>40 then s:=UTF8Copy(s,1,38)+'..';
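
For readers without the LCL helpers at hand, the following sketch shows roughly what such a truncation has to do. It is not the LCLProc implementation; Utf8Ellipsize is an invented name, and the code assumes valid UTF-8 and MaxChars >= 2.

```pascal
program Utf8Truncate;
{$mode objfpc}{$H+}

// Hypothetical helper: if S has more than MaxChars code points, keeps the
// first MaxChars-2 of them and appends '..'; otherwise returns S unchanged.
function Utf8Ellipsize(const S: AnsiString; MaxChars: Integer): AnsiString;
var
  I, Count, CutAt: Integer;
begin
  Count := 0;
  CutAt := 0;
  for I := 1 to Length(S) do
    // Every byte that is not a continuation byte (10xxxxxx) starts a code point
    if (Ord(S[I]) and $C0) <> $80 then
    begin
      Inc(Count);
      // Remember the byte index where code point number MaxChars-1 starts
      if Count = MaxChars - 1 then
        CutAt := I;
    end;
  if Count <= MaxChars then
    Result := S
  else
    Result := Copy(S, 1, CutAt - 1) + '..';
end;

begin
  WriteLn(Utf8Ellipsize('abcdef', 4)); // prints ab..
  // 'h', U+00E9 (bytes C3 A9), 'llo': 5 code points in 6 bytes
  WriteLn(Length(Utf8Ellipsize('h'#$C3#$A9'llo', 4)));
end.
```

The cut is always made in front of a lead byte, so a multi-byte character is never split, which is exactly what the byte-indexed copy(s,1,38) cannot guarantee.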


 Maybe for internal processing utf-32 is better and only use utf-8 for
 input/output and/or interface with other systems?

:)

Speed: Depends on what you do: UTF-8, UTF-16, UTF-32
Memory: UTF-8 or UTF-16.
Compatibility: UTF-8 (VCL)
Easy coding: UTF-32

There is no absolute winner.
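
The memory line can be made concrete with a small sketch that computes how many bytes one code point needs in each encoding. The function names are made up for this example; the thresholds follow the standard UTF-8 and UTF-16 encoding ranges.

```pascal
program EncodedSizes;
{$mode objfpc}{$H+}

// Bytes needed to encode code point CP (0..$10FFFF) in UTF-8.
function Utf8Size(CP: Cardinal): Integer;
begin
  if CP < $80 then Result := 1
  else if CP < $800 then Result := 2
  else if CP < $10000 then Result := 3
  else Result := 4;
end;

// Bytes needed in UTF-16: 2 inside the BMP, 4 (a surrogate pair) above it.
function Utf16Size(CP: Cardinal): Integer;
begin
  if CP < $10000 then Result := 2 else Result := 4;
end;

// UTF-32 always uses 4 bytes per code point.
function Utf32Size(CP: Cardinal): Integer;
begin
  Result := 4;
end;

const
  // Latin 'A', U+00E9, a common CJK ideograph, a CJK Extension B ideograph
  Samples: array[0..3] of Cardinal = ($41, $E9, $4E2D, $20000);
var
  I: Integer;
begin
  for I := Low(Samples) to High(Samples) do
    WriteLn('U+', HexStr(LongInt(Samples[I]), 5),
            '  UTF-8: ', Utf8Size(Samples[I]),
            '  UTF-16: ', Utf16Size(Samples[I]),
            '  UTF-32: ', Utf32Size(Samples[I]));
end.
```

ASCII-heavy text is smallest in UTF-8, BMP CJK text is smaller in UTF-16 (2 bytes against 3), and UTF-32 never wins on size.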

Mattias



Re: [lazarus] UTF-8 vs UTF-16 support

2007-10-07 Thread Felipe Monteiro de Carvalho
Hi,

I was surfing Wikipedia and I found a good reason not to use
UCS-2. It seems to be prohibited to distribute software in mainland
China that only partially supports Chinese characters (as is the
case for UCS-2).

Source:

http://en.wikipedia.org/wiki/GB18030

In a move of historic significance for software supporting Unicode,
the PRC decided to mandate support of certain code points outside the
BMP. This means that software can no longer get away with treating
characters as 16 bit fixed width entities (UCS-2). Therefore they must
either process the data in a variable width format (such as UTF-8 or
UTF-16), which are the most common choices, or move to a larger fixed
width format (such as UCS-4 or UTF-32). Microsoft made the change from
UCS-2 to UTF-16 with Windows 2000.

Of course, if you don't plan on distributing software in China, this
is irrelevant, but a general-purpose library needs to take this into
account.

-- 
Felipe Monteiro de Carvalho



Re: [lazarus] UTF-8 vs UTF-16 support

2007-10-07 Thread Marco Ciampa
On Fri, Oct 05, 2007 at 01:14:23PM +0200, Luca Olivetti wrote:
 [EMAIL PROTECTED] wrote:

 * WideString allows indexed [] accessing individual chars.
 This does not seem to be correct. I read that utf16 can be 4 byte long.. 
 Then calculation is needed sometimes...

 Unless you're dealing with klingon and ancient languages, 
Like Chinese? Just a billion people use it...not a real problem at all...
:-\

 I think you can assume that for 99.99% of currently spoken languages every
 character will be exactly 2 bytes long. 
Wrong as I said before.

 There's a risk of having some character with more that 2 bytes but it is 
 a small risk. 
 With utf-8 the risk is bigger, so you have always to traverse 
 the string if you need access to a specific character index.
You have to go through the string for UTF-8 and UTF-16 encodings 
so the advantages are at least questionable... 

ciao

-- 

Marco Ciampa

++
| Linux User  #78271 |
| FSFE fellow   #364 |
++



Re: [lazarus] UTF-8 vs UTF-16 support

2007-10-05 Thread Michael Van Canneyt


On Fri, 5 Oct 2007, Graeme Geldenhuys wrote:

 Hi,
 
 I asked a similar question in the MSEgui newsgroup as well.  What was
 the reason for choosing to support UTF-8 instead of UTF-16?
 
 - Quoted Mattias from 6 months ago  --
 The LCL will support UTF-8 and provide some extra functions for UTF-16,
 because UTF-8 is more compatible to existing pascal programs
 ---   END   --
 
 
 Does this mean UTF-8 was chosen only because it is more compatible
 with existing pascal programs?  Any other reasons?

It uses less memory.

 
 These are the pro points I received for using UTF-16 in MSEgui.
 
 * It is faster to work with UTF-16 (and so WideString) encoded text
 compared to UTF-8.
 * Easier to implement.
 * WideString allows indexed [] accessing individual chars.
 * Has predictable length() value.  (not sure what they meant here)

It means BufferSize = Length*Sizeof(Widechar). 
On UTF-8, you need to calculate it.

Michael.



Re: [lazarus] UTF-8 vs UTF-16 support

2007-10-05 Thread Mattias Gaertner
On Fri, 5 Oct 2007 09:27:59 +0200
Graeme Geldenhuys [EMAIL PROTECTED] wrote:

 Hi,
 
 I asked a similar question in the MSEgui newsgroup as well.  What was
 the reason for choosing to support UTF-8 instead of UTF-16?
 
 - Quoted Mattias from 6 months ago  --
 The LCL will support UTF-8 and provide some extra functions for
 UTF-16, because UTF-8 is more compatible to existing pascal programs
 ---   END   --
 
 
 Does this mean UTF-8 was chosen only because it is more compatible
 with existing pascal programs?  Any other reasons?
 
 These are the pro points I received for using UTF-16 in MSEgui.
 
 * It is faster to work with UTF-16 (and so WideString) encoded text
 compared to UTF-8.
 * Easier to implement.
 * WideString allows indexed [] accessing individual chars.
 * Has predictable length() value.  (not sure what they meant here)

This all assumes UTF-16 has only 2-byte characters, but there are
4-byte characters too.
The above is true for UTF-32.


 * Most widget toolkits and libraries have WideString API's already.
 (Win32, Xft, Xlib etc..)

And all platforms have functions for UTF-8.


The main reason is:
UTF-8 is more compatible with existing Pascal programs, because they use
'string', not widestring.


Mattias



Re: [lazarus] UTF-8 vs UTF-16 support

2007-10-05 Thread Vincent Snijders

Michael Van Canneyt wrote:


On Fri, 5 Oct 2007, Graeme Geldenhuys wrote:


Hi,

I asked a similar question in the MSEgui newsgroup as well.  What was
the reason for choosing to support UTF-8 instead of UTF-16?

- Quoted Mattias from 6 months ago  --
The LCL will support UTF-8 and provide some extra functions for UTF-16,
because UTF-8 is more compatible to existing pascal programs
---   END   --


Does this mean UTF-8 was chosen only because it is more compatible
with existing pascal programs?  Any other reasons?


It uses less memory.


These are the pro points I received for using UTF-16 in MSEgui.

* It is faster to work with UTF-16 (and so WideString) encoded text
compared to UTF-8.
* Easier to implement.
* WideString allows indexed [] accessing individual chars.
* Has predictable length() value.  (not sure what they meant here)


It means BufferSize = Length*Sizeof(Widechar). 
On UTF-8, you need to calculate it.


I think they mean numofchar(widestring) = bytes allocated / 2. For a UTF-8 string
you need to parse it to get the length.


So length(widestring) is an O(1) operation, while length(UTF8String) is an O(n)
operation.
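
A minimal sketch of that difference, assuming valid UTF-8 in an AnsiString. Utf8CodePointCount is a name invented for this example; Length and SizeOf are the ordinary built-ins.

```pascal
program Utf8LengthDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: counts code points by scanning every byte, O(n).
// Each byte that is not a continuation byte (10xxxxxx) starts a code point.
function Utf8CodePointCount(const S: AnsiString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  S: AnsiString;
  W: WideString;
begin
  // 'na', U+00EF (bytes C3 AF), 've': 5 code points stored in 6 bytes
  S := 'na'#$C3#$AF've';
  WriteLn('UTF-8 bytes (O(1)):       ', Length(S));
  WriteLn('UTF-8 code points (O(n)): ', Utf8CodePointCount(S));

  // For a WideString the buffer size follows directly from Length
  SetLength(W, 5);
  WriteLn('UTF-16 buffer bytes:      ', Length(W) * SizeOf(WideChar));
end.
```

Note that the O(1) Length of a WideString counts 16-bit units, so it equals the character count only while the text stays inside the BMP.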

Vincent



Re: [lazarus] UTF-8 vs UTF-16 support

2007-10-05 Thread Paul Ishenin

Graeme Geldenhuys wrote:

Does this mean UTF-8 was chosen only because it is more compatible
with existing pascal programs?  Any other reasons?



Does UTF-16 cover all languages? As far as I know it has problems with Chinese
and/or Japanese, while UTF-8 does not have such problems. Moreover,
most software uses English as its default language, and UTF-8 encoded
English words are still the same as non-encoded English words.


Btw, I don't know of any other advantages.
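
The point about English text can be checked with a small encoder sketch: for code points below $80 the UTF-8 form is the single ASCII byte itself. CodePointToUtf8 is a hypothetical helper written for this illustration (it is not the RTL's UTF8Encode) and assumes CP <= $10FFFF.

```pascal
program Utf8AsciiCompat;
{$mode objfpc}{$H+}

// Hypothetical helper: encodes one code point as a UTF-8 byte sequence.
function CodePointToUtf8(CP: Cardinal): AnsiString;
begin
  if CP < $80 then
  begin
    // ASCII range: the byte is the code point itself
    SetLength(Result, 1);
    Result[1] := AnsiChar(Byte(CP));
  end
  else if CP < $800 then
  begin
    SetLength(Result, 2);
    Result[1] := AnsiChar(Byte($C0 or (CP shr 6)));
    Result[2] := AnsiChar(Byte($80 or (CP and $3F)));
  end
  else if CP < $10000 then
  begin
    SetLength(Result, 3);
    Result[1] := AnsiChar(Byte($E0 or (CP shr 12)));
    Result[2] := AnsiChar(Byte($80 or ((CP shr 6) and $3F)));
    Result[3] := AnsiChar(Byte($80 or (CP and $3F)));
  end
  else
  begin
    SetLength(Result, 4);
    Result[1] := AnsiChar(Byte($F0 or (CP shr 18)));
    Result[2] := AnsiChar(Byte($80 or ((CP shr 12) and $3F)));
    Result[3] := AnsiChar(Byte($80 or ((CP shr 6) and $3F)));
    Result[4] := AnsiChar(Byte($80 or (CP and $3F)));
  end;
end;

begin
  WriteLn(CodePointToUtf8(Ord('A')) = 'A');   // ASCII is unchanged: prints TRUE
  WriteLn(Length(CodePointToUtf8($20AC)));    // euro sign needs 3 bytes: prints 3
end.
```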

Best regards,
Paul Ishenin.



Re: [lazarus] UTF-8 vs UTF-16 support

2007-10-05 Thread Mattias Gaertner
On Fri, 5 Oct 2007 09:36:59 +0200 (CEST)
Michael Van Canneyt [EMAIL PROTECTED] wrote:

 
 
 On Fri, 5 Oct 2007, Graeme Geldenhuys wrote:
 
  Hi,
  
  I asked a similar question in the MSEgui newsgroup as well.  What
  was the reason for choosing to support UTF-8 instead of UTF-16?
  
  - Quoted Mattias from 6 months ago  --
  The LCL will support UTF-8 and provide some extra functions for
  UTF-16, because UTF-8 is more compatible to existing pascal programs
  ---   END   --
  
  
  Does this mean UTF-8 was chosen only because it is more compatible
  with existing pascal programs?  Any other reasons?
 
 It uses less memory.
 
  
  These are the pro points I received for using UTF-16 in MSEgui.
  
  * It is faster to work with UTF-16 (and so WideString) encoded text
  compared to UTF-8.
  * Easier to implement.
  * WideString allows indexed [] accessing individual chars.
  * Has predictable length() value.  (not sure what they meant here)
 
 It means BufferSize = Length*Sizeof(Widechar). 

This works only for 'most' languages, so this trick can only be used
for specific applications.
An LCL interface should support the full encoding, which means it
must calculate the length for UTF-16.


 On UTF-8, you need to calculate it.

@Graeme: google for UTF-8 UTF-16 comparison to find lots of arguments
for both sides.


Mattias



Re: [lazarus] UTF-8 vs UTF-16 support

2007-10-05 Thread Mattias Gaertner
On Fri, 05 Oct 2007 16:00:41 +0800
Paul Ishenin [EMAIL PROTECTED] wrote:

 Graeme Geldenhuys wrote:
  Does this mean UTF-8 was chosen only because it is more compatible
  with existing pascal programs?  Any other reasons?
  
 
 Does UTF-16 cover all languages? As far as I know it has problems with
 Chinese and/or Japanese, while UTF-8 does not have such
 problems. Moreover, most software uses English as its default language.
 UTF-8 encoded English words are still the same as non-encoded English
 words.
 
 Btw, I don't know of any other advantages.

UTF-8, UTF-16 and UTF-32 are just different encodings of the same
Unicode character set.

UTF-16 is often confused with UCS-2, which indeed has only 2-byte
characters and has the widestring advantage (length = number of words), but
at the price that it does not support all characters. That's why M$
switched from UCS-2 to UTF-16 while keeping the W functions, which may be one
of the main reasons for the confusion.
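
What UTF-16 adds over UCS-2 is the surrogate pair. The sketch below shows the standard arithmetic for a code point above the BMP; HighSurrogate and LowSurrogate are names chosen for this example and assume CP is in the range $10000..$10FFFF.

```pascal
program SurrogatePair;
{$mode objfpc}{$H+}

// High (leading) surrogate for a supplementary code point:
// $D800 plus the top 10 bits of (CP - $10000).
function HighSurrogate(CP: Cardinal): Word;
begin
  Result := Word($D800 + ((CP - $10000) shr 10));
end;

// Low (trailing) surrogate: $DC00 plus the bottom 10 bits of (CP - $10000).
function LowSurrogate(CP: Cardinal): Word;
begin
  Result := Word($DC00 + ((CP - $10000) and $3FF));
end;

begin
  // U+20000, the first CJK Extension B ideograph, has no UCS-2 form
  // but is two 16-bit units in UTF-16.
  WriteLn(HexStr(LongInt(HighSurrogate($20000)), 4)); // prints D840
  WriteLn(HexStr(LongInt(LowSurrogate($20000)), 4));  // prints DC00
end.
```

Since both forms travel through the same 16-bit W functions, code written for UCS-2 keeps compiling and only misbehaves once such a pair shows up.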


Mattias



Re: [lazarus] UTF-8 vs UTF-16 support

2007-10-05 Thread ik
On 10/5/07, Mattias Gaertner [EMAIL PROTECTED] wrote:
 On Fri, 05 Oct 2007 16:00:41 +0800
 Paul Ishenin [EMAIL PROTECTED] wrote:

  Graeme Geldenhuys wrote:
   Does this mean UTF-8 was chosen only because it is more compatible
   with existing pascal programs?  Any other reasons?
  
 
  Does UTF-16 cover all languages? As far as I know it has problems with
  Chinese and/or Japanese, while UTF-8 does not have such
  problems. Moreover, most software uses English as its default language.
  UTF-8 encoded English words are still the same as non-encoded English
  words.
 
  Btw, I don't know of any other advantages.

 UTF-8, UTF-16 and UTF-32 are just different encodings for the same
 unicode characterset.

 UTF-16 is often confused with UCS-2, which is indeed only 2-byte
 characters and has the widestring advantage (length=#words). But
 for the price, that it does not support all characters. That's why M$
 switched from UCS-2 to UTF-16 keeping the W functions, which may be one
 of the main reasons for the confusion.

As far as I know the Unicode Consortium no longer supports UCS-2
and recommends that any implementation of that encoding be treated as
UTF-16.

Another issue is that I think UTF-8 does not include all of the symbols
that some languages, such as Korean and Japanese, require, but I'm not
sure.

I believe that all the encodings should be supported and used
as the developers of the software decide,
rather than forcing them to choose a specific encoding.




 Mattias


Ido
-- 
http://ik.homelinux.org/



Re: [lazarus] UTF-8 vs UTF-16 support

2007-10-05 Thread anteusz

Graeme Geldenhuys wrote:

Hi,

I asked a similar question in the MSEgui newsgroup as well.  What was
the reason for choosing to support UTF-8 instead of UTF-16?

- Quoted Mattias from 6 months ago  --
The LCL will support UTF-8 and provide some extra functions for UTF-16,
because UTF-8 is more compatible to existing pascal programs
---   END   --


Does this mean UTF-8 was chosen only because it is more compatible
with existing pascal programs?  Any other reasons?

These are the pro points I received for using UTF-16 in MSEgui.

* It is faster to work with UTF-16 (and so WideString) encoded text
compared to UTF-8.
* Easier to implement.
* WideString allows indexed [] accessing individual chars.
* Has predictable length() value.  (not sure what they meant here)
* Most widget toolkits and libraries have WideString API's already.
(Win32, Xft, Xlib etc..)



Regards,
  - Graeme -


___
fpGUI - a cross-platform Free Pascal GUI toolkit
http://opensoft.homeip.net/fpgui/



  


> * WideString allows indexed [] accessing individual chars.

This does not seem to be correct. I read that a UTF-16 character can be 4 bytes long,
so some calculation is sometimes needed...


Marton Papp



Re: [lazarus] UTF-8 vs UTF-16 support

2007-10-05 Thread Mattias Gaertner
On Fri, 5 Oct 2007 10:45:18 +0200
ik [EMAIL PROTECTED] wrote:

 On 10/5/07, Mattias Gaertner [EMAIL PROTECTED] wrote:
  On Fri, 05 Oct 2007 16:00:41 +0800
  Paul Ishenin [EMAIL PROTECTED] wrote:
 
   Graeme Geldenhuys wrote:
    Does this mean UTF-8 was chosen only because it is more
    compatible with existing pascal programs?  Any other reasons?
   
  
   Does UTF-16 cover all languages? As far as I know it has problems with
   Chinese and/or Japanese, while UTF-8 does not have such
   problems. Moreover, most software uses English as its default
   language. UTF-8 encoded English words are still the same as
   non-encoded English words.
  
   Btw, I don't know of any other advantages.
 
  UTF-8, UTF-16 and UTF-32 are just different encodings for the same
  unicode characterset.
 
  UTF-16 is often confused with UCS-2, which is indeed only 2-byte
  characters and has the widestring advantage (length=#words). But
  for the price, that it does not support all characters. That's why
  M$ switched from UCS-2 to UTF-16 keeping the W functions, which may
  be one of the main reasons for the confusion.
 
 As far as I know the Unicode Consortium no longer supports UCS-2
 and recommends that any implementation of that encoding be treated as
 UTF-16.
 
 Another issue is that I think UTF-8 does not include all of the symbols
 that some languages, such as Korean and Japanese, require, but I'm not
 sure.
 
 I believe that all the encodings should be supported and used
 as the developers of the software decide,
 rather than forcing them to choose a specific encoding.

For compatibility, complexity and usability reasons the LCL should use
only one encoding. For example TControl.Caption is a string on all
platforms. There will be no CaptionW or CaptionA or CaptionUTF32,
because this would confuse more than it would help. Of course
FPC/Laz provides converter functions for those preferring widestring or
UTF-16 or UTF-32.
The LCL consists of visual components, so the speed cost of converting the
strings is hardly measurable against the cost of drawing the Unicode
characters on the screen. OTOH it can matter if you often traverse a
tree with ten thousand nodes.
Looking at the Lazarus code, choosing UTF-8 as the LCL encoding was a
good choice, because the multibyte encoding only matters in
SynEdit and the LCL interfaces. With UTF-16, additional conversions
would be needed for all text file operations including codetools, which
would slow things down a lot.


Mattias



Re: [lazarus] UTF-8 vs UTF-16 support

2007-10-05 Thread Felipe Monteiro de Carvalho
On 10/5/07, Luca Olivetti [EMAIL PROTECTED] wrote:
 Unless you're dealing with klingon and ancient languages, I think you
 can assume that for 99.99% of currently spoken languages every character
 will be exactly 2 bytes long.

You are forgetting about Chinese. Some billion people speak it =) And
you can't represent all Chinese characters with UCS-2.

-- 
Felipe Monteiro de Carvalho



Re: [lazarus] UTF-8 vs UTF-16 support

2007-10-05 Thread Luca Olivetti

[EMAIL PROTECTED] wrote:


* WideString allows indexed [] accessing individual chars.

This does not seem to be correct. I read that utf16 can be 4 byte long.. 
Then calculation is needed sometimes...


Unless you're dealing with Klingon and ancient languages, I think you
can assume that for 99.99% of currently spoken languages every character
will be exactly 2 bytes long. There's a risk of having some character
with more than 2 bytes but it is a small risk.
With UTF-8 the risk is bigger, so you always have to traverse the string
if you need access to a specific character index.


--
Luca
