[fpc-devel] ansistrings and widestrings

2005-01-04 Thread peter green
If I do ansistringvar := widestringvar or widestringvar := ansistringvar,
what does the compiler do?

1: use the system's default encoding (if so, obtained from where?)
2: use UTF-8
3: use ISO-8859-1
4: use something else?

Furthermore, if the encoding used is not capable of representing all
Unicode code points, what reduction rules are used in the conversion from
widestring to ansistring?


___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel


Re: [fpc-devel] ansistrings and widestrings

2005-01-04 Thread Alexey Barkovoy
> If I do ansistringvar := widestringvar or widestringvar := ansistringvar,
> what does the compiler do?
> 1: use the system's default encoding (if so, obtained from where?)
> 2: use UTF-8
> 3: use ISO-8859-1
> 4: use something else?
> Furthermore, if the encoding used is not capable of representing all
> Unicode code points, what reduction rules are used in the conversion from
> widestring to ansistring?

The compiler internally uses the Wide2AnsiMoveProc and Ansi2WideMoveProc
functions, which can be reassigned by the user. Currently they just map the
lower 128 characters of one representation to the other. On Windows there are
system functions that can be used for the conversion: MultiByteToWideChar and
WideCharToMultiByte. These functions can take a specified or globally set
code page into account.
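
As a rough illustration of what "mapping the lower 128 chars" means, here is a
minimal sketch. NarrowAscii and WidenAscii are hypothetical names, not the RTL
routines, and substituting '?' for anything outside the 7-bit range is an
assumption; this message does not say what the real defaults do with such
characters.

```pascal
program AsciiFallback;
{$mode objfpc}{$H+}

{ Hypothetical stand-in for a "lower 128 only" wide-to-ansi move.
  The '?' substitution is an assumption made for this sketch. }
procedure NarrowAscii(source: PWideChar; dest: PChar; len: SizeInt);
var
  i: SizeInt;
begin
  for i := 0 to len - 1 do
    if Ord(source[i]) < 128 then
      dest[i] := Chr(Ord(source[i]))
    else
      dest[i] := '?';
end;

{ Hypothetical stand-in for the opposite direction. }
procedure WidenAscii(source: PChar; dest: PWideChar; len: SizeInt);
var
  i: SizeInt;
begin
  for i := 0 to len - 1 do
    if Ord(source[i]) < 128 then
      dest[i] := WideChar(Ord(source[i]))
    else
      dest[i] := '?';
end;

var
  w, back: WideString;
  a: AnsiString;
begin
  SetLength(w, 3);
  w[1] := 'a';
  w[2] := WideChar($00E9);  { e with acute accent, outside 7-bit ASCII }
  w[3] := 'z';

  { Same element count on both sides: one widechar in, one char out. }
  SetLength(a, Length(w));
  NarrowAscii(PWideChar(w), PChar(a), Length(w));
  WriteLn(a);

  SetLength(back, Length(a));
  WidenAscii(PChar(a), PWideChar(back), Length(a));
  WriteLn(Length(back));
end.
```

Note that the caller sizes the destination from the source length, which only
works while one source element always produces exactly one destination element.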


RE: [fpc-devel] ansistrings and widestrings

2005-01-05 Thread peter green
Where are these default versions located in the code?


RE: [fpc-devel] ansistrings and widestrings

2005-01-05 Thread Michael Van Canneyt


On Wed, 5 Jan 2005, peter green wrote:

> Where are these default versions located in the code?
>

In the inc directory of the RTL, in wstrings.inc:

procedure Wide2AnsiMove(source:pwidechar;dest:pchar;len:SizeInt);
procedure Ansi2WideMove(source:pchar;dest:pwidechar;len:SizeInt);

Michael.


RE: [fpc-devel] ansistrings and widestrings

2005-01-05 Thread peter green
I found something slightly worrying in that code:
  @-8  : SizeInt for reference count;
  @-4  : SizeInt for size;
  @    : String + Terminating #0;
A SizeInt isn't always 4 bytes!
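
A small sketch of the concern, assuming the two-field header exactly as the
comment describes it (reference count, then size, both SizeInt). The constant
names are hypothetical. Deriving the offsets from SizeOf(SizeInt) keeps them
correct wherever SizeInt is wider than 4 bytes.

```pascal
program HeaderOffsets;
{$mode objfpc}

const
  { Hypothetical names; the layout is taken from the comment quoted above. }
  LengthOffset   = SizeOf(SizeInt);      { size field sits just before the data }
  RefCountOffset = 2 * SizeOf(SizeInt);  { reference count sits before the size }

begin
  WriteLn('SizeOf(SizeInt)  = ', SizeOf(SizeInt));
  WriteLn('size field at    @-', LengthOffset);
  WriteLn('refcount at      @-', RefCountOffset);
end.
```

On a target where SizeInt is 4 bytes this reports the @-4 and @-8 from the
comment; where SizeInt is 8 bytes the same fields land at @-8 and @-16.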


RE: [fpc-devel] ansistrings and widestrings

2005-01-05 Thread peter green
OK, I see a MAJOR problem with the semantics of those functions.

They assume that one widechar is equivalent to one ansichar (that is, the
source count of widechars will equal the destination count of ansichars, or
the source count of ansichars will equal the destination count of
widechars).

This is simply not the case for many encodings (UTF-8, SJIS and EUC, to name
just a few).
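
To make the mismatch concrete, here is a minimal sketch that counts elements on
both sides of a UTF-16 to UTF-8 conversion. It assumes a reasonably recent FPC
whose system unit provides UTF8Encode; the sample characters are arbitrary.

```pascal
program CountMismatch;
{$mode objfpc}{$H+}

var
  w: WideString;
begin
  SetLength(w, 3);
  w[1] := 'A';              { U+0041: 1 byte in UTF-8 }
  w[2] := WideChar($00E9);  { U+00E9: 2 bytes in UTF-8 }
  w[3] := WideChar($20AC);  { U+20AC: 3 bytes in UTF-8 }

  { The two counts differ, so a single shared "len" cannot describe both. }
  WriteLn('UTF-16 code units: ', Length(w));
  WriteLn('UTF-8 bytes:       ', Length(UTF8Encode(w)));
end.
```

Three widechars become six bytes here, so a conversion routine that receives
only one length has no way to report how much destination space it needs.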


Re: [fpc-devel] ansistrings and widestrings

2005-01-05 Thread Alexey Barkovoy
Well, the functions are called ANSI to Unicode and vice versa. ANSI is always
single byte; by Unicode people usually mean UTF-16, not a multibyte encoding,
and both Delphi and FPC define WideString as a string of double-byte
characters. So semantically the functions do what is required. IMHO, when
assigning a widestring to an ansistring no one should expect a multibyte
encoded result. When you need UTF-8 you should call special functions.


RE: [fpc-devel] ansistrings and widestrings

2005-01-05 Thread peter green
In Windows terminology (which I presume is where the name ansistring comes
from) the Windows code page, which is often referred to in documentation as
the ANSI code page, CAN be multibyte.

http://www.microsoft.com/globaldev/reference/WinCP.mspx

More generally, I believe an ansistring is usually intended to represent text
in the platform's local encoding, whilst a widestring is meant to represent
text in UTF-16.

The platform's local encoding may be a single-byte encoding (iso-8859-?,
windows-125? etc.), it may be a legacy mixed-width encoding (EUC-??, SHIFT-JIS,
BIG5 etc.), or it may be a Unicode transformation format which is a superset
of ASCII (UTF-8).

Now, for dependency reasons I believe that the default conversion functions
should remain a "dumb fallback", BUT I also believe that the function
prototypes should be designed in such a way as to allow the conversion
routines to be replaced with ones that can sensibly handle the local
encoding.

I've created a page on the wiki for this issue at
http://www.freepascal.org/wiki/index.php/Widestrings


Re: [fpc-devel] ansistrings and widestrings

2005-01-05 Thread Alexey Barkovoy
Firstly: I agree that Wide2AnsiMoveProc and Ansi2WideMoveProc should take the
size of the resulting string.

Next: I was wrong about ansistrings. On Windows they (PChars) were used (until
WinNT arrived) in Far East localized versions coupled with multibyte encodings.
So currently, for legacy applications, multibyte encoded character sets are
supported on any WinNT box.

PS. I hope my patch (bug 3451) extending WideString support in the compiler
will finally be applied to CVS, so that we can proceed with RTL modifications
to support more extended ANSI to wide string conversions. ;-)

PPS. AFAIK UTF-8 is not used internally in any OS; it's only used for storing
Unicode text in a more compact form. Web site authors really like it.


RE: [fpc-devel] ansistrings and widestrings

2005-01-05 Thread peter green

> PPS. AFAIK UTF-8 is not used internally in any OS - it's only
> used for storing
> UNICODE text in more compact form - web site authors really like it.

I believe a lot of Linux distros are switching to it for the console, at least
for less common languages. I don't know how GUI stuff on Linux handles text.
The Windows routines for going from UTF-16 to local codesets and back can
also go from UTF-16 to UTF-7 and UTF-8 and back, but I don't think Windows
itself actually makes any real use of those encodings.

UTF-8 is smaller than UTF-16 in some cases, larger in others, and about the
same in still others; it largely depends on which code points dominate the
text. An appropriate national encoding will usually beat both of them
if it can represent the needed code points.

mainly $000000-$00007F  utf-8: 1 byte   utf-16: 2 bytes  utf-32: 4 bytes
mainly $000080-$0007FF  utf-8: 2 bytes  utf-16: 2 bytes  utf-32: 4 bytes
mainly $000800-$00FFFF  utf-8: 3 bytes  utf-16: 2 bytes  utf-32: 4 bytes
mainly $010000-$10FFFF  utf-8: 4 bytes  utf-16: 4 bytes  utf-32: 4 bytes
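
The per-range sizes above can be sketched as three small functions. The
function names and the sample code points are made up for illustration, and
the input is assumed to be a valid code point in the range $0 to $10FFFF.

```pascal
program EncodedSizes;
{$mode objfpc}{$H+}

{ Bytes needed to store one code point in UTF-8. }
function Utf8Bytes(cp: LongInt): Integer;
begin
  if cp <= $7F then
    Result := 1
  else if cp <= $7FF then
    Result := 2
  else if cp <= $FFFF then
    Result := 3
  else
    Result := 4;
end;

{ Bytes needed in UTF-16: one 16-bit unit, or a surrogate pair above $FFFF. }
function Utf16Bytes(cp: LongInt): Integer;
begin
  if cp <= $FFFF then
    Result := 2
  else
    Result := 4;
end;

{ UTF-32 always uses one 32-bit unit, whatever the code point. }
function Utf32Bytes(cp: LongInt): Integer;
begin
  Result := 4;
end;

const
  Samples: array[0..3] of LongInt = ($41, $E9, $20AC, $1D11E);
var
  i: Integer;
begin
  for i := Low(Samples) to High(Samples) do
    WriteLn('$', HexStr(Samples[i], 6),
            '  utf-8: ', Utf8Bytes(Samples[i]),
            '  utf-16: ', Utf16Bytes(Samples[i]),
            '  utf-32: ', Utf32Bytes(Samples[i]));
end.
```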

The net result is that UTF-8 tends to win for largely Latin languages, UTF-16
tends to win for largely ideographic languages, and they are about on a par
for everything else. UTF-32 nearly always loses to both (though it does have
a large spare codespace which can be used for special meanings internal to
the app).

The main advantages of UTF-8 over UTF-16 are:
1: it is a superset of 7-bit ASCII.
2: it's not peppered with 0 bytes.
3: any character can ONLY be represented by 1 byte pattern and that byte
pattern can ONLY represent that character (it can't be a part of another
character).
4: it's easy to resync a badly cut/joined stream (if you cut a UTF-16 stream
in the middle of a character, one of the pieces will be total garbage).
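
Point 4 relies on the fact that UTF-8 continuation bytes always have the bit
pattern 10xxxxxx, so a reader dropped into the middle of a character can skip
forward to the next boundary. A minimal sketch, with hypothetical function
names:

```pascal
program Utf8Resync;
{$mode objfpc}{$H+}

{ True for UTF-8 continuation bytes, which have the form 10xxxxxx. }
function IsContinuation(b: Byte): Boolean;
begin
  Result := (b and $C0) = $80;
end;

{ Returns the first index >= start that is not a continuation byte,
  i.e. the start of the next character (or Length(s) + 1 if none). }
function NextBoundary(const s: AnsiString; start: SizeInt): SizeInt;
begin
  Result := start;
  while (Result <= Length(s)) and IsContinuation(Ord(s[Result])) do
    Inc(Result);
end;

var
  s: AnsiString;
begin
  { The bytes E2 82 AC encode U+20AC; they are followed by a plain 'A'. }
  SetLength(s, 4);
  s[1] := Chr($E2);
  s[2] := Chr($82);
  s[3] := Chr($AC);
  s[4] := 'A';

  { Pretend the stream was cut after the first byte: start reading at 2. }
  WriteLn('next character starts at index ', NextBoundary(s, 2));
end.
```

Only the cut character is lost; everything from the reported index onwards
decodes normally.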

The net result is that most code designed to deal with "ASCII with
extensions" can be fed UTF-8 and it will usually work fine or only require
minimal changes.

I still believe that the best way to handle ansistring<-->widestring
conversion is to use a fallback conversion (either 7-bit ASCII or
ISO-8859-1) by default and then provide units that override the conversion
with versions based on the local charset of the environment or a charset
specified by the application coder. Unfortunately, as I have said, whilst
there is an interface in place for overriding the conversion, it is currently
only usable where the local encoding is single byte rather than mixed width.


RE: [fpc-devel] ansistrings and widestrings

2005-01-05 Thread Peter Vreman
> Now, for dependency reasons I believe that the default conversion functions
> should remain a "dumb fallback", BUT I also believe that the function
> prototypes should be designed in such a way as to allow the conversion
> routines to be replaced with ones that can sensibly handle the local
> encoding.

You are welcome to supply patches that fix the prototypes and new units
that support more encoding/decoding routines.


Re: [fpc-devel] ansistrings and widestrings

2005-01-06 Thread Florian Klaempfl
Peter Vreman wrote:
> You are welcome to supply patches that fix the prototypes and new units
> that support more encoding/decoding routines.

I think we should introduce a class widestringmanager :) Lower, upper,
comparing etc. also need to take care of Unicode encodings.


Re: [fpc-devel] ansistrings and widestrings

2005-01-06 Thread DrDiettrich
peter green wrote:
> 
> OK, I see a MAJOR problem with the semantics of those functions.
> 
> They assume that one widechar is equivalent to one ansichar (that is, the
> source count of widechars will equal the destination count of ansichars, or
> the source count of ansichars will equal the destination count of
> widechars).
> 
> This is simply not the case for many encodings (UTF-8, SJIS and EUC, to name
> just a few).

I came across such problems in another project (CrossPoint). IMO the
best solution is a separation into true fixed-char strings (1, 2, 4?
byte/char), and a true string class for more general encodings. The
string class(es) then also can include proper support for code pages,
MBCS, 7-bit codes, MIME etc.

The only universal international representation for strings is Unicode
(currently 32 bit), that doesn't require any conversions. UTF and other
encodings can save memory, but only at the cost of runtime overhead,
that's why I'd wrap these into classes.

Delphi uses AnsiString for both single and multi byte character strings,
and I'm not sure whether WideChar (as used by Windows) is Unicode-16 or
UTF-16. In international applications (mail!) the handling of such
strings can become a mess, when the assumptions about the encoding of
some string (code page...) don't hold. When consequently records are
used to hold strings together with an indication of the actual encoding,
then a dedicated standard string class would be a better solution.

DoDi


Re: [fpc-devel] ansistrings and widestrings

2005-01-07 Thread Florian Klaempfl
DrDiettrich wrote:
> I came across such problems in another project (CrossPoint). IMO the
> best solution is a separation into true fixed-char strings (1, 2, 4?
> byte/char), and a true string class for more general encodings. The
> string class(es) then also can include proper support for code pages,
> MBCS, 7-bit codes, MIME etc.
>
> The only universal international representation for strings is Unicode
> (currently 32 bit), that doesn't require any conversions.

That's not true. E.g. the German umlauts can be represented by 2 chars
when using UTF-32 (the char and the two dots); the same applies to a lot of
other languages.
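
As a small illustration of the two representations, the sketch below builds
"a with diaeresis" once as the single code point U+00E4 and once as 'a'
followed by the combining diaeresis U+0308, then compares them element by
element. It uses WideString (UTF-16) rather than UTF-32, which makes no
difference for these code points; SameCodeUnits is a hypothetical helper.

```pascal
program UmlautForms;
{$mode objfpc}{$H+}

{ Plain binary comparison: same length and the same code units in order. }
function SameCodeUnits(const a, b: WideString): Boolean;
var
  i: SizeInt;
begin
  Result := Length(a) = Length(b);
  if not Result then
    Exit;
  for i := 1 to Length(a) do
    if a[i] <> b[i] then
      Exit(False);
end;

var
  precomposed, decomposed: WideString;
begin
  SetLength(precomposed, 1);
  precomposed[1] := WideChar($00E4);  { single code point }

  SetLength(decomposed, 2);
  decomposed[1] := 'a';               { base letter }
  decomposed[2] := WideChar($0308);   { combining diaeresis, the "two dots" }

  WriteLn('lengths: ', Length(precomposed), ' and ', Length(decomposed));
  WriteLn('binary equal: ', SameCodeUnits(precomposed, decomposed));
end.
```

Both strings display as the same letter, yet they differ in length and are not
binary equal.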

> UTF and other

UTF-8 is Unicode as well; Unicode is a standard which describes character
mappings and encodings, besides other things.

> encodings can save memory, but only at the cost of runtime overhead,
> that's why I'd wrap these into classes.
>
> Delphi uses AnsiString for both single and multi byte character strings,
> and I'm not sure whether WideChar (as used by Windows) is Unicode-16 or
> UTF-16. In international applications (mail!) the handling of such
> strings can become a mess, when the assumptions about the encoding of
> some string (code page...) don't hold. When consequently records are
> used to hold strings together with an indication of the actual encoding,
> then a dedicated standard string class would be a better solution.

Encoding isn't the main problem; you need dedicated procedures and
functions for Unicode comparison, upper/lower conversion etc. Achieving
this in a platform-independent way is very hard ...


Re: [fpc-devel] ansistrings and widestrings

2005-01-07 Thread DrDiettrich
Florian Klaempfl wrote:

> > The only universal international representation for strings is Unicode
> > (currently 32 bit), that doesn't require any conversions.
> 
> That's not true. E.g. the german umlauts can be represented by 2 chars
> when using UTF-32 (the char and the two dots), same apply to a lot of
> other languages.

Okay, this is where I didn't understand the difference between code
points and whatsoever. For the umlaut and accented cases, doesn't a unique
glyph with a corresponding code exist that could be used in the first place?
In other languages (Arabic...) the glyph may vary with the context; here
I have no idea how to compare such text, but the native writers
(speakers) of such glyphs should know ;-)

> Encoding isn't the main problem; you need dedicated procedures and
> functions for Unicode comparison, upper/lower conversion etc.

Agreed, these will become the string class methods. It may be necessary
to partition Unicode into code pages, with different methods for
comparison etc.

In the worst case, if we cannot find or agree about a so-far unique
representation for text, an "uncomparable" value has to become a valid
result of a comparison.


> Achieving this in a platform-independent way is very hard ...

How so? I agree that the existence of definitely compatible/portable OS
services is not guaranteed here. But when the methods
have to be implemented for platforms that do not have such services at
all, then these implementations can be used on all other platforms as
well.


All in all I'd say that we do not intend to implement a text processing
or translation system. What we can do is to define a string or text
class that contains text in a well defined form, for processing with
all specified methods. The key point is the import of text into an
object of any such class. If no appropriate class has been implemented,
the import is simply impossible. Inside, i.e. between these classes, all
the methods should work, perhaps with graceful "uncomparable" or
"unconvertible" results when somebody insists on using incompletely
implemented classes.
We don't want the impossible, the doable will be sufficient ;-)

DoDi


RE: [fpc-devel] ansistrings and widestrings

2005-01-07 Thread peter green
It should be noted that Pascal classes are really not suited to doing
strings.

To do strings with classes you really need language features which FPC
doesn't have.

Doing strings with non-garbage-collected, heap-based classes would make
something that was as painful to work with as PChars and that was totally
different from any string handling Pascal has seen before.

Just as Pascal doesn't consider two strings with different cases to be equal,
it should probably not consider two strings of Unicode code points to be
equal unless they are binary equivalent.

Conversion between ansistring and widestring should be done by functions
that take one and return the other (use a const param to avoid the implicit
try-finally) so that no limitations are put on how the conversion is done.
These functions should be indirected through procvars so that the default
fallback versions can be replaced by versions supplied by a unit which
provides proper internationalisation.
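
A minimal sketch of that proposal, under stated assumptions: every name below
(TWideToAnsiFunc, WideToAnsiHook, AsciiWideToAnsi, Latin1WideToAnsi) is
hypothetical and none of it is the RTL's actual interface. The hook starts out
pointing at a 7-bit fallback and is then replaced by an ISO-8859-1 version, as
a unit providing internationalisation might do.

```pascal
program ConvHooks;
{$mode objfpc}{$H+}

type
  { Hypothetical hook type: takes one string kind, returns the other. }
  TWideToAnsiFunc = function(const s: WideString): AnsiString;

{ "Dumb fallback": 7-bit ASCII only, everything else becomes '?'. }
function AsciiWideToAnsi(const s: WideString): AnsiString;
var
  i: SizeInt;
begin
  Result := '';
  SetLength(Result, Length(s));
  for i := 1 to Length(s) do
    if Ord(s[i]) < 128 then
      Result[i] := Chr(Ord(s[i]))
    else
      Result[i] := '?';
end;

{ Replacement: ISO-8859-1, where code points below 256 map to one byte. }
function Latin1WideToAnsi(const s: WideString): AnsiString;
var
  i: SizeInt;
begin
  Result := '';
  SetLength(Result, Length(s));
  for i := 1 to Length(s) do
    if Ord(s[i]) < 256 then
      Result[i] := Chr(Ord(s[i]))
    else
      Result[i] := '?';
end;

var
  WideToAnsiHook: TWideToAnsiFunc;
  w: WideString;
  a: AnsiString;
begin
  SetLength(w, 2);
  w[1] := 'a';
  w[2] := WideChar($00E9);

  WideToAnsiHook := @AsciiWideToAnsi;   { default fallback }
  a := WideToAnsiHook(w);
  WriteLn('fallback byte 2: ', Ord(a[2]));

  WideToAnsiHook := @Latin1WideToAnsi;  { installed by a replacement unit }
  a := WideToAnsiHook(w);
  WriteLn('latin-1 byte 2:  ', Ord(a[2]));
end.
```

Because the hook returns the converted string instead of filling a buffer
sized by the caller, a replacement is free to return a result whose length
differs from the input, which is what a mixed-width encoding needs.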



> -Original Message-
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] Behalf Of DrDiettrich
> Sent: 07 January 2005 15:06
> To: FPC developers' list
> Subject: Re: [fpc-devel] ansistrings and widestrings
>
>
> Florian Klaempfl wrote:
>
> > > The only universal international representation for strings is Unicode
> > > (currently 32 bit), that doesn't require any conversions.
> >
> > That's not true. E.g. the german umlauts can be represented by 2 chars
> > when using UTF-32 (the char and the two dots), same apply to a lot of
> > other languages.
>
> Okay, this is where I didn't understand the difference between code
> points and whatsoever. Doesn't in the umlaut and accented case exist a
> unique glyph and according code, that could be used in the first place?
> In other languages (Arabic...) the glyph may vary with the context, here
> I have no idea how to compare such text, but the native writers
> (speakers) of such glyphs should know ;-)
>
> > Encoding isn't the main problem, you need dedicated procecures and
> > functions for unicode comparision, upper/lower conversion etc.
>
> Agreed, these will become the string class methods. It may be necessary
> to partition Unicode into code pages, with different methods for
> comparison etc.
>
> In the worst case, if we cannot find or agree about a so-far unique
> representation for text, an "uncomparable" value has to become a valid
> result of a comparison.
>
>
> > To achive this platfrom independend is very hard ...
>
> How that? I agree that here the existence of definitely
> compatible/portable OS services is not guaranteed. But when the methods
> have to be implemented for platforms that do not have such services at
> all, then these implementations can be used on all other platforms as
> well.
>
>
> All in all I'd say that we do not intend to implement a text processing
> or translation system. What we can do is to define a string or text
> class, that contains text in a well defined form, for processing with
> all specified methods. The key point is the import of text into an
> object of any such class. If no appropriate class has been implemented,
> the import is simply impossible. Inside, i.e. between these classes, all
> the methods should work. Perhaps with graceful "uncomparable" or
> "unconvertible" results, when somebody insists on using incompletely
> implemented classes.
> We don't want the impossible, the doable will be sufficient ;-)
>
> DoDi


Re: [fpc-devel] ansistrings and widestrings

2005-01-07 Thread Alexey Barkovoy
On Friday, January 07, 2005 7:24 PM, peter green wrote:

> it should be noted that pascal classes are really not suited to doing
> strings.
> to do strings with classes you really need language features which fpc
> doesn't have.
> doing strings with non garbage collected heap based classes would make
> something that was as painful to work with as pchars and that was totally
> different from any string handling pascal has seen before.

Yes, classes are not suitable here, but FPC already provides a mechanism to
redefine string handling with Get / SetWideStringManager. This can be extended
or reworked to include short, wide and ansi string comparison routines.

> just as pascal doesn't consider two strings with different cases to be equal
> it should probably not consider two strings of unicode code points to be
> equal unless they are binary equivalent.

But comparison is not only equal / not equal, but also "greater" and "less"!
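
The ordering point can be seen with the plain built-in operators. What follows is only a minimal sketch, assuming the relational operators on AnsiString compare byte values; the program and variable names are made up for illustration.

```pascal
program BinaryOrderSketch;
{$mode objfpc}{$H+}

var
  A, B: AnsiString;
begin
  A := 'Zebra';
  B := 'apple';
  // Byte-wise ordering: 'Z' is #90 and 'a' is #97, so 'Zebra' sorts first,
  // although a dictionary would put 'apple' before 'Zebra'.
  WriteLn(A < B);
  // Equality is unaffected by the ordering rule: these two differ in case.
  WriteLn(A = 'zebra');
end.
```

So a binary "less than" is well defined and cheap, but it is not the order a reader of some natural language would expect; that part would have to come from replaceable, language-aware routines.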


Re: [fpc-devel] ansistrings and widestrings

2005-01-08 Thread DrDiettrich
peter green wrote:
 
> it should be noted that pascal classes are really not suited to doing
> strings.

IMO we should distinguish Strings, as containers, from Text as an
interpretation of data as, ahem, text of some language, in some
encoding, possibly with attributes...

> to do strings with classes you really need language features which fpc
> doesn't have.

Please explain?

> doing strings with non garbage collected heap based classes would make
> something that was as painful to work with as pchars and that was totally
> different from any string handling pascal has seen before.

FPC has reference counted string and array types, so that GC is
available.

> just as pascal doesn't consider two strings with different cases to be equal
> it should probably not consider two strings of unicode code points to be
> equal unless they are binary equivalent.

That's one of the differences between strings and text. All comparable
data types must have associated comparison functions. For numbers and
strings the standard comparison functions are part of the language
(operators), which usually do a simple binary compare. For other data
types such operators can be defined as appropriate. It should be noted
that a comparison for anything but (strict) equality requires
interpretation rules for the data types. E.g. comparing even ordinal
numbers depends on the byte order of the machine, comparing strings
depends on many more attributes, like mappings for upper/lower case.
That's why a programming language, for itself, will supply only
"primitive" string comparisons, that have reasonable restrictions so
that an implementation should be possible for any platform.

> conversion between ansistring and widestring should be done by functions
> that take one and return the other (use a const param to avoid the implicit
> try-finally) so that no limitations are put on how the conversion is done.

This applies to all string handling procedures. A modification of
non-const string parameters opens a can of worms (aliasing...)!

> These functions should be indirected through procvars so that the default
> fallback versions can be replaced by versions supplied by a unit which
> provides proper internationalisation.

(Inter)nationalization goes far beyond any "standard" features. Dealing
with natural languages IMO requires more than only dictionaries and
hard-coded translation rules. Every natural language can have its own
rules for how e.g. the words in a message must be modified or rearranged
when message arguments are to be inserted into the text.

IMO we must distinguish between the handling of Characters, Strings and
Text. For the alphabets (character sets) of natural languages it should
be possible to implement functions to compare and convert characters;
such support often is built into the OS, for selected languages. This is
the level where multibyte characters can come in, so that a single
Character need not fit any fixed-size data type, and the same Character
can have multiple representations - remember your umlaut example?
Nonetheless the rules on the Character level at least are quite well
defined, so that it's possible to implement corresponding standard
procedures for comparison and conversion. Of course these procedures
require parameters like the language and the encoding of the characters,
so that IMO exchangeable and configurable classes are the best containers
for characters.
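
The umlaut case can be made concrete. This is a minimal sketch, assuming that the built-in "=" on WideString compares the stored 16-bit code units and does no normalization; the variable names are illustrative only.

```pascal
program UmlautCompareSketch;
{$mode objfpc}{$H+}

var
  Precomposed, Decomposed: WideString;
begin
  // U+00E4 LATIN SMALL LETTER A WITH DIAERESIS as a single code point
  Precomposed := WideChar($00E4);
  // U+0061 'a' followed by U+0308 COMBINING DIAERESIS
  Decomposed := WideString('a') + WideChar($0308);

  WriteLn(Length(Precomposed));        // one code unit
  WriteLn(Length(Decomposed));         // two code units
  // Same Character for a reader, different contents for the operator.
  WriteLn(Precomposed = Decomposed);
end.
```

Treating both spellings as the same Character would need a comparison routine that knows about these representations, which is the kind of procedure meant above.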

Strings can be considered as arrays of Characters, so that the string
handling procedures can use the character handling procedures.
Everything else that requires more than processing a stream of
individual characters is beyond the scope of standard procedures. Here
it can become problematic when a string just contains words from
different languages, because then an automatic detection of the language
and according rules can not be guaranteed. That's why I hold the
programmer liable for the correct description of whatever he puts into a
string object.

DoDi





Re: [fpc-devel] ansistrings and widestrings

2005-01-09 Thread Marco van de Voort
> peter green wrote:
>  
> > it should be noted that pascal classes are really not suited to doing
> > strings.
> 
> IMO we should distinguish Strings, as containers, from Text as an
> interpretation of data as, ahem, text of some language, in some
> encoding, possibly with attributes...
> 
> > to do strings with classes you really need language features which fpc
> > doesn't have.
> 
> Please explain?
> 
> > doing strings with non garbage collected heap based classes would make
> > something that was as painful to work with as pchars and that was totally
> > different from any string handling pascal has seen before.
> 
> FPC has reference counted string and array types, so that GC is
> available.

Peter probably means that to make custom string types, you need to have a way
to define operations and conversions. In Java and C++ this is possible, afaik:
in C++ because the string is a template, in Java because the compiler manages
the classes.

> IMO we must distinguish between the handling of Characters, Strings and
> Text. For the alphabets (character sets) of natural languages it should
> be possible to implement functions to compare and convert characters;
> such support often is built into the OS, for selected languages. 

That's problem 1: on Unix that part of the OS exists, but is not
standardised. This lack of standardisation is the main reason for avoiding
linking every program to these libs.

> This is the level where multibyte characters can come in, so that just a
> Character can be different from any fixed-size data type, and that the
> same Character can have multiple representations - remember your umlaut
> example? Nonetheless the rules on the Character level at least are quite
> well defined, so that it's possible to implement according standard
> procedures for comparison and conversion.

> Of course these procedures
> require parameters like the language and the encoding of the characters,
> so that IMO exchangeable and configurable classes are the best containers
> for characters.

The problem with string-classes is that you lose all automatism. This
complicates each and every operation where new strings are created from old
ones. This is what Peter was hinting at.
 
Personally, I still think it would be best to have 2 types of widestrings
(UTF8 and UTF16), with automatic conversions between them. GNU is a UTF8
world; Windows typically uses its own encoding that is more UTF16-like.

(UTF32 is rarely used, since afaik it is mostly needed for dead languages and
uncommon writing styles of east Asian languages. Moreover, afaik it doesn't
actually have the often cited advantage of fixed length chars: diacritic
modifiers exist there too. However, since most combinations also have a
formal codepoint, I don't know whether that can be solved, e.g. by merging
them.)




Re: [fpc-devel] ansistrings and widestrings

2005-01-09 Thread Alexey Barkovoy
On Sunday, January 09, 2005 2:53 PM, Marco van de Voort wrote:

> > This is the level where multibyte characters can come in, so that just a
> > Character can be different from any fixed-size data type, and that the
> > same Character can have multiple representations - remember your umlaut
> > example? Nonetheless the rules on the Character level at least are quite
> > well defined, so that it's possible to implement according standard
> > procedures for comparison and conversion.
> >
> > Of course these procedures
> > require parameters like the language and the encoding of the characters,
> > so that IMO exchangeable and configurable classes are the best containers
> > for characters.
>
> The problem with string-classes is that you lose all automatism. This
> complicates each and every operation where new strings are created from old
> ones. This is what Peter was hinting at.

So it seems the best approach here is to leave the compiler-generated code for
equality and comparison as a plain binary comparison of bytes (btw. that's the
way Delphi does it) and introduce a set of string handling functions that are
aware of the language-dependent encoding.

For the current compiler implementation this means changing

Type
 TWide2AnsiMove=procedure(source:pwidechar;dest:pchar;len:SizeInt);
 TAnsi2WideMove=procedure(source:pchar;dest:pwidechar;len:SizeInt);

to

Type
 // Length parameters are numbers of CHARS, not bytes
 TWide2AnsiMove=function(source:pwidechar; srclen:SizeInt;
                         dest:pansichar; destlen:SizeInt): SizeInt;
 TAnsi2WideMove=function(source:pansichar; srclen:SizeInt;
                         dest:pwidechar; destlen:SizeInt): SizeInt;

These functions should return the actual number of characters written to the
output. Returning ZERO should indicate insufficient destination size. On
Windows, WideCharToMultiByte can return the number of characters needed in the
output buffer, but LIBICONV (http://www.gnu.org/software/libiconv/ - a library
suited to all UNIXes) doesn't allow this. So the common solution (if the result
of the conversion will be stored in an AnsiString or WideString) is just to
enlarge the output buffer until TWide2AnsiMove / TAnsi2WideMove returns a
non-zero value.
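
The enlarge-and-retry idea might look roughly like this. It is only a sketch under assumptions: DemoWide2Ansi is a hypothetical stand-in converter (lower 128 characters kept, everything else replaced by '?'), WideToAnsi is an illustrative helper name, and an empty source is handled separately because a return value of zero would otherwise be ambiguous.

```pascal
program GrowBufferSketch;
{$mode objfpc}{$H+}

type
  // Signature from the proposal; lengths are in CHARS, not bytes.
  TWide2AnsiMove = function(source: PWideChar; srclen: SizeInt;
                            dest: PAnsiChar; destlen: SizeInt): SizeInt;

// Hypothetical stand-in converter, not an RTL routine.
// Returns 0 when the destination buffer is too small.
function DemoWide2Ansi(source: PWideChar; srclen: SizeInt;
                       dest: PAnsiChar; destlen: SizeInt): SizeInt;
var
  i: SizeInt;
begin
  if destlen < srclen then
    Exit(0);
  for i := 0 to srclen - 1 do
    if Ord(source[i]) < 128 then
      dest[i] := AnsiChar(Ord(source[i]))
    else
      dest[i] := '?';
  Result := srclen;
end;

// Enlarges the output buffer until the converter reports success.
function WideToAnsi(const S: WideString; Conv: TWide2AnsiMove): AnsiString;
var
  cap, n: SizeInt;
begin
  Result := '';
  if Length(S) = 0 then
    Exit;                  // zero would be ambiguous for empty input
  cap := 2;                // deliberately small first guess
  repeat
    SetLength(Result, cap);
    n := Conv(PWideChar(S), Length(S), PAnsiChar(Result), cap);
    if n = 0 then
      cap := cap * 2;      // too small: grow and retry
  until n > 0;
  SetLength(Result, n);    // trim to the size actually written
end;

var
  W: WideString;
begin
  W := 'gr';
  W := W + WideChar($00FC) + 'n';   // "gruen" spelled with U+00FC
  WriteLn(WideToAnsi(W, @DemoWide2Ansi));
end.
```

Doubling keeps the number of retries small; a converter that can report the required size, as WideCharToMultiByte can, would make the loop unnecessary on that platform.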



RE: [fpc-devel] ansistrings and widestrings

2005-01-09 Thread peter green

> Type
>   // Length parameters are numbers of CHARS, not bytes
>   TWide2AnsiMove=function(source:pwidechar; srclen:SizeInt;
>                           dest:pansichar; destlen:SizeInt): SizeInt;
>   TAnsi2WideMove=function(source:pansichar; srclen:SizeInt;
>                           dest:pwidechar; destlen:SizeInt): SizeInt;
>
> These functions should return actual number of characters in
> output. Returning ZERO should indicate insufficient destination size.

yes these would be workable but they seem to me to be a horrible C-ism

what's wrong with

twidestringtoansistring=procedure(const source : widestring;
                                  var dest : ansistring);
tansistringtowidestring=procedure(const source : ansistring;
                                  var dest : widestring);





Re: [fpc-devel] ansistrings and widestrings

2005-01-10 Thread Alexey Barkovoy
On Sunday, January 09, 2005 11:45 PM, peter green wrote:

> > Type
> >   // Length parameters are numbers of CHARS, not bytes
> >   TWide2AnsiMove=function(source:pwidechar; srclen:SizeInt;
> >                           dest:pansichar; destlen:SizeInt): SizeInt;
> >   TAnsi2WideMove=function(source:pansichar; srclen:SizeInt;
> >                           dest:pwidechar; destlen:SizeInt): SizeInt;
> >
> > These functions should return actual number of characters in
> > output. Returning ZERO should indicate insufficient destination size.
>
> yes these would be workable but they seem to me to be a horrible C-ism
>
> what's wrong with
>
> twidestringtoansistring=procedure(const source : widestring;
>                                   var dest : ansistring);
> tansistringtowidestring=procedure(const source : ansistring;
>                                   var dest : widestring);

Because we need to convert from PChar to WideString, from PWideChar to
AnsiString, from Array [0..xx] of Char to WideString, etc.
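
A small sketch of why the pointer form covers all of these sources. DemoAnsi2Wide is a hypothetical stand-in that widens each byte unchanged (a real routine would honour a code page), and ToWide is an illustrative helper, not an existing RTL function.

```pascal
program PointerSourcesSketch;
{$mode objfpc}{$H+}

uses
  SysUtils;  // for StrLen

// Hypothetical stand-in with the pointer-based shape discussed above.
function DemoAnsi2Wide(source: PAnsiChar; srclen: SizeInt;
                       dest: PWideChar; destlen: SizeInt): SizeInt;
var
  i: SizeInt;
begin
  if destlen < srclen then
    Exit(0);
  for i := 0 to srclen - 1 do
    dest[i] := WideChar(Ord(source[i]));
  Result := srclen;
end;

// One helper on top of the pointer form serves every kind of source.
function ToWide(P: PAnsiChar; Len: SizeInt): WideString;
var
  n: SizeInt;
begin
  Result := '';
  if Len = 0 then
    Exit;
  SetLength(Result, Len);
  n := DemoAnsi2Wide(P, Len, PWideChar(Result), Len);
  SetLength(Result, n);
end;

const
  Buf: array[0..4] of AnsiChar = 'array';
var
  A: AnsiString;
  Z: PAnsiChar;
begin
  A := 'ansi';
  Z := 'pchar';
  WriteLn(Length(ToWide(PAnsiChar(A), Length(A))));   // AnsiString source
  WriteLn(Length(ToWide(Z, StrLen(Z))));              // PChar source
  WriteLn(Length(ToWide(@Buf[0], Length(Buf))));      // char array source
end.
```

A string-to-string signature would force the PChar and array cases to be copied into a temporary AnsiString first, which is the extra step the pointer form avoids.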
