Re: [fpc-devel] Unicode support in RTL - Roadmap

2008-11-24 Thread Michael Schnell


> Your comments are absolutely vague and meaningless.


Sorry, but this has been discussed several times already, so I assumed that
the problems I see are known to the participants in this discussion:


But here is a simple example from a Lazarus project with all options left at
their default settings:


procedure TForm1.Button1Click(Sender: TObject);
var
  sAnsiString: AnsiString;
  sUTF8String: UTF8String;
  sWideString: WideString;
begin
  sAnsiString := 'üu';
  sUTF8String := 'üu';
  sWideString := 'üu';
  Memo1.Lines.Add('1) ' +
    IntToHex(integer(sAnsiString[1]), sizeof(char)*2) + ' ' +
    IntToHex(integer(sAnsiString[2]), sizeof(char)*2) +
    ' should be FC 75');
  Memo1.Lines.Add('2) ' +
    IntToHex(integer(sUTF8String[1]), sizeof(char)*2) + ' ' +
    IntToHex(integer(sUTF8String[2]), sizeof(char)*2) +
    ' should be C3 BC');
  Memo1.Lines.Add('3) ' +
    IntToHex(integer(sWideString[1]), sizeof(WideChar)*2) + ' ' +
    IntToHex(integer(sWideString[2]), sizeof(WideChar)*2) +
    ' should be 00FC 0075');
end;

This results in

1) C3 BC should be FC 75
2) C3 BC should be C3 BC
3) 00C3 00BC should be 00FC 0075



You don't need to tell me why the result is as it is, I do know the 
details, but for me this really is not at all desirable, as any 
newcomer will get hit by this as soon as he tries to do any string handling.


Comment:

1) The type is named ANSIString and so anybody will suppose it in fact 
holds data of this type (ANSI code according to the system's locale) - 
unless you do something else with it in your user program, but obviously 
it does not (with German locale on Windows the ANSI code of ü is $FC ).


2) This in fact is as expected, provided you know that UTF8Strings are
counted in code elements rather than in code points (aka Unicode
characters). But I feel that anybody who does not explicitly use
Unicode will assume they are counted in characters (notwithstanding that
a UTF-8 character type is not defined in FPC). But you can legitimately
claim that anybody who really wants to do Unicode should make himself
familiar with the details of UTF-8.


3) Assigning a string constant to a WideString does not work as
expected. The result is not the UTF-16 representation of the constant the
user wrote.
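
The values expected in comments 2) and 3) can be reproduced without depending on how the source file is encoded. What follows is only a minimal sketch, assuming a plain FPC console program: it builds 'üu' from code point values and converts it with UTF8Encode from the system unit. The program and variable names are made up for illustration.

```pascal
program DumpEncodings;
{$mode objfpc}{$H+}
uses
  SysUtils;

var
  W: WideString;
  U: UTF8String;
  I: Integer;
begin
  // Build 'üu' from code point values, so the source file encoding
  // does not influence the result.
  W := WideChar($00FC);
  W := W + WideChar($0075);
  U := UTF8Encode(W);

  // UTF-16 code units; expected: 00FC 0075
  for I := 1 to Length(W) do
    Write(IntToHex(Ord(W[I]), 4), ' ');
  WriteLn;

  // UTF-8 bytes; expected: C3 BC 75
  for I := 1 to Length(U) do
    Write(IntToHex(Ord(U[I]), 2), ' ');
  WriteLn;
end.
```

The ANSI value $FC from comment 1) is not shown here, because it depends on the locale's code page.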




> Not to mention
> they also don't propose an alternative.
  
In these discussions I provided a lot of suggestions (that might or
might not be sensible), but of course the executive teams (FPC and
Lazarus) themselves need to decide what to do. (The FPC team seems to
intend to introduce strings that dynamically know the encoding they contain.)

> Sorry to be blunt, but so were your comments.
  
Sorry if I sounded blunt. I'm very happy and thankful that there are 
volunteers who dedicate their spare time to make things like FPC and 
Lazarus happen. My ranting was meant to help them improve Lazarus and 
FPC usability.


While the previous Lazarus version's string handling worked as expected
with ANSIString, the new version forces UTF-8 encoding onto the user, even
if he is perfectly happy with the locale-dependent ANSI he is used to.
IMHO this is only harmful (shooing away potential users), as in the
standard situation it does not work exactly like the old ANSIString handling.


-Michael



___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel


Re: [fpc-devel] Unicode support in RTL - Roadmap

2008-11-24 Thread Michael Schnell



> if compiled using *none* utf8 mode.

I did not find a way to set a non-UTF-8 mode in Lazarus, so that I
can just use ANSIString (and WideString) like I did in the previous version.


Did I miss this option?

If it exists, why not make it the default, so that things work for someone
ignoring Unicode?


(But I suppose this is prevented by the UTF-8 API of the LCL and by FPC not
being able to tell ANSIString from UTF8String.)


-Michael (We are going in circles on this issue)


Re: [fpc-devel] Unicode support in RTL - Roadmap

2008-11-24 Thread Michael Schnell




> It works for win32 only for now. Only the system unit is finished. Work
> in progress...

Sounds great so far!

Is there a document on how exactly it is going to work? Will a common
String type get a dynamic encoding specification, or will there be
different String types for the different encoding variants?


-Michael


Re: [fpc-devel] Unicode support in RTL - Roadmap

2008-11-24 Thread Jeff Wormsley


Martin Friebe wrote:
> I must agree with the "FPC cannot do it all automatically" line (as
> much as I regret it, and admit how beautiful it would be if FPC could).
>
> What I mean is:
>
> 1) Any application/program that currently compiles and works (not using
> UTF-8, never mind if ASCII or ANSI) will keep working if compiled
> in *non*-UTF-8 mode.

This is reasonable.  It also implies that perhaps what everyone is
trying to do is impossible.  With plain strings, or Ansi strings, we
have code that works today.  If you change any of those to UTF*, then
code that uses things such as SetLength, Length, stringvar[index],
copy(string, index, count), pos etc. cannot work 100% reliably.

You don't know what the programmer wants when he says stringvar[3].  Does he
mean the third character in the string?  Or the third byte in the memory
array represented by the string (perhaps he was using a string as a
buffer)?  If you assume one or the other, then whenever one element of a
string doesn't equal one byte you'll be wrong half of the time, no matter
which UTF type you are using or what locale you are in.

It almost seems to me that if you want to use UTF strings as
the default, you should either throw errors or at least stern warnings
on any use of Length, SetLength, stringvar[index] et al. and force any
code using them to be rewritten with UTF variants.  It would make more
sense to knowingly say all code using such constructs is broken in a
Unicode environment than to leave it to chance that the way the code now
interprets these constructs is the way the coder originally intended.


I know much of my code would break just using AnsiString as opposed to 
the original counted string.  For me, *any* UTF* version discussed here 
would break it even more.


I don't have any need for Unicode, so feel free to ignore anything I 
say.  But I don't want my code breaking in unpredictable ways, either, 
because the underlying string types change on me behind my back (ie, in 
the RTL/FCL).


Jeff.


Re: [fpc-devel] Unicode support in RTL - Roadmap

2008-11-24 Thread Graeme Geldenhuys
On Mon, Nov 24, 2008 at 3:55 PM, Jeff Wormsley [EMAIL PROTECTED] wrote:
> such as SetLength, Length, stringvar[index], copy(string, index, count), pos
> etc. cannot work 100% reliably.  You don't know what the programmer wants
> when he says stringvar[3].  Does he mean the third character in the string?
> Or the third byte in the memory array represented by the string (perhaps he
> was using a string as a buffer)?

That is why I currently use CharAt(str, i) in my projects and fpGUI -
instead of direct array access. CharAt() handles ANSI and UTF-8
strings perfectly.  Yes it might be slower, but I hardly ever need
character access for the type of applications I am writing. So using
CharAt() once or twice in my application is not a performance problem.


Regards,
  - Graeme -


___
fpGUI - a cross-platform Free Pascal GUI toolkit
http://opensoft.homeip.net/fpgui/


Re: [fpc-devel] Unicode support in RTL - Roadmap

2008-11-24 Thread Michael Schnell


> With plain strings, or Ansi strings, we have code that works today.
> If you change any of those to UTF*, then code that uses things such as
> SetLength, Length, stringvar[index], copy(string, index, count), pos
> etc. cannot work 100% reliably.  You don't know what the programmer
> wants when he says stringvar[3].

That is what the two types ANSIString and UTF8String suggest: if you use
ANSIString, everything works fine as it always did; if you use
UTF8String, you need to take a look at what Unicode handling is all
about. But unfortunately the compiler does not know the difference
between the two types, so it can't do the appropriate conversions where
necessary (e.g. when accessing the LCL, which uses UTF8String) or call the
appropriate functions (e.g. for uppercasing) according to which
type(s) are used in an operation.


-Michael


Re: [fpc-devel] Unicode support in RTL - Roadmap

2008-11-24 Thread Yury Sidorov

From: Michael Schnell [EMAIL PROTECTED]


>> It works for win32 only for now. Only the system unit is finished.
>> Work in progress...
>
> Sounds great so far !
>
> Is there a document on how exactly it is going to work (will a
> common String type get a dynamic coding specification or will there
> be different String types for any coding variants ?


No document is available yet. This branch is still experimental. It
introduces RtlString, a string type which is native to the RTL on the
corresponding target: RtlString=utf16string on Windows,
RtlString=utf8string on Unix, etc.
RtlString can also be ansistring. In this case the RTL will be ANSI only
and 100% compatible with existing ANSI user code.


It is planned to allow users to build unicode or ansi RTL.

Yury. 


Re: [fpc-devel] Unicode support in RTL - Roadmap

2008-11-21 Thread Michael Schnell



> http://wiki.freepascal.org/FPC_Unicode_support#Roadmap_of_RTL_Unicode_support
  


This page does not talk about UTF8Strings being counted in code elements 
vs in code points.


I don't consider it settled that they are always counted in code
elements. IMHO this should be seriously discussed, and a solution should
be found that lets the user select either way, so that he can either
write fast code or avoid breaking old code.


-Michael


Re: [fpc-devel] Unicode support in RTL - Roadmap

2008-11-21 Thread Felipe Monteiro de Carvalho
On Fri, Nov 21, 2008 at 7:30 AM, Michael Schnell [EMAIL PROTECTED] wrote:
> This page does not talk about UTF8Strings being counted in code elements vs
> in code points.
>
> I don't consider it understood that they in any case are counted in code
> elements. IMHO this should be seriously discussed and a solution should be
> found that the user can select either way to be able to do either fast code
> or not break old code.

I prefer it to be counted in bytes. If it is counted in bytes, then I
can build a routine that counts in real chars. And we already have a
lot of code to handle UTF-8 inside ansistring which depends on that.

Counting the elements in real chars is very inefficient.
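
A routine that "counts in real chars" on top of a byte-counted string could look roughly like the sketch below. It is an illustration only: CodePointCount is a hypothetical helper, not an RTL or LCL routine, and it assumes the string holds well-formed UTF-8. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx.

```pascal
program CountCodePoints;
{$mode objfpc}{$H+}

// Hypothetical helper: counts UTF-8 code points by counting every byte
// that is not a continuation byte (10xxxxxx). Assumes well-formed input.
function CodePointCount(const S: AnsiString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  S: AnsiString;
begin
  // 'ü' (U+00FC) written as explicit UTF-8 bytes, followed by ASCII 'u'.
  S := #$C3#$BC'u';
  WriteLn('bytes:       ', Length(S));          // 3 bytes
  WriteLn('code points: ', CodePointCount(S));  // 2 code points
end.
```

Length stays a constant-time lookup, while the code point count has to scan the whole string, which is the efficiency concern raised here.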

-- 
Felipe Monteiro de Carvalho


Re: [fpc-devel] Unicode support in RTL - Roadmap

2008-11-21 Thread Michael Schnell



> I prefer it to be counted in bytes. If it is counted in bytes, then I
> can build a routine that counts in real chars. And we already have a
> lot of code to handle UTF-8 inside ansistring which depends on that.
>
> Counting the elements in real chars is very inefficient.

This is commonly agreed. But counting in code elements breaks old code, and
counting in code points is sometimes more handy. That is why I vote for
making the default syntax (s[i], pos(), copy(), ...) user selectable,
while of course providing dedicated functions for both flavors.


-Michael


Re: [fpc-devel] Unicode support in RTL - Roadmap

2008-11-21 Thread Graeme Geldenhuys
On Fri, Nov 21, 2008 at 11:30 AM, Michael Schnell [EMAIL PROTECTED] wrote:

> http://wiki.freepascal.org/FPC_Unicode_support#Roadmap_of_RTL_Unicode_support
>
> This page does not talk about UTF8Strings being counted in code elements vs
> in code points.

I only added the roadmap section, the rest of the content existed
before. You are welcome to amend the content.


Regards,
  - Graeme -




Re: [fpc-devel] Unicode support in RTL - Roadmap

2008-11-21 Thread Michael Schnell



> I only added the roadmap section, the rest of the content existed
> before. You are welcome to amend the content.
  
I'd rightfully be severely bashed by those who actually will be required 
to do the work ;) .


-Michael


Re: [fpc-devel] Unicode support in RTL - Roadmap

2008-11-21 Thread Sergei Gorelkin

Michael Schnell wrote:



>> I prefer it to be counted in bytes. If it is counted in bytes, then I
>> can build a routine that counts in real chars. And we already have a
>> lot of code to handle UTF-8 inside ansistring which depends on that.
>>
>> Counting the elements in real chars is very inefficient.
>
> This is commonly agreed. But counting in code elements breaks old code, and
> counting in code points is sometimes more handy. That is why I vote for
> making the default syntax (s[i], pos(), copy(), ...) user selectable,
> while of course providing dedicated functions for both flavors.




If Length() returned its value in chars, what length in *bytes*
would the following call set:


SetLength(utfstring_1, Length(utfstring_2));

??
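
To make the question concrete, here is a small sketch, assuming well-formed UTF-8 and a hypothetical CodePointCount helper standing in for a code-point-based Length(). Two strings with the same number of code points need different amounts of storage, so a code point count alone does not say how many bytes to allocate.

```pascal
program SetLengthAmbiguity;
{$mode objfpc}{$H+}

// Hypothetical stand-in for a Length() that counts code points:
// counts every byte that is not a UTF-8 continuation byte (10xxxxxx).
function CodePointCount(const S: AnsiString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  A, B: AnsiString;
begin
  A := 'uu';               // two ASCII code points, 2 bytes
  B := #$C3#$BC#$C3#$BC;   // 'üü' as UTF-8: two code points, 4 bytes
  WriteLn(CodePointCount(A), ' code points in ', Length(A), ' bytes');
  WriteLn(CodePointCount(B), ' code points in ', Length(B), ' bytes');
  // Both report 2 code points. In UTF-8 a code point takes 1 to 4 bytes,
  // so "length 2 in code points" can mean anything from 2 to 8 bytes.
end.
```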

Regards,
Sergei




Re: [fpc-devel] Unicode support in RTL - Roadmap

2008-11-21 Thread Michael Schnell






> If Length() would return its value in chars, what length in *bytes*
> would the following call set:
>
> SetLength(utfstring_1, Length(utfstring_2));


I don't really understand your question.

I think we would need to have two different functions,
UTF8ElementLength(UTF8String) and UTF8PointLength(UTF8String), the first
giving the string length in code elements (bytes) and the second giving the
length in code points (Unicode characters).


So UTF8ElementLength('Ü') would be 2 and UTF8PointLength('Ü') would be 1.

I think we should have a third function Length(UTF8String) that can be
mapped by the user (e.g. via a {$ option) to either of the
two.


The same would be necessary for the SetLength function,

e.g.
(1) UTF8ElementSetLength(utfstring_1, UTF8ElementLength(utfstring_2));
or
(2) UTF8PointSetLength(utfstring_1, UTF8PointLength(utfstring_2));

(2) would work as expected if the purpose is to delete all but the first
n characters in a string.


I don't see a decent use for (1) other than creating a string long
enough to use as a buffer for e.g. TStream.read.


I do see that there in fact is a compatibility problem when porting old
code with the setting UTF8Count=Point.


Here,

SetLength(utfstring_1, Length(utfstring_2)); would be translated as
UTF8PointSetLength(utfstring_1, UTF8PointLength(utfstring_2));

which does not make sense if UTF8PointLength(utfstring_1) is smaller
than UTF8PointLength(utfstring_2).


-Michael


Re: [fpc-devel] Unicode support in RTL - Roadmap

2008-11-21 Thread Jonas Maebe


On 21 Nov 2008, at 14:50, Michael Schnell wrote:

>> If Length() would return its value in chars, what length in *bytes*
>> would the following call set:
>>
>> SetLength(utfstring_1, Length(utfstring_2));
>
> I don't really understand your question.
>
> I think we would need to have two different functions,
> UTF8ElementLength(UTF8String) and UTF8PointLength(UTF8String), the
> first giving the string length in code elements (bytes) and the second
> giving the length in code points (Unicode characters).
>
> So UTF8ElementLength('Ü') would be 2 and UTF8PointLength('Ü') would
> be 1.


Or 2, depending on whether it's precomposed or decomposed.
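
The two representations of 'Ü' can be written out as explicit UTF-8 bytes. The sketch below is only an illustration under the assumption of well-formed input; CodePointCount is a hypothetical helper, not one of the functions proposed in this thread.

```pascal
program ComposedVsDecomposed;
{$mode objfpc}{$H+}

// Hypothetical helper: counts UTF-8 code points by counting every byte
// that is not a continuation byte (10xxxxxx).
function CodePointCount(const S: AnsiString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  Precomposed, Decomposed: AnsiString;
begin
  Precomposed := #$C3#$9C;     // U+00DC LATIN CAPITAL LETTER U WITH DIAERESIS
  Decomposed  := 'U'#$CC#$88;  // U+0055 'U' followed by U+0308 COMBINING DIAERESIS

  // Precomposed: 1 code point in 2 bytes.
  WriteLn(CodePointCount(Precomposed), ' / ', Length(Precomposed));
  // Decomposed: 2 code points in 3 bytes.
  WriteLn(CodePointCount(Decomposed), ' / ', Length(Decomposed));
  // A plain string comparison sees two different byte sequences.
  WriteLn(Precomposed = Decomposed);  // FALSE
end.
```

Both strings render as the same character, yet they differ in byte length, in code point count, and under a plain comparison.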

> I think we should have a third function Length(UTF8String) that can
> be selected by the user (e.g. via a {$ option) to be mapped to either
> of the two.


He's simply talking about the case where Length is mapped to your  
proposed UTF8PointLength.


> I do see that there in fact is a compatibility problem when porting
> old code with the setting UTF8Count=Point.
>
> Here,
>
> SetLength(utfstring_1, Length(utfstring_2)); would be translated as
> UTF8PointSetLength(utfstring_1, UTF8PointLength(utfstring_2));
>
> which does not make sense if UTF8PointLength(utfstring_1) is smaller
> than UTF8PointLength(utfstring_2).



It does not make any sense under any circumstances, because there is  
no way for UTF8PointSetLength to know how many bytes it has to  
allocate when you pass a value (any value, regardless of where it  
comes from) to it.



Jonas


Re: [fpc-devel] Unicode support in RTL - Roadmap

2008-11-21 Thread Sergei Gorelkin

Michael Schnell wrote:


> I don't really understand your question.
>
> I think we would need to have two different functions,
> UTF8ElementLength(UTF8String) and UTF8PointLength(UTF8String), the first
> giving the string length in code elements (bytes) and the second giving the
> length in code points (Unicode characters).
>
> So UTF8ElementLength('Ü') would be 2 and UTF8PointLength('Ü') would be 1.
>
> I think we should have a third function Length(UTF8String) that can be
> selected by the user (e.g. via a {$ option) to be mapped to either of the
> two.
>
> The same would be necessary for the SetLength function,
>
> e.g.
> (1) UTF8ElementSetLength(utfstring_1, UTF8ElementLength(utfstring_2));
> or
> (2) UTF8PointSetLength(utfstring_1, UTF8PointLength(utfstring_2));
>
> (2) would work as expected if the purpose is to delete all but the first
> n characters in a string.
>
> I don't see a decent use for (1) other than creating a string long
> enough to use as a buffer for e.g. TStream.read.
>
> I do see that there in fact is a compatibility problem when porting old
> code with the setting UTF8Count=Point.
>
> Here,
>
> SetLength(utfstring_1, Length(utfstring_2)); would be translated as
> UTF8PointSetLength(utfstring_1, UTF8PointLength(utfstring_2));
>
> which does not make sense if UTF8PointLength(utfstring_1) is smaller
> than UTF8PointLength(utfstring_2).


The SetLength function is used mostly for allocating the storage for the 
new strings. Yes, it can be used for truncating the overlong strings, 
but truncating can be perfectly done with Delete (or UTF8Delete).


As you mentioned yourself, allocating utf-8 strings using length in 
codepoints is senseless. This is exactly what I wanted to say initially.


What follows is that for calls like SetLength(str1, Pos('foo', str2)) 
you also cannot freely change the return value of Pos() from elements to 
codepoints. And so on, and so forth.


Regards,
Sergei


Re: [fpc-devel] Unicode support in RTL - Roadmap

2008-11-21 Thread Michael Schnell


>> So UTF8ElementLength('Ü') would be 2 and UTF8PointLength('Ü') would
>> be 1.
>
> Or 2, depending on whether it's precomposed or decomposed.

I seem to remember that we discussed this some time ago and the result
was that the composed (Mac style?) characters in fact are a single code
point (Unicode character) that consists of two (maybe more?) complete
code points that are tied together by some special coding, so IMHO it
can be considered a single Unicode character in both cases. If this
would result in a huge table of possible composed characters, I think we
should stick to the concept of providing decent functionality and
restrict ourselves to those that are currently used by the customers we
normally address (Mac in Europe and America). A method to supply an
extended composition table should be provided, so that those who
really need it can help themselves.

>> which does not make sense if UTF8PointLength(utfstring_1) is smaller
>> than UTF8PointLength(utfstring_2).
>
> It does not make any sense under any circumstances, because there is
> no way for UTF8PointSetLength to know how many bytes it has to
> allocate when you pass a value (any value, regardless of where it
> comes from) to it.

If UTF8PointLength(utfstring_1) is greater than
UTF8PointLength(utfstring_2), no new bytes need to be allocated and the
function is just equivalent to


utfstring_1 := UTF8PointCopy(utfstring_1, 1, UTF8PointLength(utfstring_2));

To me this does not seem to pose any problem.

-Michael


Re: [fpc-devel] Unicode support in RTL - Roadmap

2008-11-21 Thread Michael Schnell


> you also cannot freely change the return value of Pos() from elements
> to codepoints.

Of course the counting needs to be consistent across all string functions.
So changing it on the fly is dangerous (if you keep a count value in
an integer variable). But this is up to the user.


-Michael



Re: [fpc-devel] Unicode support in RTL - Roadmap

2008-11-21 Thread Jonas Maebe


On 21 Nov 2008, at 16:16, Michael Schnell wrote:

>>> So UTF8ElementLength('Ü') would be 2 and UTF8PointLength('Ü')
>>> would be 1.
>>
>> Or 2, depending on whether it's precomposed or decomposed.
>
> I seem to remember that we discussed this some time ago and the
> result was that the compose (MAC style ?)


Decomposed and precomposed have nothing to do with Windows vs Mac OS X  
vs Linux vs whatever. They are both equally valid ways to represent  
UTF strings and both have their uses (on all platforms). All programs  
should also be prepared to deal with them, since you never know what  
kind of input you will get.


> characters in fact are a single code point (Unicode character) that
> consists of two (maybe more ? ) complete code points that are tied
> together by some special coding, so IMHO it can be considered as a
> single Unicode character in both cases. If this would result in a
> huge table of possibly composed characters I think we would stick to
> the concept of providing a decent functionality and restrict on
> those that are currently used by the customers we normally address
> (Mac in Europe and America).


I think you are talking about a different "we". Further, inventing our
own definitions of what a code point or Unicode character is would be an
extremely bad idea (you'd also have to rename the UTF*Point* routines to
UTF*FPCLikeChar* so they properly indicate the fact that they do not
deal with code points). UTF by itself already has enough variations to
deal with; we will not add our own.


>>> which does not make sense if UTF8PointLength(utfstring_1) is
>>> smaller than UTF8PointLength(utfstring_2).
>>
>> It does not make any sense under any circumstances, because there
>> is no way for UTF8PointSetLength to know how many bytes it has to
>> allocate when you pass a value (any value, regardless of where it
>> comes from) to it.
>
> If UTF8PointLength(utfstring_1) is greater than
> UTF8PointLength(utfstring_2) no new bytes need to be allocated
> but the function is just equivalent to
>
> utfstring1 := UTF8PointCopy(utfstring1, 1,
> UTF8PointLength(utfstring_2));
>
> To me this does not seem to impose any problem.


Except if the point is to reserve exactly enough space for utfstring1  
and to overwrite its contents with something else afterwards (using  
move() or whatever). That's a very common use of setlength (at least  
in the FPC run time library, and I guess elsewhere as well). The fact  
that it also doesn't work if the string has to be made longer is  
basically the same problem.
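
The reserve-then-overwrite pattern mentioned here looks roughly like the following sketch (variable names are made up for illustration). It only works because Length and SetLength agree on a unit that Move also understands, namely bytes.

```pascal
program BufferCopy;
{$mode objfpc}{$H+}

var
  Src, Dest: AnsiString;
begin
  // 'grün' with 'ü' as explicit UTF-8 bytes: 4 characters, 5 bytes.
  Src := 'gr'#$C3#$BC'n';

  // Reserve exactly enough storage, then overwrite it with raw bytes.
  // Length() must report bytes here, since Move() copies bytes.
  SetLength(Dest, Length(Src));
  Move(Src[1], Dest[1], Length(Src));

  WriteLn(Length(Dest));  // 5
  WriteLn(Dest = Src);    // TRUE
end.
```

If Length returned 4 (code points) for Src, the same code would reserve and copy one byte too few and cut the string off in the middle.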


Your system just does not work, and the more examples you give the  
more it falls down, as far as I can see. Please first write a wiki  
page explaining how to deal with all cases, or at least noting which  
cases will not work. Only then it is possible to decide on whether or  
not it is both feasible and worthwhile to go through the trouble of  
implementing all this. Without it, I feel I am mainly wasting my time  
writing these mails because it seems you haven't thought it through  
yet at all.



Jonas


Re: [fpc-devel] Unicode support in RTL - Roadmap

2008-11-21 Thread Michael Schnell
If your point is that there is no way to allow legacy code to be
used with a String type that holds UTF-8, and that it is not
possible (or desirable) to allow simple-case code that remains
understandable to someone who does not want to go into the full
depth of UTF-8, I can totally accept this.


But in that case the normal user just should not use UTF-8, but
WideStrings, which in most European/American projects can be considered
to be UCS-2 encoded. (This is the way that D2009 seems to go.)


With that, of course, the UTF-8 API of the LCL is not at all desirable.

-Michael


Re: [fpc-devel] Unicode support in RTL - Roadmap

2008-11-21 Thread Daniël Mantione



On Fri, 21 Nov 2008, Michael Schnell wrote:

> If your point is that there is no way to allow for legacy code to be used
> with a String type that holds UTF8 code and that it is not possible (or
> desirable) to allow for code used in simple occasions that is understandable
> to someone who does not want to go into the complete depth of the UTF8, I can
> totally accept this.


Legacy code that assumes ASCII can be used in UTF-8. Code that needs to 
deal with higher code points needs to be rewritten and the user must 
understand the full UTF-8 spec. There is no other way to hide this.
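
As an illustration of the first sentence: byte-oriented code that only looks for ASCII characters keeps working on UTF-8 data, because in UTF-8 all bytes of a multi-byte sequence are $80 or higher and so can never be mistaken for an ASCII character. The sketch below uses only Pos, Copy and Length; the sample data and variable names are made up.

```pascal
program AsciiSafeSplit;
{$mode objfpc}{$H+}

var
  Line, Key, Value: AnsiString;
  P: Integer;
begin
  // 'name=Müller' with 'ü' written as explicit UTF-8 bytes.
  Line := 'name=M'#$C3#$BC'ller';

  // Searching for an ASCII delimiter is safe on the raw bytes.
  P := Pos('=', Line);
  Key := Copy(Line, 1, P - 1);
  Value := Copy(Line, P + 1, Length(Line) - P);

  WriteLn(Key);            // name
  WriteLn(Length(Value));  // 7 bytes for the 6 characters of 'Müller'
end.
```

Splitting at the delimiter never cuts a multi-byte sequence in half. Code that indexes or truncates by character position inside Value is the kind that needs rewriting.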


> But in that case the normal user just should not use UTF8 (but WideStrings
> that in most European/American Projects can be considered to be UCS2 coded
> (This is the way that D2009 seems to go).


I agree with your observation.


> With that of course the UTF8 API of LCL is not at all desirable.


LCL had its reasons to go UTF8.

Daniël


Re: [fpc-devel] Unicode support in RTL - Roadmap

2008-11-21 Thread Florian Klaempfl
Folks, before you waste your time again with endless discussions, have
a look at Yury's work on a Unicode RTL, test it and help with patches
and suggestions. It's available in svn at
http://svn.freepascal.org/svn/fpc/branches/unicodertl


Re: [fpc-devel] Unicode support in RTL - Roadmap

2008-11-21 Thread Michael Schnell


> Legacy code that assumes ASCII can be used in UTF-8. Code that needs
> to deal with higher code points needs to be rewritten

That is any program that formerly used (ANSI) String, is now
automatically converted to use UTF-8, and is to be released in
Germany, France



>> With that of course the UTF8 API of LCL is not at all desirable.
>
> LCL had its reasons to go UTF8.

And thus it forces all users to understand the full UTF-8 spec and to
rewrite their programs, even though the old code compiles perfectly and
up to a certain extent seems to work.


This is what I think is not at all desirable :( .

-Michael


Re: [fpc-devel] Unicode support in RTL - Roadmap

2008-11-21 Thread Yury Sidorov

From: Florian Klaempfl [EMAIL PROTECTED]
> Folks, before you waste your time again with endless discussions, have
> a look at Yury's work on a Unicode RTL, test it and help with patches
> and suggestions. It's available in svn at
> http://svn.freepascal.org/svn/fpc/branches/unicodertl


It works for win32 only for now. Only the system unit is finished. Work
in progress...


Yury. 


Re: [fpc-devel] Unicode support in RTL - Roadmap

2008-11-21 Thread Felipe Monteiro de Carvalho
On Fri, Nov 21, 2008 at 2:42 PM, Michael Schnell [EMAIL PROTECTED] wrote:
> And thus forces all users to understand the full UTF-8 spec and to rewrite
> their programs, even though the old code perfectly compiles and up to a
> certain extent seems to work.
>
> This is what I think is not at all desirable :( .

Your comments are absolutely vague and meaningless. Not to mention
they also don't propose an alternative.

Sorry to be blunt, but so were your comments.

-- 
Felipe Monteiro de Carvalho


Re: [fpc-devel] Unicode support in RTL - Roadmap

2008-11-21 Thread Martin Friebe

Felipe Monteiro de Carvalho wrote:

> On Fri, Nov 21, 2008 at 2:42 PM, Michael Schnell [EMAIL PROTECTED] wrote:
>
>> And thus forces all users to understand the full UTF-8 spec and to rewrite
>> their programs, even though the old code perfectly compiles and up to a
>> certain extent seems to work.
>>
>> This is what I think is not at all desirable :( .
>
> Your comments are absolutely vague and meaningless. Not to mention
> they also don't propose an alternative.
>
> Sorry to be blunt, but so were your comments


I must agree with the "FPC cannot do it all automatically" line (as
much as I regret it, and admit how beautiful it would be if FPC could).


What I mean is:

1) Any application/program that currently compiles and works (not using
UTF-8, never mind if ASCII or ANSI) will keep working if compiled
in *non*-UTF-8 mode.


2) If such a program is to be extended to support UTF-8, then decisions
are needed that cannot be made without knowing what the program is doing,
or even in which context within the same program an operation takes place.
Such knowledge is only available to the programmer of the application,
therefore the application must be changed to include these decisions. FPC
simply cannot make them. (And even a {$SWITCH} would not solve the issue.)


An example is the composed and decomposed ü:

- If you edit a text (human-readable text), or search in a text, you
certainly want to handle both representations as equal (a Find
dialog must find both).
- If the same text editor saves the file, it must handle them as not
equal. Assume the user has 2 files named wünsche.txt in the same folder.
The filesystem allows this, because one of them is decomposed and one is
composed. If the user opened a text from the composed version, it
must be written back to the composed version. If the user opened
it from the decomposed version, it must be written back to the decomposed
version. Otherwise a completely unrelated file would simply be
overwritten, and its contents lost. (The same applies if the application
iterates through the directory content and compares file names. So here
the same comparison that would be used by the Find dialog must
behave differently.)


FPC simply cannot know whether a string contains a file name, which must be
kept exactly as it is, or some human-readable text, which
would benefit from normalisation.


If you are going to put a compiler switch in front of each statement to
indicate the needs, you may as well change the statements. There is no
single setting for the whole application, as both of the above examples
occur within a single application.


You could use two different UTF8Strings which behave differently on
decomposed chars (I am *not* proposing this as a solution). But then you
cannot just recompile your app by saying string now means UTF8String
throughout the whole application. You again have to go through all of
the source code and edit the app. So you may as well just go through the
source code and add the appropriate UTF-8 clean-up calls to those parts of
the code that need them.


In the end, switching an application to Unicode means that within the
same app different parts are going to need different handling of Unicode
(where no such difference existed for ASCII/ANSI). And no compiler can
figure out which part will need which behaviour.





Re: [fpc-devel] Unicode support in RTL - Roadmap

2008-11-21 Thread Luiz Americo Pereira Camara

Graeme Geldenhuys wrote:

> Hi,
>
> I have added a Roadmap section in the following wiki page. If you find
> anything missing or not 100% implemented, please add it to the wiki
> page.
>
> http://wiki.freepascal.org/FPC_Unicode_support#Roadmap_of_RTL_Unicode_support



I started a wiki page to list the use cases where developers (fpc
users) are facing problems when dealing with Unicode. This can be useful
to define what programmers are expecting from fpc's Unicode
support. Optionally, suggestions can be made on how fpc could handle each case.


http://wiki.freepascal.org/unicode_use_cases

Luiz