Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Marco van de Voort
In our previous episode, Felipe Monteiro de Carvalho said:
 On Mon, Dec 1, 2008 at 7:33 PM, Martin Friebe [EMAIL PROTECTED] wrote:
  I suggested to have a rtl, that has overloaded functions for each string
  type.
  of course that sounds easier than in fact it will be.
 
 This is about the same as having all string routines in 3 flavours:
 RTLString, utf-8 and utf-16
 
 the utf-8 and utf-16 could be done by assigning rtlstring to the adequate 
 type.
 
 I think this is probably what we will end up with, because users of a
 particular encoding will build convenience routines for their favorite
 RTL routines.

Yes, for the core routines. It is nuts to make stuff like scandatetime in
two different encodings.
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel


Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Michael Schnell


I never suggested the RTL to be in a fixed encoding. I fully agree 
that this would be far worse.
I suppose there are (quite decently workable) solutions for this. Either 
the RTL (and LCL, FWIW) comes in multiple versions that are used as 
appropriate (user selectable and/or automatically selected), or a string 
type is used that knows about its internal coding and conversions are 
dynamically done when appropriate.


-Michael


Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Michael Schnell



This is about the same as having all string routines in 3 flavours:
RTLString, utf-8 and utf-16
  
What about (real) ANSIString (OS/locale based coded) ? This needs to be 
allowed as the program might need to read such files.


-Michael


Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Florian Klaempfl
Felipe Monteiro de Carvalho wrote:
 On Mon, Dec 1, 2008 at 8:27 PM, Mattias Gaertner
 [EMAIL PROTECTED] wrote:
 I don't see, how a TLCLStrings will *not* break Delphi and Lazarus
 compatibility. Maybe you can give some more details, how it should work.
 
 It was just an initial idea. I now see that TStrings could be improved.
 

Maybe we should make such classes simply generics: using wrappers as the
map and list class already do, the size impact shouldn't be that big.


Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Michael Schnell



For me, these attempts to make compiler do everything automatically sound
like getting yet another typing saver. 
Maybe I am just being lazy, but it's not a typing saver but regarding 
the previous not-Unicode aware versions it's more a preventer of a 
typing enhancer :) .


OTOH it's not just the typing but to work with commonly used things that 
just work in other programming systems (including previous versions of 
FPC/Lazarus)  - like doing a case of a character type - the user 
programmer needs to learn about the internal encoding of Unicode text. I 
think this should be avoided. Pascal has been a great language for 
programming newcomers up till now. Simple things - like characters and 
strings - should just work (unless you explicitly need extended handling).


I don't suggest that there is a simple solution for this (other than not 
doing Unicode at all) but it's worth discussing.


-Michael


Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Michael Schnell
I now understand that  GB2312  and JIS 0213 in fact are the ANSI code 
pages  936 and 932.


-Michael


Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Mattias Gärtner
Quoting Michael Schnell [EMAIL PROTECTED]:


  For me, these attempts to make compiler do everything automatically sound
  like getting yet another typing saver.
 Maybe I am just being lazy, but it's not a typing saver but regarding
 the previous not-Unicode aware versions it's more a preventer of a
 typing enhancer :) .

 OTOH it's not just the typing but to work with commonly used things that
 just work in other programming systems (including previous versions of
 FPC/Lazarus)  - like doing a case of a character type -

... and some things that just don't work like i18n.


 the user
 programmer needs to learn about the internal encoding of Unicode text. I
 think this should be avoided.

Tell the unicode consortium. My guess: they know already.


 Pascal has been a great language for
 programming newcomers up till now. Simple things - like characters and
 strings - should just work (unless you explicitly need extended handling).

 I don't suggest that there is a simple solution for this (other than not
 doing Unicode at all) but it's worth discussing.

IMHO it has been already discussed too often.


Mattias




Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Michael Schnell



IMHO it has been already discussed too often.
  
I did not start it and only 1% of the contributions are mine - and yours 
-, so quite obviously there is a decent common wish for a solution of 
what is perceived as a problem.


-Michael



Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Michael Schnell



avoids automatic
conversion between types as much as possible.
  
I feel that it's a goody of a strongly typed language that automatic 
type conversions can be done by creating the appropriate code statically 
instead of having this embedded in the objects as with variants.


If doing a simple assignment a := b; types are either converted 
appropriately or a compiler error is generated.


All integer and real types are converted automatically. If you try to do 
myInteger := myString; you get a compiler error. But if you do 
myANSIString := myUTF8String; the compiler generates an assignment 
without a conversion, even though the types are provided by the system 
(not by the user) and named according to the possible internal coding. 
(We don't need to discuss why this is like that, the discussion is about 
if it should stay this way.)
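
[Editorial illustration of the point above. A minimal sketch, assuming a compiler contemporary with this thread (FPC 2.2.x), where UTF8String is declared as "type AnsiString" and a plain assignment copies the bytes unchanged. Utf8ToAnsi is the explicit conversion function from the System unit; the program name and the sample data are made up for the example. Newer compilers with code-page-aware strings may already convert on assignment, and on Unix-like systems a widestring manager (for instance the cwstring unit) is, as far as I know, needed for real conversions, so no output is claimed here.]

```pascal
program AssignVsConvert;
{$mode objfpc}{$H+}

var
  myUTF8String: UTF8String;
  myANSIString: AnsiString;
begin
  // "ä" written as its two UTF-8 bytes, so the source file encoding plays no role.
  myUTF8String := #$C3#$A4;

  // Plain assignment: on the compilers discussed in this thread the bytes are
  // copied as they are, so the ANSI-named variable ends up holding UTF-8 data.
  myANSIString := myUTF8String;
  WriteLn('after assignment: ', Length(myANSIString), ' byte(s)');

  // Explicit conversion to the system encoding (System unit function).
  myANSIString := Utf8ToAnsi(myUTF8String);
  WriteLn('after Utf8ToAnsi: ', Length(myANSIString), ' byte(s)');
end.
```

Comparing the two byte counts on a system with a single-byte locale is one way to see whether a conversion actually took place.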


-Michael


Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Felipe Monteiro de Carvalho
The type is called ansistring simply for backwards compatibility.

You could start arguing that everything should be intuitive. Take C
for example. What does the  operator tell you about what it does?
Shouldn't it have an intuitive form? But in the end this is how the
language is and this is a useless discussion.

I don't think that all C programmers will rewrite their code any more
than Pascal programmers will rewrite theirs so that you can find a
better name for ansistring.

-- 
Felipe Monteiro de Carvalho


Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Michael Schnell



Don't forget that the ansistring type is actually multiple encodings and
even multi byte (even not considering UTF-8). The point is: nobody took
care of it.
  
IMHO a major confusion is generated by calling a string that is supposed 
to hold UTF8 data ANSIString. This never should have happened ! If the 
Unicode support requires that there are strings that hold ANSI code and 
those that hold UTF8 code they should be denoted correctly as ANSIString 
and UTF8String. Storing UTF8 in an ANSIString is a sin :). IMHO, not 
providing automatic conversion between these types is a major shortcoming 
of the compiler/RTL, and if it does not do so, it should not provide the 
types. (Which does not mean that providing the (best possible) automatic 
conversion between these types solves all problems!)


-Michael


Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Florian Klaempfl
Mattias Gärtner wrote:
 Quoting Florian Klaempfl [EMAIL PROTECTED]:
 
 Mattias Gaertner wrote:
 You can optimize for one encoding or optimize for one per platform. I
 know how to optimize for widestrings, for ansistring and for UTF-8
 strings, but I have no experience in optimizing for multiple
 encodings.
 Don't forget that the ansistring type is actually multiple encodings and
 even multi byte (even not considering UTF-8). The point is: nobody took
 care of it.
 
 Yes, they did. They ran their programs only on systems with ansi encoded 
 strings
 or simply passed the strings unchanged.
 That's why the Lazarus solution even works with broken UTF-8 strings.
 But now a lot of implicit conversions will be added so all strings must have
 valid encodings. You can no longer pass unknown encoded strings through the
 functions.

First, there will be a bytestring type that is not converted. Secondly,
I am rather sure we will find ways to cut these conversions down as much
as possible.


Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Michael Schnell



I meant more that a lot of people simply ignored in their code that
ansistrings could be also multibyte even not considering UTF-8.
  
Ignoring that ANSI characters > $7F are locale-dependent makes a program 
work perfectly in a single country and mostly decently in many others.


Ignoring that UTF-8 code points can be coded in two code elements in an 
ANSIString makes a program work only in countries that use just ASCII, 
thus not at all in Europe.


-Michael




Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Daniël Mantione


On Tue, 2 Dec 2008, Michael Schnell wrote:




Nobody talks in this case about UTF-8. Even *ANSIstrings* in their
native meaning can contain multi byte chars, there are *multi byte* ansi
char sets.

If there is a widely used multi-byte ANSI encoding, why do we need Unicode?

IMHO the introduction of Unicode has been necessary as (like you suggested) 
multi-byte ANSI encoding was commonly ignored nearly completely and there 
never has been _compiler_ support for them.


What compiler support should be necessary to handle e.g. EUC-JP? You want 
a variable of type char to contain the JIS-0213 coordinates?


Unicode, and in particular UTF-8, has not taken off either because 
languages got support for it. In fact, the most common language, C, has no 
string support at all.


One reason Unicode has taken off is document exchange, which in 
the internet age got very common. Another reason is the growing importance 
of the Far East, developers want therefore better support for the Far East 
languages, but note this Unicode motivation exists mainly for Western 
software developers.


Daniël


Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Florian Klaempfl
Michael Schnell wrote:
 
 Don't forget that the ansistring type is actually multiple encodings and
 even multi byte (even not considering UTF-8). The point is: nobody took
 care of it.
   
 IMHO a major confusion is generated by calling a string that is supposed
 to hold UTF8 data ANSIString. This never should have happened ! 

Nobody talks in this case about UTF-8. Even *ANSIstrings* in their
native meaning can contain multi byte chars, there are *multi byte* ansi
char sets. However, everybody codes 1 char=1 byte when using ansistrings
which is plainly wrong. Guess why Delphi has functions like
CharToByteIndex or NextCharIndex.
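
[Editorial illustration of why "1 char = 1 byte" is wrong. A minimal sketch, assuming the string holds valid UTF-8; Utf8CodePointCount is a hypothetical helper written for this example and is not the Delphi CharToByteIndex or NextCharIndex. It counts every byte that is not a continuation byte of the form 10xxxxxx.]

```pascal
program Utf8Count;
{$mode objfpc}{$H+}

// Hypothetical helper: counts code points in valid UTF-8 data by counting
// all bytes that are NOT continuation bytes (bit pattern 10xxxxxx).
function Utf8CodePointCount(const S: AnsiString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  S: AnsiString;
begin
  // "ÜTF8" with the first letter given as its two UTF-8 bytes ($C3 $9C).
  S := #$C3#$9C'TF8';
  WriteLn('Bytes: ', Length(S));                    // 5
  WriteLn('Code points: ', Utf8CodePointCount(S));  // 4
end.
```

Code that indexes S[1] expecting the first letter gets only the lead byte $C3, which is the kind of mistake the paragraph above describes.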


Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Michael Schnell



So, really? What is not supported?
  

If just ignoring the fact is enough support, OK, it's supported :).

... tell this 1+ Billion (Billion=10^9 in this case) people in China.
  
I did not know (or suppose) that code used for Chinese characters is 
called ANSI (American National Standards Institute).


I supposed one of the main intentions for the move to Unicode was the 
ability to support Chinese above all. So they did not seem to have been 
content with what was done before.


Is anybody from China here to offer the footage ?

-Michael


Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Sergei Gorelkin

Michael Schnell wrote:



The more I think about it the more I like this solution. I think it's
better than the previous idea of a string with encode information
inside it.
  

Would Lazarus be able to follow ?

Do you think it's possible to have the compiler take care of any 
necessary conversions automatically ?



For me, these attempts to make the compiler do everything automatically sound
like getting yet another typing saver. The situation is already dangerously
close to "write once, debug forever". Recently (after the Lazarus 0.9.26
release) I encountered some cases where a trivial function call resulted in a
couple of conversions being inserted silently, and the resulting outcome could
be explained only by tracing it or looking at the assembler code.
Another example is issue #11327. Initially I perceived it as a code 
generation issue, but after digging in it was clear that it's caused by 
first choosing an incorrect overloaded function, then inlining it, then 
attempting to optimize. There are already at least 17 overloaded Pos() 
functions, and the compiler simply gets lost between them. That issue is 
likely to be fixed, but the fix will be for the consequences of the 
problem, not for its origin.


Making the conversions automatic does not make the language clean, 
instead it hides the potential errors and author's intentions. Moreover, 
it forces anyone to (implicitly) use these conversions, even those who 
don't need it.


A notable fact is also that, despite all this endless talk about the lack 
of Unicode support in the compiler, nearly all well-known Unicode 
processing software is written in languages that have no built-in 
support, not only for Unicode but for strings themselves.


Regards,
Sergei


Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Florian Klaempfl
Michael Schnell wrote:
 
 Nobody talks in this case about UTF-8. Even *ANSIstrings* in their
 native meaning can contain multi byte chars, there are *multi byte* ansi
 char sets.
 If there is a widely used multi-byte ANSI encoding, why do we need
 Unicode?
 
 IMHO the introduction of Unicode has been necessary as (like you
 suggested) multi-byte ANSI encoding was commonly ignored nearly
 completely and there never has been _compiler_ support for them.

So, really? What is not supported?

 Thus
 IMHO it's quite appropriate to apply the name ANSI only to the 1-byte ANSI
 code versions 

... tell this 1+ Billion (Billion=10^9 in this case) people in China.

 (to be able to tell them technically from Unicode, the
 compiler support of which is discussed right here).


Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Michael Schnell

Felipe Monteiro de Carvalho wrote:

Ignore the name ansi. Take it as a string type with the system
encoding. I think it will solve the confusion.
  
Of course if you ignore ANSI and just use the type named String 
there is no confusion as it's clear that the coding is not predefined.


That is exactly what I wanted to say: If you don't use it for ANSI coded 
information don't name the type ANSIString. As FPC provides the type 
ANSIString out of the box it should be used appropriately, and thus any 
new user will suppose that there is support for conversion between this 
type and other string types that explicitly are called differently 
according to their suggested internal coding (such as UTF8String).


If it does not work that way this calls for major confusion.




Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Florian Klaempfl
Michael Schnell wrote:
 
 I meant more that a lot of people simply ignored in their code that
 ansistrings could be also multibyte even not considering UTF-8.
   
 Ignoring that ANSI characters > $7F are locale-dependent makes a program
 work perfectly in a single country and mostly decently in many others.

So it works in far east with its multi byte ansi encodings?


Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Michael Schnell



The point is: if everybody takes care of the fact that ansistrings can
be multibyte, having utf-8 in ansistrings (if it's the locale encoding),
is no big deal at all.
  

I do understand. But (in the real world) do you know anybody who does?

If it were appropriate for ANSI code handling to take care of 
multi-byte encoding, we would not need locale-based code tables and, in 
effect, Unicode would not have been invented.


-Michael


Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Florian Klaempfl
Mattias Gaertner wrote:
 You can optimize for one encoding or optimize for one per platform. I
 know how to optimize for widestrings, for ansistring and for UTF-8
 strings, but I have no experience in optimizing for multiple
 encodings. 

Don't forget that the ansistring type is actually multiple encodings and
even multi byte (even not considering UTF-8). The point is: nobody took
care of it.


Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Florian Klaempfl
Michael Schnell wrote:
 
 So, really? What is not supported?
   
 If just ignoring the fact is enough support, OK, it's supported :).

What FUD is this? Please give an example where the FPC compiler doesn't
handle multi byte ansistrings properly.

Or do you just want to troll around? This problem can be solved ...


Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Mattias Gärtner
Quoting Florian Klaempfl [EMAIL PROTECTED]:

 Mattias Gaertner wrote:
  You can optimize for one encoding or optimize for one per platform. I
  know how to optimize for widestrings, for ansistring and for UTF-8
  strings, but I have no experience in optimizing for multiple
  encodings.

 Don't forget that the ansistring type is actually multiple encodings and
 even multi byte (even not considering UTF-8). The point is: nobody took
 care of it.

Yes, they did. They ran their programs only on systems with ansi encoded strings
or simply passed the strings unchanged.
That's why the Lazarus solution even works with broken UTF-8 strings.
But now a lot of implicit conversions will be added so all strings must have
valid encodings. You can no longer pass unknown encoded strings through the
functions.


Mattias



Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Michael Schnell



Nobody talks in this case about UTF-8. Even *ANSIstrings* in their
native meaning can contain multi byte chars, there are *multi byte* ansi
char sets.
If there is a widely used multi-byte ANSI encoding, why do we need 
Unicode?


IMHO the introduction of Unicode has been necessary as (like you 
suggested) multi-byte ANSI encoding was commonly ignored nearly 
completely and there never has been _compiler_ support for them. Thus 
IMHO it's quite appropriate to apply the name ANSI only to the 1-byte ANSI 
code versions (to be able to tell them technically from Unicode, the 
compiler support of which is discussed right here).


-Michael


Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Michael Schnell



If just ignoring the fact is enough support, OK, it's supported :).



What FUD is this? Please give an example where the FPC compiler doesn't
handle multi byte ansistrings properly.
  
Sorry for bad language :( !  I did not mean to be aggressive.  (Did you 
see the smile indicator ?)


I did not suggest it handles this wrong in any way, but I just don't see 
in what way there might be any explicit compiler support for multi-byte 
ANSI. (You did mention the RTL function provided.)


I understand that there never has been a discussion on if there should 
be any explicit compiler support for multi-byte ANSI, but this thread 
_is_ a discussion on explicit compiler support for Unicode.


-Michael


Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Michael Schnell



It simply needs no explicit support except what it has already. Mainly
the rtl and the user program has to take care of it and we did this
already in the rtl but the compiler required no fix in this regard so far.
  

I do see your point !

But my point is that with the introduction of Unicode, compiler support 
for handling of these things is introduced (and the RTL and the LCL). I 
think this should result in making user code largely unnecessary, and not 
in requiring those programmers who did not need multi-byte support for 
serving the users they want to deploy their software to, to finally 
start introducing multi-byte handling in their user-program code.


IMHO, if at all possible, a new version of a program should make life 
easier for the majority of the actual users and avoid making life more 
complicated for those who do not willingly decide that they need the 
complexity.


-Michael


Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Florian Klaempfl
Michael Schnell wrote:
 
 If just ignoring the fact is enough support, OK, it's supported :).
 

 What FUD is this? Please give an example where the FPC compiler doesn't
 handle multi byte ansistrings properly.
   
 Sorry for bad language :( !  I did not mean to be aggressive.  (Did you
 see the smile indicator ?)
 
 I did not suggest it handles this wrong in any way, but I just don't see
 in what way there might be any explicit compiler support for multi-byte
 ANSI. (You did mention the RTL function provided.)

It simply needs no explicit support except what it has already. Mainly
the rtl and the user program has to take care of it and we did this
already in the rtl but the compiler required no fix in this regard so far.


Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Daniël Mantione



On Tue, 2 Dec 2008, Michael Schnell wrote:


Thanks for pointing this out.


GB2312 suits them well. Likewise, JIS 0213 suits the Japanese well. 

Are these called ANSI ?


Yes, code page 936 and code page 932 are valid ANSI code pages.

The standards themselves of course are not, because they are 
Chinese and Japanese industrial standards, respectively.


Daniël


Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Florian Klaempfl
Michael Schnell wrote:
 Felipe Monteiro de Carvalho wrote:
 Ignore the name ansi. Take it as a string type with the system
 encoding. I think it will solve the confusion.
   
 Of course if you ignore ANSI and just use the type named String
 there is no confusion as it's clear that the coding is not predefined.
 
 That is exactly what I wanted to say: If you don't use it for ANSI coded
 information don't name the type ANSIString. As FPC provides the type
 ANSIString out of the box it should be used appropriately, and thus any
 new user will suppose 

Really? Pascal is a strongly typed language which avoids automatic
conversion between types as much as possible.

 that there is support for conversion between this
 type and other string types that explicitly are called differently
 according to their suggested internal coding (such as UTF8String).



Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Michael Schnell



Btw will the LCL remain forcedly UTF-8 ? I thought the current Lazarus
unicode support was temporary and all options were still open, depending on
the outcome of FPC unicode support options?
I understand they could not do it differently (other than just providing 
no Unicode support at all), maybe as there is no automatic type 
conversion support with FPC (e.g. ANSIString-UTF8String).


As the current version is not very satisfying I do hope for a change, 
but I also do understand that the Lazarus team will wait for what the 
next FPC version offers on that behalf.


-Michael


Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Daniël Mantione



On Tue, 2 Dec 2008, Michael Schnell wrote:

I supposed one of the main intentions for the move to Unicode was the ability 
to support Chinese above all. So they did not seem to have been content with 
what was done before.


Is anybody from China here to offer the footage ?


It is not the Chinese that are pushing for Unicode. GB2312 suits them 
well. Likewise, JIS 0213 suits the Japanese well. Those encodings also 
have the characters to support Western languages, or Greek, or Russian.
That Unicode is used as a technical solution to handle Chinese is to a 
large extent a Western development. Eastern developers already support 
Eastern languages.


Daniël


Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Felipe Monteiro de Carvalho
On Tue, Dec 2, 2008 at 9:00 AM, Michael Schnell [EMAIL PROTECTED] wrote:
 I still don't understand what ANSI has to do with System.

Ignore the name ansi. Take it as a string type with the system
encoding. I think it will solve the confusion.

-- 
Felipe Monteiro de Carvalho


Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Michael Schnell



The more I think about it the more I like this solution. I think it's
better than the previous idea of a string with encode information
inside it.
  

Would Lazarus be able to follow ?

Do you think it's possible to have the compiler take care of any 
necessary conversions automatically ?


-Michael


Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Mattias Gärtner
Quoting Michael Schnell [EMAIL PROTECTED]:


  The point is: if everybody takes care of the fact that ansistrings can
  be multibyte, having utf-8 in ansistrings (if it's the locale encoding),
  is no big deal at all.
 
 I do understand. But (in the real world) do you know anybody who does?

 If it were appropriate for ANSI code handling to take care of
 multi-byte encoding, we would not need locale-based code tables and, in
 effect, Unicode would not have been invented.

UTF-8 is unicode and it is the system encoding on linux, OS X, some BSDs and
Solaris. So ansistrings are UTF-8 there.


Mattias



Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Michael Schnell

Thanks for pointing this out.


GB2312 suits them well. Likewise, JIS 0213 suits the Japanese well. 

Are these called ANSI ?


-Michael


Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Florian Klaempfl
Michael Schnell wrote:
 Thanks for pointing this out.

 GB2312 suits them well. Likewise, JIS 0213 suits the Japanese well. 
 Are these called ANSI ?

Every well educated windows programmer knows that the ansi
functions/strings whatever are not limited to the so-called ansi code
pages (which aren't ansi either afaik) or is CP850/CP1252 as on my
german windows an ansi code page ;)?


Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Michael Schnell



UTF-8 is unicode and it is the system encoding on linux, OS X, some BSDs and
Solaris. So ansistrings are UTF-8 there.
  

I still don't understand what ANSI has to do with System.

AFAIK, the term ANSI code stands for a (code-page-dependent) definition 
of a character encoding, and Unicode is another one. Both are 
independent of operating systems.


-Michael


Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Florian Klaempfl
Michael Schnell wrote:
 
 The point is: if everybody takes care of the fact that ansistrings can
 be multibyte, having utf-8 in ansistrings (if it's the locale encoding),
 is no big deal at all.
   
 I do understand. But (in the real world) do you know anybody who does?
 
 If it were appropriate for ANSI code handling to take care of
 multi-byte encoding, we would not need locale-based code tables and, in
 effect, Unicode would not have been invented.

Multibyte ansi chars are still not unique and require the code page for
proper interpretation; this is why Unicode was invented. When functions
like CharToByteIndex or NextCharIndex are used properly, it very likely
makes no difference to string processing code whether the strings are multi
byte ansi or utf-8 unicode.
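
[Editorial illustration of index-based stepping. A minimal sketch for UTF-8 data only; Utf8NextIndex is a hypothetical helper invented for this example, it is not the Delphi NextCharIndex, and it assumes the input is valid UTF-8. A multi-byte ANSI code page would need its own lead-byte test in place of the continuation-byte check.]

```pascal
program Utf8Step;
{$mode objfpc}{$H+}

// Hypothetical helper: given the byte index at which a character starts,
// returns the byte index at which the next character starts (UTF-8 only).
function Utf8NextIndex(const S: AnsiString; Index: Integer): Integer;
begin
  Result := Index + 1;
  // Skip continuation bytes (10xxxxxx); relies on short-circuit evaluation.
  while (Result <= Length(S)) and ((Ord(S[Result]) and $C0) = $80) do
    Inc(Result);
end;

var
  S: AnsiString;
  I, Next: Integer;
begin
  // "a", then "ä" as its two UTF-8 bytes, then "b".
  S := 'a'#$C3#$A4'b';
  I := 1;
  while I <= Length(S) do
  begin
    Next := Utf8NextIndex(S, I);
    WriteLn('character at byte ', I, ' occupies ', Next - I, ' byte(s)');
    I := Next;
  end;
end.
```

A loop written this way never assumes a fixed character width, so the calling code stays the same if the stepping helper is swapped for one that understands another multi-byte encoding.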


Re: [fpc-devel] Unicode and UTF8String

2008-12-02 Thread Michael Schnell



Not to mention: What would the alternative be?
  
Even if I am not satisfied with the current state of Lazarus on that 
behalf, I would not dare to suggest that Lazarus should do any change 
here before the next version of FPC offers a new string handling with 
either a string type that knows its internal coding and with that any 
conversions can be done automatically, or with multiple string types 
defined of which the compiler knows how to convert them if necessary, or 
with whatever solution the FPC team comes up with.


-Michael


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Florian Klaempfl
Felipe Monteiro de Carvalho schrieb:
 Hello,
 
 Some things weren't clear from the previous discussion, so I would
 like to clarify them.
 
 For instance, the GetTempFileName routine:
 
 http://www.freepascal.org/docs-html/rtl/sysutils/gettempfilename.html
 
 The routine is currently ANSI, but we need a unicode version of it.
 What would that unicode version look like? We currently have 3 unicode
 string types planned AFAIK: 

No.

 
 I assume that the new variable encoding type would be used for all
 unicode routines, am I right?

No, it will be RTLString, whose type depends on the OS.

 
 Or would versions for all 3 types be added? (for example, if someone
 donates utf8 routines).
 
 thanks,



Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Felipe Monteiro de Carvalho
On Mon, Dec 1, 2008 at 10:13 AM, Florian Klaempfl wrote:
 No, it will be RTLString, whose type depends on the OS.

Ok, so code would be something like this:

var
  OSString: RTLString;
  MyString: UTF8String;
begin
  OSString := SomeRTLRoutine;
  MyString := OSString;

?

It will be funny to use a string type about which nothing is known. I
wonder if people will abuse this and start writing
operating-system-dependent code.

-- 
Felipe Monteiro de Carvalho


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Felipe Monteiro de Carvalho
On Mon, Dec 1, 2008 at 10:42 AM, Florian Klaempfl
[EMAIL PROTECTED] wrote:
 Why would you do this and not
 MyString := SomeRTLRoutine;

You are right, that should do it. I was thinking about var parameters.

-- 
Felipe Monteiro de Carvalho


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Martin Friebe

Florian Klaempfl wrote:

Felipe Monteiro de Carvalho schrieb:
  

On Mon, Dec 1, 2008 at 10:13 AM, Florian Klaempfl
[EMAIL PROTECTED] wrote:


No, it will be RTLString, whose type depends on the OS.
  

Ok, so code would be something like this:

var
  OSString: RTLString;
  MyString: UTF8String;
begin
  OSString := SomeRTLRoutine;
  MyString := OSString;

?



Why would you do this and not
MyString := SomeRTLRoutine;
?
  
If I understand that right, this may cause some overhead that in
some (few) cases is not needed.


If I write an application using string type X (WideString, for
example), then in the above MyString would be WideString.


The input/output of SomeRTLRoutine are RTLString; they are OS-dependent.
If I compile for an OS using UTF8, then that means each and every call
needs a string conversion.


Of course I understand that *if* some RTL function calls the OS, then the
string must be converted. But what if I simply want to extract the drive
letter, or trim the path and get the file name, without actually
accessing the file or OS? Should it be possible to skip converting?


Best Regards
Martin


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Florian Klaempfl
Felipe Monteiro de Carvalho schrieb:
 On Mon, Dec 1, 2008 at 10:13 AM, Florian Klaempfl
 [EMAIL PROTECTED] wrote:
 No, it will be RTLString, whose type depends on the OS.
 
 Ok, so code would be something like this:
 
 var
   OSString: RTLString;
   MyString: UTF8String;
 begin
   OSString := SomeRTLRoutine;
   MyString := OSString;
 
 ?

Why would you do this and not
MyString := SomeRTLRoutine;
?


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Marco van de Voort
In our previous episode, Martin Friebe said:
 
  ?
  
 
  Why would you do this and not
  MyString := SomeRTLRoutine;
  ?

 If I understand that right, this may cause some overhead that in
 some (few) cases is not needed.

Correct.
 
 If I write an application using string type X (WideString, for
 example), then in the above MyString would be WideString.

Correct.

 The input/output of SomeRTLRoutine are RTLString; they are OS-dependent.
 If I compile for an OS using UTF8, then that means each and every call
 needs a string conversion.

Correct.

 Of course I understand that *if* some RTL function calls the OS, then the
 string must be converted. But what if I simply want to extract the drive
 letter, or trim the path and get the file name, without actually
 accessing the file or OS? Should it be possible to skip converting?

Use rtlstring. Do the conversion to widestring afterwards.

IOW, you should do it the other way around. Use the OS-dependent string type
for mostly encoding-independent operations, and force a certain encoding
(using utf8string or widestring) only for the few things where you need a
specific encoding.
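
A minimal sketch of that "carry it around, convert late" pattern might
look like this. RTLString was only a proposal at this point, so the alias
below is an assumption made purely for illustration; the file name is made
up as well.

```pascal
program ConvertOnce;
{$mode objfpc}{$H+}
uses
  SysUtils;

type
  // Assumption for illustration only: AnsiString stands in for
  // "whatever string type the RTL uses on this OS".
  RTLString = AnsiString;

var
  FullName, NameOnly: RTLString;
  Wide: WideString;
begin
  FullName := 'docs/report.txt';

  // Encoding-independent work stays in the OS-dependent type:
  // no conversion happens here.
  NameOnly := ExtractFileName(FullName);

  // One explicit conversion, at the point where a specific
  // encoding is really needed.
  Wide := WideString(NameOnly);
  WriteLn(Length(Wide));
end.
```

The design choice is simply where the conversion sits: once at the edge,
instead of once per RTL call.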


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Martin Friebe

Marco van de Voort wrote:

In our previous episode, Martin Friebe said:
  

Why would you do this and not
MyString := SomeRTLRoutine;
?  
  
If I understand that right, this may cause some overhead that in
some (few) cases is not needed.

Correct.

If I write an application using string type X (WideString, for
example), then in the above MyString would be WideString.

Correct.

The input/output of SomeRTLRoutine are RTLString; they are OS-dependent.
If I compile for an OS using UTF8, then that means each and every call
needs a string conversion.

Correct.

Of course I understand that *if* some RTL function calls the OS, then the
string must be converted. But what if I simply want to extract the drive
letter, or trim the path and get the file name, without actually
accessing the file or OS? Should it be possible to skip converting?

Use rtlstring. Do the conversion to widestring afterwards.

IOW, you should do it the other way around. Use the OS-dependent string type
for mostly encoding-independent operations, and force a certain encoding
(using utf8string or widestring) only for the few things where you need a
specific encoding.

  
I agree, using RTLString will probably help fpc to optimize your exe for
each OS.


But using RTLString means you do not know whether you have UTF8 or not.
Because UTF8 behaves slightly differently from other strings, many
operations cannot be performed on RTLString:


foo[1], copy, pos ... simply because you do not know if the result is a
char, a codepoint or a subcodepoint (a single utf8 byte).


RTLString is, or will be, great if you simply need to store an
OS-dependent string in order to give it back to the OS later (e.g. open a
file, remember the file name without processing it (displaying it would
go via the OS), and save the file back under the same name).


For this you could also use ByteString, if there is such a thing, and if
it behaves as non-converting when assigned to any string.



Best Regards
Martin


---
Disclaimer: Just to keep this discussion where it was:
- I do understand why the above is as it is (string index not being utf8
char access).
- I also do not believe that this is correct (and any discussion of that
should be a new thread).




Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Martin Friebe

Marco van de Voort wrote:
In our previous episode, Martin Friebe said: 
  
I agree, using RTlString will probably help fpc to optimize your exe for 
each OS.


But, using RTLString means you do not know, if you have UTF8 or not. 


Correct.
  
Because UTF8 behaves slightly different from other Strings, many 
operations can not be performed on RTLString


foo[1], copy, pos ... simply because you do not know, if the result is a 
char, a codepoint or a subcodepoint (single utf8 byte)


You don't know that about UTF-16 either. Even though that is no problem in
  

True, good point

most cases, it is slowly time to abandon too simplistic thinking about
strings. The best solution is to minimize editing, and localize them in
certain parts of the code, keeping most of the code encoding agnostic.
  
True, too. But we are talking about Pascal, not some other language.
string[index], copy, pos and length have always been part of Pascal.


Of course they are still there, to be used in the few parts of your
code where you specialize on whatever string type you deal with.
But otherwise, using RTLString will IMHO abandon this part of Pascal
syntax. A function whose result cannot be used, because its meaning can
change at compile time, is a function that cannot be used (or we will
have buffer overflows, code injection and more ...).


I admit that the problem (and that has been discussed more than
enough) starts with UTF8String (yes, even with a utf16 string). But in
this case those functions acquired a new, but predictable, meaning. I can
do utf8string[1], and I can use the result. I only have to be aware of
what it means.


I can *not* do rtlString[1], as at the time of writing the code I cannot
know what it means. It is only decided at compilation time. IFDEFs
won't help either, because they can only cope with the set of string
types known at the time the code is written. This breaks each time
FPC is extended.
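
To make the point concrete, here is a small sketch of how the same text
answers s[1] and Length differently depending on the encoding. It assumes
a current FPC with the UnicodeString type; the text is filled in code unit
by code unit so that the source file encoding plays no role.

```pascal
program IndexMeaning;
{$mode objfpc}{$H+}
var
  U8: AnsiString;      // holds UTF-8 bytes in this sketch
  U16: UnicodeString;  // holds UTF-16 code units
begin
  // The text is U+00E4 followed by 'b' in both variables.
  SetLength(U8, 3);
  U8[1] := #$C3;
  U8[2] := #$A4;
  U8[3] := 'b';

  SetLength(U16, 2);
  U16[1] := WideChar($00E4);
  U16[2] := WideChar($0062);

  // Same text, different answers for Length and for s[1].
  WriteLn('UTF-8 : Length = ', Length(U8), ', Ord(s[1]) = ', Ord(U8[1]));
  WriteLn('UTF-16: Length = ', Length(U16), ', Ord(s[1]) = ', Ord(U16[1]));
end.
```

The UTF-8 variable should report a length of 3 with a first element of
195 (a lead byte), the UTF-16 variable a length of 2 with a first element
of 228 (the whole character). With a string type that is only fixed at
compile time, the code cannot know which of the two it is looking at.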


 and localize them in
 certain parts of the code, keeping most of the code encoding agnostic.
Sorry, I can't help taking that in another direction (which has also
been discussed before). The above quote sounds like a sentence from an
introduction to object orientation. Sure, it is the right thing.
It is right for OO, so it should be right for strings as well.
But again, it will simply be a new language with a string object,
instead of Pascal.



And yes, if you are lazy, you lose performance due to automatic conversions. It
has always been that way (also when mixing short and ansistring)
  
In other words, write Pascal code, just do not use some of the (imho)
most common elements of Pascal syntax?
I acknowledge that a language is a living thing and needs to be adjusted to
the new things that come up over time. I only ask whether this is the best way.



This is not just a good thing for OS interfacing code, but a good thing in
general.



Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Florian Klaempfl
Martin Friebe schrieb:
 In other words, write pascal code, just do not use some of the (imho)
 most common elements of pascal syntax?
 I acknowledge a language is a living thing, and needs to be adjusted to
 the new things, that come up over time. I only ask, if this is the best
 way?

We're open to proposals, make one.


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Marco van de Voort
In our previous episode, Martin Friebe said:
  most cases, it is slowly time to abandon too simplistic thinking about
  strings. The best solution is to minimize editing, and localize them in
  certain parts of the code, keeping most of the code encoding agnostic.

 True, too. But we are talking Pascal, not some other language. 
 string[index], copy, pos, length have always been part of Pascal.

So keep using ansistring? It doesn't change.
 
 Of course they are still there, to be used in the few parts of your 
 code, where you specialize on whatever string type you deal with.
 But otherwise, using  RTLString  IMHO will abandon this part of pascal 
 syntax.

It removes ASCII legacy. I don't see you complaining about the fact that
char is not 8 bit anymore, and that that abandons that part of the pascal
syntax.

 A function of which the result can not be used, as it can 
 change at compile time = such a function can not be used. (or we will 
 have buffer overflows, code injection and more ...)

Hence my suggestion to minimize this functionality.
 
 I admit that the problem (and that has been discussed more than
 enough) starts with UTF8String (yes, even with a utf16 string). But in
 this case those functions acquired a new, but predictable, meaning. I can
 do utf8string[1], and I can use the result. I only have to be aware of
 what it means.

Yes. As widestring[1] also requires interpretation. That's unicode.
 
 I can *not* do rtlString[1], as at the time of code writing I can not be 
 aware what it means.

You don't have to. You carry it around as long as you can, and when you
can't, you assign it to your type of choice and take the penalty.

Delaying that as long as possible avoids excessive penalties, and avoiding
them is IMHO just as much a part of the Pascal language. Incurring them
everywhere would hurt its general-purpose nature by turning it into Basic
(and then I mean the real Basics, not the C-with-basic-syntax that is
FreeBasic), or worse: Excel.

 It is only decided at compilation time. IFDEFs won't help either,
 because they can only cope with the set of string types known at the time
 the code is written. This breaks each time FPC is extended.

Any transition as big as ASCII -> Unicode will break things. We have had
these discussions before, but avoiding all pitfalls is simply too costly,
and that breaks other Pascal traditions.

   and localize them in
   certain parts of the code, keeping most of the code encoding agnostic.
 Sorry, I can't help taking that in another direction (which has also
 been discussed before). The above quote sounds like a sentence from an
 introduction to object orientation.

It is an introduction to abstraction maybe. I don't see the OO in there.

 It is right for OO, so it should be right for strings as well.
 But again, it will simply be a new language with a string object,
 instead of Pascal.

This is all gibberish for me. I never said OO, and never will.

  And yes, if you are lazy, you lose performance due to automatic conversions. It
  has always been that way (also when mixing short and ansistring)

 In other words, write pascal code, just do not use some of the (imho) 
 most common elements of pascal syntax?

There is no just. Strings simply get more complicated if you go unicode,
and that can't be hidden. Either you stay with safe ASCII strings, or you
use Unicode. If you do the latter, you will have to adapt anyway.

And top-heavy emulation layers are not Pascallike either.

 I acknowledge a language is a living thing, and needs to be adjusted to 
 the new things, that come up over time. I only ask, if this is the best way?

IMHO there is not even a choice, since there simply is no viable
alternative.
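
On the remark above that widestring[1] also requires interpretation, a
small sketch, assuming a current FPC with the UnicodeString type: a
character outside the Basic Multilingual Plane takes two UTF-16 code
units, so indexing can land on half of a character just as it can in
UTF-8.

```pascal
program SurrogateDemo;
{$mode objfpc}{$H+}
var
  S: UnicodeString;
  W: Word;
begin
  // U+1D11E (musical symbol G clef) lies outside the BMP, so UTF-16
  // stores it as the surrogate pair D834 DD1E.
  SetLength(S, 2);
  S[1] := WideChar($D834);
  S[2] := WideChar($DD1E);

  W := Ord(S[1]);
  WriteLn('Length = ', Length(S));
  if (W >= $D800) and (W <= $DBFF) then
    WriteLn('S[1] is a high surrogate, not a complete character');
end.
```

Length reports 2 for what a reader would call one character, which is the
same kind of interpretation step that utf8string[1] needs.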


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Luiz Americo Pereira Camara

Marco van de Voort escreveu:

In our previous episode, Martin Friebe said:
  

most cases, it is slowly time to abandon too simplistic thinking about
strings. The best solution is to minimize editing, and localize them in
certain parts of the code, keeping most of the code encoding agnostic.
  
  
True, too. But we are talking Pascal, not some other language. 
string[index], copy, pos, length have always been part of Pascal.



So keep using ansistring? It doesn't change.
  


Not true if fpc follows Delphi. The new AnsiString type will also be
converted automatically in Delphi 2009. See Marco Cantu's document about
Unicode (linked a few threads ago).


Luiz


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Florian Klaempfl
Martin Friebe schrieb:
 Marco van de Voort wrote:
 In our previous episode, Martin Friebe said:
  
 Of course they are still there, to be used in the few parts of your
 code, where you specialize on whatever string type you deal with.
 But otherwise, using  RTLString  IMHO will abandon this part of
 pascal syntax.
 
 It removes ASCII legacy. I don't see you complaining about the fact that
 char is not 8 bit anymore, and that that abandons that part of the pascal
 syntax.
   
 It does not abandon the syntax. It only adds to its meaning (*adds*;
 any existing meaning is unaltered).
 
 I can still do foo[1] for *any* type of string (well, yes, even
 RTLString, but see below).
 - If the string happens to be an old ascii string, that still works as it
 always has.
 - If the string happens to be any unicode type, that is still the same
 syntax, but with a new meaning.
  The new meaning does not break anything, because it only applies to new
 types.
  It is usable too, because I know I am dealing with codepoints or sub
 code points, and I know how they look and how to identify them.
 
 The introduction of RTLString is fine. I do say it is a good thing.
 RTLString does not interfere with the above. In fact, even for RTLString
 the syntax foo[1] does exist. It is just not useful. If I treat it as a
 utf8 sub code point, I can be wrong. If I treat it as ascii, I can be
 wrong. If I treat it as UTF16, I can be wrong.
 
 My argument was not against RTLString. However, it was my understanding
 that RTL functions will enforce RTLString: that they will only exist
 for RTLString, and they will *not* exist for other string types.
 That I would call enforcing RTLString, because of the penalties for using
 other string types.
 
 I acknowledge that if the end result of calling the RTL function is an
 OS call, the conversion penalty is always there. But not every RTL
 function ends up in an OS call.
 
 I admit that the problem (and that has been discussed more than
 enough) starts with UTF8String (yes, even with a utf16 string). But
 in this case those functions acquired a new, but predictable, meaning. I
 can do utf8string[1], and I can use the result. I only have to be
 aware of what it means.
 

 Yes. As widestring[1] also requires interpretation. That's unicode.
   
 See above: yes, it requires interpretation. But it allows me to do so.
 
 I cannot see how I can interpret RtlString[1]. If the result is bigger
 than 128, then I must know what type it is. If it is ANSI, it is a
 single-byte char. If it is utf8, it is a sub-codepoint which will be
 part of a codepoint.
 If it is widestring, well, yes, here my assumption that RtlString[1]
 returns a byte breaks. Ouch.
 

I see this as a theoretical consideration. Please give a real-world (!)
code example where this causes a problem.

If you assign the result of an rtl function to an rtlstring, this means
either that you don't care about the type of rtlstring[1], or that knowing
its type is rtlchar is enough for you. If you assign it to an
ansistring/widestring/whatever, you know what you get.


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Luiz Americo Pereira Camara

Martin Friebe escreveu:

Marco van de Voort wrote:
In our previous episode, Martin Friebe said:  
I agree, using RTlString will probably help fpc to optimize your exe 
for each OS.


But, using RTLString means you do not know, if you have UTF8 or not. 


Correct.
 
Because UTF8 behaves slightly different from other Strings, many 
operations can not be performed on RTLString


foo[1], copy, pos ... simply because you do not know, if the result 
is a char, a codepoint or a subcodepoint (single utf8 byte)

You don't know that about UTF-16 either. Even though that is no 
problem in
  

True, good point

most cases, it is slowly time to abandon too simplistic thinking about
strings. The best solution is to minimize editing, and localize them in
certain parts of the code, keeping most of the code encoding agnostic.
  
True, too. But we are talking about Pascal, not some other language.
string[index], copy, pos and length have always been part of Pascal.


Of course they are still there, to be used in the few parts of your
code where you specialize on whatever string type you deal with.
But otherwise, using RTLString will IMHO abandon this part of Pascal
syntax. A function whose result cannot be used, because its meaning can
change at compile time, is a function that cannot be used (or we will
have buffer overflows, code injection and more ...).


To use RTLString safely, at first glance it would be sufficient to use the
overloaded functions from the Character unit (introduced in Delphi
2009). See http://www.jacobthurman.com/?p=30 for how you can use them to
get Copy and Pos behavior.


Next week I'll implement those functions for UTF16 and UTF8 and do some
tests.


Luiz


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Michael Schnell



And yes, if you are lazy, you lose performance due to automatic conversions. It
has always been that way (also when mixing short and ansistring)
  

Of course you are quite right here!

If you are lazy and write your code the way you are used to, you will not
get optimum performance with a new compiler that now allows for Unicode.
But the code still needs to work as expected (as with a compiler
version that does not allow for Unicode, but simply uses ANSI or
whatever OS- and locale-dependent 8-bit encoding).


In most programs that will not be a problem at all, as doing extensive
string calculations in user code is not necessary.


Of course, if you want to take real advantage of Unicode (using
characters outside your current locale) or if you want to optimize (for
speed or for memory size), you need to be aware of the Unicode stuff and
write your code appropriately.


-Michael


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Yury Sidorov

From: Michael Schnell
And yes, if you are lazy, you lose performance due to automatic
conversions. It has always been that way (also when mixing short and
ansistring)


Of course you are quite right here!

If you are lazy and write your code the way you are used to, you will
not get optimum performance with a new compiler that now allows for
Unicode. But the code still needs to work as expected (as with
a compiler version that does not allow for Unicode, but simply uses
ANSI or whatever OS- and locale-dependent 8-bit encoding).


In most programs that will not be a problem at all, as doing
extensive string calculations in user code is not necessary.


Of course, if you want to take real advantage of Unicode (using
characters outside your current locale) or if you want to optimize
(for speed or for memory size), you need to be aware of the Unicode
stuff and write your code appropriately.


It is planned to allow users to build an ANSI version of the RTL which
will be fully compatible with existing user code.
But if you choose to use the unicode RTL, you must keep in mind all
unicode specific things...


Yury. 


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Michael Schnell



So keep using ansistring? It doesn't change.
  
Only if the bytes in the ANSIString are in fact ANSI (which the compiler
at the moment does not take care of when doing myANSIString :=
myUTF8String etc.).


I feel that with WideString the pos() etc. paradigms stay usable in more
cases than with ANSIString.


-Michael


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Michael Schnell



I don't see you complaining about the fact that
char is not 8 bit anymore, and that that abandons that part of the pascal
syntax.
  

When doing the most common string stuff like
case s[i] of
  '1', 'a', 'ä':
  ...

this does not really hurt.

Even
n := ord(s[i]) - ord('0'); works with 16-bit/char strings.
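
As a minimal sketch of that claim, assuming a current FPC with the
UnicodeString type: the usual case and ord() idioms compile and behave the
same when each element of the string is 16 bits wide. DigitSum is a
hypothetical helper written only for this example.

```pascal
program DigitDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: sum of the decimal digits found in S.
function DigitSum(const S: UnicodeString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    case S[I] of
      // Same idiom as with 8-bit chars; S[I] is a 16-bit WideChar here.
      '0'..'9': Inc(Result, Ord(S[I]) - Ord('0'));
    end;
end;

begin
  WriteLn(DigitSum('a1b22'));
end.
```

For the input 'a1b22' the digits 1, 2 and 2 are picked up, so the sum
should be 5.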

-Michael


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Michael Schnell


It is planned to allow users to build an ANSI version of the RTL which
will be fully compatible with existing user code.
But if you choose to use the unicode RTL, you must keep in mind all
unicode specific things...

This will be very helpful for the time being. Let's hope that the LCL
will follow the path of allowing the user to choose whether he actually
wants to use Unicode in user code without explicitly calling Unicode
functions.


-Michael


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Florian Klaempfl
Michael Schnell schrieb:
 
 It is planned to allow users to build ANSI version of RTL which will
 be fully compatible with existing user code.
 But if you choose to use unicode RTL, you must keep in mind all
 unicode specific things...
 This will be very helpful for the time being. 

It is not helpful because on a utf-8 system ansistring contains utf-8.
Ansistring just means: use the system locale's 8-bit encoding.

 Let's hope that the LCL
 will follow the Path of allowing the user to choose if he actually wants
 to use Unicode in the user code without explicitly calling Unicode
 functions.
 
 -Michael


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Luiz Americo Pereira Camara

Martin Friebe escreveu:

All the code
 Widestring := RtlFunction;
 Utf8string := RtlFunction;
will run, it may just perform badly.


Yes and no.

Let's assume that the platforms windows and unix have UnicodeString
(UTF-16) and UTF8String as native types, respectively.

You choose to use the UnicodeString type in your app.

Using the rtlstring approach you get:

Under windows: the native string type of the platform is the same as the
one you are using, so no conversion takes place. Good.
Under unix: the native string type of the platform is NOT the same as the
one you are using, so ONE conversion takes place. Bad.


Now let's assume that the fpc team decided to use a fixed unicode encoding
for the RTL. Let's say a UnicodeString RTL.

You choose to use the UnicodeString type in your app.

Under windows: no conversions. Everything is UTF16. Good.
Under unix: the RTL must internally convert from the native type (UTF8)
to UTF16. Bad.


The same result as above.

But someone else wants/needs to use UTF8 strings in your project.

Under windows you will get one conversion: UTF16 -> UTF8. Bad.
Under unix you will get TWO conversions: UTF8 -> UTF16 -> UTF8. Very bad.
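
The tally above can be written down as a tiny model. This is only a
sketch under the same assumptions (windows is natively UTF-16, unix is
natively UTF-8, and every call ends up in the OS); the type and routine
names are made up for the example.

```pascal
program ConversionTally;
{$mode objfpc}{$H+}

type
  TEnc = (encUTF8, encUTF16);

// Conversions for one RTL call that ends up in the OS: one if the
// application and RTL encodings differ, one more if RTL and OS differ.
function Conversions(App, Rtl, Os: TEnc): Integer;
begin
  Result := Ord(App <> Rtl) + Ord(Rtl <> Os);
end;

procedure Report(const Name: string; App, Os: TEnc);
begin
  // Per-OS RTL: the RTL encoding equals the OS encoding.
  // Fixed RTL: the RTL encoding is always UTF-16.
  WriteLn(Name,
    ': per-OS RTL = ', Conversions(App, Os, Os),
    ', fixed UTF-16 RTL = ', Conversions(App, encUTF16, Os));
end;

begin
  Report('UTF-16 app on windows', encUTF16, encUTF16);
  Report('UTF-16 app on unix   ', encUTF16, encUTF8);
  Report('UTF-8 app on windows ', encUTF8, encUTF16);
  Report('UTF-8 app on unix    ', encUTF8, encUTF8);
end.
```

Under this model the two approaches tie (0 and 1 conversions) for a
UTF-16 app, and they differ only for a UTF-8 app on unix: 0 conversions
with a per-OS RTL against 2 with a fixed UTF-16 RTL.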

Luiz


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Mattias Gaertner
On Mon, 01 Dec 2008 16:36:23 +0100
Florian Klaempfl [EMAIL PROTECTED] wrote:

[...] Martin Friebe schrieb:
  I can not see how I can interpret RtlString[1]. If the result is
  bigger than 128, then I must know what type it is. If it is ANSI,
  it is a single byte char. If it is utf8, it is a sub-codepoint
  which will be part of a codepoint.
  If it is widestring, well yes, here breaks my assumption that
  RtlString[1] returns a byte ouch
  
 
 I see this as a theoretic consideration. Please give a real world (!)
 code example when this causes a problem.

Can you give a real world example where a different RTLString for
each platform solves a problem?

 
 If you assign the result of an rtl function to an rtlstring, this
 means either that you don't care about the type of rtlstring[1], or that
 knowing its type is rtlchar is enough for you. If you assign it to an
 ansistring/widestring/whatever, you know what you get.

What string type will TStrings.Items and the many other strings in
classes.pp be?


Mattias


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Florian Klaempfl
Mattias Gaertner schrieb:
 On Mon, 01 Dec 2008 16:36:23 +0100
 Florian Klaempfl [EMAIL PROTECTED] wrote:
 
 [...] Martin Friebe schrieb:
 I can not see how I can interpret RtlString[1]. If the result is
 bigger than 128, then I must know what type it is. If it is ANSI,
 it is a single byte char. If it is utf8, it is a sub-codepoint
 which will be part of a codepoint.
 If it is widestring, well yes, here breaks my assumption that
 RtlString[1] returns a byte ouch

 I see this as a theoretic consideration. Please give a real world (!)
 code example when this causes a problem.
 
 Can you give a real world example where a different RTLString for
 each platform solves a problem?

It solves for example the problem that there are platforms where no
unicode support is available or desired and it avoids unneeded
conversions. I'd be fine using utf-16 on all platforms :)

 
  
 If you assign the result of an rtl function to an rtlstring, this
 means either that you don't care about the type of rtlstring[1], or that
 knowing its type is rtlchar is enough for you. If you assign it to an
 ansistring/widestring/whatever, you know what you get.
 
 What string type will TStrings.Items and the many other strings in
 classes.pp be?

Not yet decided, though I'd make them RTLString as well.


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Marco van de Voort
In our previous episode, Luiz Americo Pereira Camara said:
  string[index], copy, pos, length have always been part of Pascal.
  
 
  So keep using ansistring? It doesn't change.
 
 Not true if fpc will follow Delphi. The new AnsiString type will be also 
 automatically converted in Delphi 2009.

As far as I know, the default is still ascii in the default system ascii
encoding.

 See the Marco Cantu doc about 
 Unicode (linked some threads ago).

I got it from Allen Bauer's blog in May (before Tiburon was out), but while
ansistring changes, afaik the widestring to ansistring-without-qualifier
conversion stays the same?


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Mattias Gaertner
On Mon, 01 Dec 2008 15:06:45 +
Martin Friebe [EMAIL PROTECTED] wrote:

 Florian Klaempfl wrote:
[...]
 My opinion is that it should be the programmer's choice. If a
 programmer wants or needs a simpler way (keeping all the strings in
 his application in one format, which will be known to him), then he/she
 should have that choice. And then on this type the person could
 perform any index or index-like operation.

About: "keeping all the strings in his application in one format, which will
be known to him"

Only small programs can do that. All others use third-party packages.
If you want choice, then all the third-party packages used must support
all possible choices. Unlikely.


 That would mean that in order to avoid conversion, some functions
 of the RTL would be needed in overloaded versions for each string
 type. IMHO this applies only to those, which do not (or not always)
 make calls to the OS. Any other function does the conversion
 anyway. (It will be on a case-by-case basis)

Sorry, I can't follow here.
Please enlighten me, why an overloaded function with an internal
conversion is better than an implicit conversion?


[...]
 Also it would be nice (though I do not know how) not to have to duplicate
 code, in order to achieve this. Something like generics, maybe.

The goal of RTLString is to avoid duplicate code in the RTL.


Mattias


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Marco van de Voort
In our previous episode, Mattias Gaertner said:
  
  I see this as a theoretic consideration. Please give a real world (!)
  code example when this causes a problem.
 
 Can you give a real world example where a different RTLString for
 each platform solves a problem?

It avoids pingpong repeated conversions between OS encoding and whatever
encoding is default.
 
  If you assign the result of an rtl function to an rtlstring, this
  means you don't care about the type of rtlstring[1] or the knowledge,
  that its type is rtlchar is enough for you. If you assign it to an
  ansistring/widestring whatever, you know what you get.
 
 What string type will be TStrings.Items and the many other strings in
 the classes.pp?

IMHO rtlstring.


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Felipe Monteiro de Carvalho
On Mon, Dec 1, 2008 at 10:13 AM, Florian Klaempfl
[EMAIL PROTECTED] wrote:
 I assume that the new variable encoding type would be used for all
 unicode routines, am I right?

 No, it will be RTLString which type depends on the OS.

The more I think about it the more I like this solution. I think it's
better than the previous idea of a string with encoding information
inside it.

-- 
Felipe Monteiro de Carvalho


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Marco van de Voort
In our previous episode, Mattias Gaertner said:
  Florian Klaempfl wrote:
 [...]
  My opinion is that it should be the programmer's choice. If a
  programmer wants or needs a simpler way (keeping all the strings in
  his application in one format, which will be known to him) then he/she
  should have that choice. And then on this type the person could
  perform any index or index-like operation.
 
 About: keeping all the strings in his application in one format, which will be
 known to him

This is not possible, since you don't control OS + headers. Most stuff will
come from the outside in the system encoding. 

This way you can do the whole app in the system encoding, and only face
conversions when outputting to the GUI, which is (relatively) infinitely slow
anyway.

You did nail a big problem though, and a weakness in Delphi's design.
What to do with classes that are used both straight and in the GUI?

 Only small programs can do that. All others use third party packages.
 If you want choice, then all used third packages must support all
 possible choices. Unlikely.

If you want to be the lowest common denominator and ask nothing from the 3rd
party packages, it is best to stay with ASCII.



Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Felipe Monteiro de Carvalho
On Mon, Dec 1, 2008 at 5:50 PM, Marco van de Voort [EMAIL PROTECTED] wrote:
 You did nail a big problem though, and a weakness in Delphi's design.
 What to do with classes that are used both straight and in the GUI?

You mean like TStrings?

I think we will eventually roll our own TUTF8Strings

We could add a unit in FPC for all kinds of UTF-8 versions of routines.

-- 
Felipe Monteiro de Carvalho


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Felipe Monteiro de Carvalho
On Mon, Dec 1, 2008 at 5:40 PM, Florian Klaempfl [EMAIL PROTECTED] wrote:
 What string type will be TStrings.Items and the many other strings in
 the classes.pp?

 Not yet decided though I'd make them RTLString as well.

I think you can't change TStrings because that would break all code
using it (huge amounts of code).

I would recommend adding a similar class with a different name.

-- 
Felipe Monteiro de Carvalho


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Mattias Gaertner
On Mon, 01 Dec 2008 20:40:14 +0100
Florian Klaempfl [EMAIL PROTECTED] wrote:

 Mattias Gaertner wrote:
  On Mon, 01 Dec 2008 16:36:23 +0100
  Florian Klaempfl [EMAIL PROTECTED] wrote:
  
  [...] Martin Friebe wrote:
  I can not see how I can interpret RtlString[1]. If the result is
  bigger than 128, then I must know what type it is. If it is ANSI,
  it is a single byte char. If it is utf8, it is a sub-codepoint
  which will be part of a codepoint.
  If it is widestring, well yes, here breaks my assumption that
  RtlString[1] returns a byte ouch
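
The point about interpreting the first element of a string can be made
concrete with a small sketch. Python stands in for Pascal here purely for
illustration, and first_unit is a hypothetical helper, not an RTL
function. It shows that the same character yields a different first code
unit depending on the encoding:

```python
import struct


def first_unit(text, encoding):
    """Return the first code unit of text in the given encoding as an int."""
    data = text.encode(encoding)
    if encoding == "utf-16-le":
        # UTF-16 code units are 16 bits wide, so read two bytes.
        return struct.unpack_from("<H", data)[0]
    # Single-byte codepages and UTF-8 use 8-bit code units.
    return data[0]


s = "é"  # U+00E9, outside the ASCII range

ansi = first_unit(s, "cp1252")      # single-byte codepage: the whole character
utf8 = first_unit(s, "utf-8")       # only the lead byte of a two-byte sequence
utf16 = first_unit(s, "utf-16-le")  # one 16-bit unit: the whole character
```

In this sketch the ANSI and UTF-16 values both identify the character,
while the UTF-8 value is only part of it, which is the ambiguity the
question is about.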
 
  I see this as a theoretic consideration. Please give a real world
  (!) code example when this causes a problem.
  
  Can you give a real world example where a different RTLString for
  each platform solves a problem?
 
 It solves for example the problem that there are platforms where no
 unicode support is available or desired 

:)

 and it avoids unneeded conversions. 

I understand it 'avoids unneeded conversions' *inside* the RTL, by
adding implicit conversions to the code accessing the RTL.

 I'd be fine using utf-16 on all platforms :)

Me2. At least for the file functions.
I have some doubt about the classes.pp.


  If you assign the result of an rtl function to an rtlstring, this
  means you don't care about the type of rtlstring[1] or the
  knowledge, that its type is rtlchar is enough for you. If you
  assign it to an ansistring/widestring whatever, you know what you
  get.
  
  What string type will be TStrings.Items and the many other strings
  in the classes.pp?
 
 Not yet decided though I'd make them RTLString as well.

:(
TStrings is dog slow, and the only reason it was still reasonable
was that assigning strings only involved reference counting.
If TStrings uses a platform dependent string, this is a big
performance problem.


Mattias


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Mattias Gaertner
On Mon, 1 Dec 2008 20:44:32 +0100 (CET)
[EMAIL PROTECTED] (Marco van de Voort) wrote:

 In our previous episode, Mattias Gaertner said:
   
   I see this as a theoretic consideration. Please give a real world
   (!) code example when this causes a problem.
  
  Can you give a real world example where a different RTLString for
  each platform solves a problem?
 
 It avoids pingpong repeated conversions between OS encoding and
 whatever encoding is default.

A real world example please.


Mattias


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Marco van de Voort
In our previous episode, Felipe Monteiro de Carvalho said:
  You did nail a big problem though, and a weakness in Delphi's design.
  What to do with classes that are used both straight and in the GUI?
 
 You mean like TStrings?
 
 I think we will eventually roll our own TUTF8Strings
 
 We could add a unit in FPC for all kinds of UTF-8 versions of routines.

Doesn't work per se. Tstringlist is also used in libraries, to save GUI
parts etc.

A better solution would be to simply not try to fix this and give lazarus their
own copy of said classes, so that they can keep the encoding of that in sync
with whatever they decide for their own encoding.

That would actually require fewer fixups (a few conversion procedures for
the rare points where tlclstrings are passed to e.g. registry units. Lazarus
already has their own XML units).


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Jonas Maebe


On 01 Dec 2008, at 20:57, Felipe Monteiro de Carvalho wrote:

 On Mon, Dec 1, 2008 at 5:40 PM, Florian Klaempfl [EMAIL PROTECTED] wrote:
  What string type will be TStrings.Items and the many other strings in
  the classes.pp?

  Not yet decided though I'd make them RTLString as well.

 I think you can't change TStrings because that would break all code
 using it (huge amounts of code).

 I would recommend adding a similar class with a different name.

In that case, I would recommend giving it the "string with attached
encoding" style type so you don't need 5 tstrings variants.


Regarding how to deal with file system representations, conversions
etc, it may also be interesting to look at Apple's NSString class
(http://developer.apple.com/documentation/Cocoa/Reference/Foundation/Classes/NSString_Class/Reference/NSString.html)
or, if you prefer a procedural approach, CFStrings
(http://developer.apple.com/documentation/CoreFoundation/Reference/CFStringRef/index.html)


I'm not suggesting to mimic that exact API, but only to see what kind
of APIs they support (and are deprecating). NSString/CFString (one is
just an OOP version of the other) are also a universal string
container type, with embedded encoding.


For example, there are routines such as
* CFStringGetCharacterAtIndex() (and more optimised approaches as
  documented there, such as CFStringGetRangeOfComposedCharactersAtIndex())
* CFStringGetFileSystemRepresentation() (basically the rtlstring
  version of the string)
* CFStringConvertWindowsCodepageToEncoding() (returns the Core
  Foundation encoding constant that is the closest mapping to a given
  Windows codepage identifier)
* ...

The advantage when using such a type is that you also only need to  
convert it (internally, hidden from the user) on demand or when some  
helper routine requires it (such as e.g. case-insensitive  
comparisons). Otherwise, no conversion whatsoever is necessary.
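
As a rough sketch of the convert-on-demand idea (not the NSString or
CFString implementation, and not a proposed FPC type), the following
Python class keeps the payload together with its encoding and only
transcodes when a caller asks for a different one. TaggedString, its
fields, and the conversion counter are all hypothetical names introduced
for illustration:

```python
class TaggedString:
    """Hypothetical string container that remembers its own encoding."""

    def __init__(self, data: bytes, encoding: str):
        self.data = data
        self.encoding = encoding
        self.conversions = 0  # counts real transcoding steps, for demonstration

    def as_encoding(self, encoding: str) -> bytes:
        """Return the payload in the requested encoding, converting only if needed."""
        if encoding == self.encoding:
            # Same encoding: hand back the stored bytes untouched.
            return self.data
        self.conversions += 1
        return self.data.decode(self.encoding).encode(encoding)


name = TaggedString("naïve.txt".encode("utf-8"), "utf-8")
same = name.as_encoding("utf-8")      # no conversion happens
wide = name.as_encoding("utf-16-le")  # exactly one conversion, on demand
```

A caller that stays in the string's own encoding never pays for a
conversion; only the request for UTF-16 does.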



Jonas


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Mattias Gaertner
On Mon, 1 Dec 2008 17:53:58 -0200
Felipe Monteiro de Carvalho [EMAIL PROTECTED] wrote:

 On Mon, Dec 1, 2008 at 5:50 PM, Marco van de Voort [EMAIL PROTECTED]
 wrote:
  You did nail a big problem though, and a weakness in Delphi's
  design. What to do with classes that are used both straight and in
  the GUI?
 
 You mean like TStrings?
 
 I think we will eventually roll our own TUTF8Strings

This must be added to the classes.pp and TStrings must know it, so that
Assign et al works.


 We could add a unit in FPC for all kinds of UTF-8 versions of
 routines.

Yes, that is a good idea anyway - independent of RTLString and the
current topic. Perhaps this should be discussed in a separate thread.


Mattias
 


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Marco van de Voort
In our previous episode, Felipe Monteiro de Carvalho said:
 On Mon, Dec 1, 2008 at 5:40 PM, Florian Klaempfl [EMAIL PROTECTED] wrote:
  What string type will be TStrings.Items and the many other strings in
  the classes.pp?
 
  Not yet decided though I'd make them RTLString as well.
 
 I think you can't change TStrings because that would break all code
 using it (huge amounts of code).

Depends. The few last msgs kept me thinking, and if what I saw on the web
about Tiburon is correct, they simply control the type of ansistring in
tstringlist, and default let it be the system encoding. (default like in old
delphi). For unicodecontrols they set some new property or so to change it
to UTF8, and take the conversion penalties for granted.

This allows them to do 

  MyStringList.SaveToFile('SomeFilename.txt', TEncoding.Unicode);  

It has to be something like that, since if mystringlist always was
ansistring in whatever ISO encoding, that would be a pretty pointless
unicode control.




Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Marco van de Voort
In our previous episode, Jonas Maebe said:

(nsstring)

 The advantage when using such a type is that you also only need to  
 convert it (internally, hidden from the user) on demand or when some  
 helper routine requires it (such as e.g. case-insensitive  
 comparisons). Otherwise, no conversion whatsoever is necessary.

Do they have some way to indicate that a procedure/method only supports a
certain encoding? Or do you have to manually force the encoding in that way?

I prefer a declarative way to solve this.


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Marco van de Voort
In our previous episode, Mattias Gaertner said:
 
  and it avoids unneeded conversions. 
 
 I understand it 'avoids unneeded conversions' *inside* the RTL, by
 adding implicit conversions to the code accessing the RTL.

It allows the user to stay conversion free, and have some control over how
many conversions are being done. 

It is way better than making this decision for him, and forcing him to an
encoding he normally wouldn't use in the first place.
 
  I'd be fine using utf-16 on all platforms :)
 
 Me2. At least for the file functions.

I would too. If all platforms had chosen it. But they didn't. 

  Not yet decided though I'd make them RTLString as well.
 
 :(
 TStrings is dog slow, and the only reason it was still reasonable
 was that assigning strings only involved reference counting.
 If TStrings uses a platform dependent string, this is a big
 performance problem.

Because exactly why? See also my previous msg.



Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Jonas Maebe


On 01 Dec 2008, at 21:17, Marco van de Voort wrote:


 In our previous episode, Jonas Maebe said:

 (nsstring)

  The advantage when using such a type is that you also only need to
  convert it (internally, hidden from the user) on demand or when some
  helper routine requires it (such as e.g. case-insensitive
  comparisons). Otherwise, no conversion whatsoever is necessary.

 Do they have some way to indicate that a procedure/method only supports a
 certain encoding?

No.

 Or do you have to manually force the encoding in that way?

Yes.

 I prefer a declarative way to solve this.



In the Pascal case, you could simply declare your parameter as  
UTF8String (or whatever) and the compiler would insert a conversion  
from this universal string type into a utf8string.



Jonas


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Felipe Monteiro de Carvalho
On Mon, Dec 1, 2008 at 6:22 PM, Mattias Gaertner
[EMAIL PROTECTED] wrote:
 Compatibility was always the bigger goal for lazarus. IMHO a TLCLStrings
 breaks more than it would solve.

I don't fully understand how the Tiburon TStrings works, but consider
that we are used to mixing TStrings with LCL code, and then we migrate
to the proposed UTF8String.

Now every assignment of a string to TStrings will have an implicit conversion.

Unless there are many overloaded methods in TStrings, one for each
encoding, and it is able to internally use our desired encoding so
that no useless conversions are done.

If these requirements aren't met, we need a new class.

-- 
Felipe Monteiro de Carvalho


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Marco van de Voort
In our previous episode, Mattias Gaertner said:
  encoding.
  
  That would actually require less fixups (a few conversions procedures
  for the rare points where tlclstrings are passed to e.g. registry
  units. Lazarus already has their own XML units). 
 
 Only at places where we had the choice. The LCL uses the FCL xml units.

As said it can be fixed. Maybe even easier than I thought.

 Compatibility was always the bigger goal for lazarus. IMHO a TLCLStrings
 breaks more than it would solve.

A lot will change. Even with Delphi not everything automatically is unicode,
and they only have one platform to regard.

I am usually someone who is pretty serious about Delphi compatibility, except for
some of the more bizarre post-D7 experiments. This however is different. While
I really like Tiburon as a Delphi user, I loathe it as an FPC user.

It will never be totally transparent, whatever you do. Just like some
things never were transparent when porting to Linux.

We are a multi OS compiler, not a version of Wine oriented towards Pascal
development.



Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Graeme Geldenhuys
On Mon, Dec 1, 2008 at 10:03 PM, Mattias Gaertner
[EMAIL PROTECTED] wrote:

 and it avoids unneeded conversions.

 I understand it 'avoids unneeded conversions' *inside* the RTL, by
 adding implicit conversions to the code accessing the RTL.

This is exactly what I was thinking. The conversion is simply passed
on to a different piece of code. So  the end result is the same - you
still have conversion.


Regards,
  - Graeme -


___
fpGUI - a cross-platform Free Pascal GUI toolkit
http://opensoft.homeip.net/fpgui/


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Felipe Monteiro de Carvalho
On Mon, Dec 1, 2008 at 6:22 PM, Mattias Gaertner
[EMAIL PROTECTED] wrote:
 Compatibility was always the bigger goal for lazarus. IMHO a TLCLStrings
 breaks more than it would solve.

You mean compatibility with Delphi?

With Tiburon I think this will become very hard, if possible at all.
We can, however, keep compatible with previous Delphi versions.

-- 
Felipe Monteiro de Carvalho


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Marco van de Voort
In our previous episode, Graeme Geldenhuys said:
  and it avoids unneeded conversions.
 
  I understand it 'avoids unneeded conversions' *inside* the RTL, by
  adding implicit conversions to the code accessing the RTL.
 
 This is exactly what I was thinking. The conversion is simply passed
 on to a different piece of code. So  the end result is the same - you
 still have conversion.

Not necessarily, since you might not use a different type. Or only use them
in a few rare cases where you must do character level access.

Or you might convert to UTF-32, because the char level access is
particularly difficult, or you want to be correct.

IOW, you give the programmer the choice about the type, instead of forcing
him an arbitrary one, based on your favorite platform.


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Felipe Monteiro de Carvalho
On Mon, Dec 1, 2008 at 6:38 PM, Marco van de Voort [EMAIL PROTECTED] wrote:
 In our previous episode, Graeme Geldenhuys said:
 IOW, you give the programmer the choice about the type, instead of forcing
 him an arbitrary one, based on your favorite platform.

This is the part I like about this approach. The most likely fixed
encoding to be adopted would be UTF-16, and something not very nice
would happen to Lazarus users in UNIXes:

LCL (UTF-8) -- RTL (UTF-16) --- Operating System (UTF-8)

2 useless conversions.
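
The count of two can be checked with a small sketch. Python is used only
for illustration; count_conversions and the layer lists are hypothetical
and simply model each layer by the encoding it expects:

```python
def count_conversions(layers):
    """Count encoding changes between adjacent layers of a call chain."""
    return sum(1 for a, b in zip(layers, layers[1:]) if a != b)


# LCL -> RTL -> operating system on a UNIX, as in the diagram above.
fixed_utf16_rtl = ["utf-8", "utf-16", "utf-8"]
# The same chain if the RTL follows the OS encoding instead.
native_rtl = ["utf-8", "utf-8", "utf-8"]

# What the two conversions do to actual data: the bytes come back unchanged.
utf8_in = "Grüße".encode("utf-8")
utf16_mid = utf8_in.decode("utf-8").encode("utf-16-le")   # conversion 1
utf8_out = utf16_mid.decode("utf-16-le").encode("utf-8")  # conversion 2
```

Both conversions are lossless here, so the only effect of the detour
through UTF-16 is the extra work.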

-- 
Felipe Monteiro de Carvalho


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Marco van de Voort
In our previous episode, Jonas Maebe said:

  Do they have some way to indicate that a procedure/method only supports
  a certain encoding?
 
 No.
 
  Or do you have to manually force the encoding in that way?
 
 Yes.

Clear. I just wondered how they solved it.

  I prefer a declarative way to solve this.
 In the Pascal case, you could simply declare your parameter as  
 UTF8String (or whatever) and the compiler would insert a conversion  
 from this universal string type into a utf8string.

I know. With that modification I thought that was the best too, until Tiburon
details emerged.

Actually I still think that our original is the best, not considering
compatibility issues, but I don't think the difference is worth losing at least
base level Tiburon compatibility over. Especially because their solution has
some advantages too (it allows changing ansistring code more gradually).


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Mattias Gaertner
On Mon, 1 Dec 2008 21:07:50 +0100 (CET)
[EMAIL PROTECTED] (Marco van de Voort) wrote:

 In our previous episode, Felipe Monteiro de Carvalho said:
   You did nail a big problem though, and a weakness in Delphi's
   design. What to do with classes that are used both straight and
   in the GUI?
  
  You mean like TStrings?
  
  I think we will eventually roll our own TUTF8Strings
  
  We could add a unit in FPC for all kinds of UTF-8 versions of
  routines.
 
 Doesn't work per se. Tstringlist is also used in libraries, to save
 GUI parts etc.
 
 A better solution would be to simply not try to fix this and give
 lazarus their own copy of said classes, so that they can keep the
 encoding of that in sync with whatever they decide for their own
 encoding.
 
 That would actually require less fixups (a few conversions procedures
 for the rare points where tlclstrings are passed to e.g. registry
 units. Lazarus already has their own XML units). 

Only at places where we had the choice. The LCL uses the FCL xml units.

Compatibility was always the bigger goal for lazarus. IMHO a TLCLStrings
breaks more than it would solve.

Mattias


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Marco van de Voort
In our previous episode, Felipe Monteiro de Carvalho said:
 This is the part I like about this approach. The most likely fixed
 encoding to be adopted would be UTF-16, and something not very nice
 would happen to Lazarus users in UNIXes:
 
 LCL (UTF-8) -- RTL (UTF-16) --- Operating System (UTF-8)
 
 2 useless conversions.

Btw will the LCL remain forcedly UTF-8 ? I thought the current Lazarus
unicode support was temporary and all options were still open, depending on
the outcome of FPC unicode support options?


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Felipe Monteiro de Carvalho
On Mon, Dec 1, 2008 at 6:48 PM, Marco van de Voort [EMAIL PROTECTED] wrote:
 Btw will the LCL remain forcedly UTF-8 ? I thought the current Lazarus
 unicode support was temporary and all options were still open, depending on
 the outcome of FPC unicode support options?

It is certainly not temporary, also considering people won't be very
happy to see us make a big incompatible change right after telling
them to convert their source code to UTF-8. I think we have a
responsibility to stay coherent here.

I think we may consider migrating to UTF8String when it's implemented
and if it proves a viable solution.

Not to mention: What would the alternative be?

-- 
Felipe Monteiro de Carvalho


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Mattias Gaertner
On Mon, 1 Dec 2008 18:45:46 -0200
Felipe Monteiro de Carvalho [EMAIL PROTECTED] wrote:

 On Mon, Dec 1, 2008 at 6:38 PM, Marco van de Voort [EMAIL PROTECTED]
 wrote:
  In our previous episode, Graeme Geldenhuys said:
  IOW, you give the programmer the choice about the type, instead of
  forcing him an arbitrary one, based on your favorite platform.
 
 This is the part I like about this approach. The most likely fixed
 encoding to be adopted would be UTF-16, and something not very nice
 would happen to Lazarus users in UNIXes:
 
 LCL (UTF-8) -- RTL (UTF-16) --- Operating System (UTF-8)
 
 2 useless conversions.

The LCL is a visual component library. Its string speed is slow anyway.
Except maybe for TMemo.Lines and running through a big TreeView.

The same is true for file functions. The OS overhead of checking
permissions and the other file system issues makes even a triple
encoding/decoding a non-issue. For example, under Mac OS X the Lazarus
IDE uses the CFString functions to compare a filename. This normalizes
the string each time. CompareFilenames is easily called hundreds of
thousands of times, and no one has said yet that the IDE runs slowly
under OS X.

It's a different thing for TStrings. Many algorithms need O(1)
time for accessing Items[i].


Mattias



Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Marco van de Voort
In our previous episode, Felipe Monteiro de Carvalho said:
 On Mon, Dec 1, 2008 at 6:48 PM, Marco van de Voort [EMAIL PROTECTED] wrote:
  Btw will the LCL remain forcedly UTF-8 ? I thought the current Lazarus
  unicode support was temporary and all options were still open, depending on
  the outcome of FPC unicode support options?
 
 It is certainly not temporary, also considering people won't be very
 happy to see us make a big incompatible change right after telling
 them to convert their source code to UTF-8. I think we have a
 responsibility to stay coherent here.

Well, euh, you will need a change anyway from manual to automated ?

 I think we may consider migrating to UTF8String when it's implemented
 and if it proves a viable solution.
 
 Not to mention: What would the alternative be?

Well, the logical one of course:

RTLString, IOW encoding platform dependent. Except maybe selected widgets like
synedit. (Borland stores source in utf-8 too on windows)



Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Luca Olivetti

Felipe Monteiro de Carvalho wrote:
 LCL (UTF-8) -- RTL (UTF-16) --- Operating System (UTF-8)

Is the last step always true? Doesn't Qt support UTF-16?

Bye
--
Luca


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Felipe Monteiro de Carvalho
On Mon, Dec 1, 2008 at 7:03 PM, Luca Olivetti [EMAIL PROTECTED] wrote:
 LCL (UTF-8) -- RTL (UTF-16) --- Operating System (UTF-8)

 Is the last step always true? Doesn't qt support utf-16?

This is for operating system calls, not graphical library calls.

-- 
Felipe Monteiro de Carvalho


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Felipe Monteiro de Carvalho
On Mon, Dec 1, 2008 at 7:01 PM, Marco van de Voort [EMAIL PROTECTED] wrote:
 RTLString, IOW encoding platform dependent. Except maybe selected widgets like
 synedit. (Borland stores source in utf-8 too on windows)

A string whose encoding is unknown is very inconvenient for
developers. The idea just saves itself in the RTL because of the
eventual need to do some extremely high performance applications. For
Lazarus it would be simply a useless inconvenience.

-- 
Felipe Monteiro de Carvalho


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Florian Klaempfl
Felipe Monteiro de Carvalho wrote:
 On Mon, Dec 1, 2008 at 7:01 PM, Marco van de Voort [EMAIL PROTECTED] wrote:
 RTLString, IOW encoding platform dependent. Except maybe selected widgets like
 synedit. (Borland stores source in utf-8 too on windows)
 
 A string whose encoding is unknown is very inconvenient for
 developers. The idea just saves itself in the RTL because of the
 eventual need to do some extremely high performance applications. For
 Lazarus it would be simply a useless inconvenience.
 

So how did people work for years with ansistring?


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Marco van de Voort
In our previous episode, Felipe Monteiro de Carvalho said:
  RTLString, IOW encoding platform dependent. Except maybe selected widgets like
  synedit. (Borland stores source in utf-8 too on windows)
 
 A string whose encoding is unknown is very inconvenient for
 developers. 

I don't see that so strongly as most. 

 The idea just saves itself in the RTL because of the
 eventual need to do some extremely high performance applications. For
 Lazarus it would be simply a useless inconvenience.

Same as above. That is an opinion, not fact, and I don't agree.

It is btw not just about performance, but also about predictability. Fewer
encodings in use means better predictability.

If RTL+LCL are in the system encoding (with LCL mostly hiding oddball libs
such as Qt if that is the widgetset), you have a fair chance not to have a
multi-encoding app, without thick layers of emulation. Or at least keep the
encoding-dependent part localised to a fairly small part of the program.

To be honest, I think a case for "LCL follows the widget set encoding" could
also be made.



Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Marco van de Voort
In our previous episode, Florian Klaempfl said:
  A string whose encoding is unknown is very inconvenient for
  developers. The idea just saves itself in the RTL because of the
  eventual need to do some extremely high performance applications. For
  Lazarus it would be simply a useless inconvenience.
 
 So how did people work for years with ansistring?

Depends on country I guess. Here one simply skips accents, except for a few
apps/fields where e.g. own names (of persons, cities etc) are used. At
least until recent years.



Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Felipe Monteiro de Carvalho
On Mon, Dec 1, 2008 at 7:24 PM, Florian Klaempfl [EMAIL PROTECTED] wrote:
 So how did people work for years with ansistring?

An ansistring used in the way proposed by FPC is extremely inconvenient
for any GUI application which will be run in different parts of the
globe. You develop an application on a Russian machine, send it to an
English machine, and it shows rubbish instead of text, even if you
could actually read that Russian GUI.

It makes what is shown at runtime depend on the operating system you
are running on. It's exactly the mess Unicode was invented to end.
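
To make the mismatch concrete, here is a minimal sketch, not RTL code. The
helper names FromCp1251 and FromCp1252 are made up for illustration, and
they only cover ASCII plus the $C0..$FF letter block. The six bytes are the
word "Привет" as a CP1251 machine would store it.

```pascal
program CodepageMismatch;
{$mode objfpc}{$H+}

uses
  SysUtils;

const
  // "Привет" as stored by a program built on a CP1251 (Russian) machine.
  Stored: array[0..5] of Byte = ($CF, $F0, $E8, $E2, $E5, $F2);

// Hypothetical helper: Unicode code point of byte b under CP1251.
// Sketch only: correct for ASCII and for the $C0..$FF letter block.
function FromCp1251(b: Byte): Integer;
begin
  if b >= $C0 then
    Result := $0410 + (b - $C0)   // U+0410 .. U+044F, the Cyrillic letters
  else
    Result := b;
end;

// Hypothetical helper: the same byte under CP1252, where $C0..$FF map
// straight onto U+00C0..U+00FF. Sketch only: not valid for $80..$9F.
function FromCp1252(b: Byte): Integer;
begin
  Result := b;
end;

var
  i: Integer;
begin
  for i := Low(Stored) to High(Stored) do
    WriteLn('byte $', IntToHex(Stored[i], 2),
            '  CP1251: U+', IntToHex(FromCp1251(Stored[i]), 4),
            '  CP1252: U+', IntToHex(FromCp1252(Stored[i]), 4));
end.
```

For the same six bytes, the CP1251 column shows Cyrillic code points
(U+041F, U+0440, ...) and the CP1252 column shows accented Latin ones
(U+00CF, U+00F0, ...), which is the rubbish the English machine displays.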

People worked for years with ansistring, suffering from its shortcomings.

-- 
Felipe Monteiro de Carvalho


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Martin Friebe

Luiz Americo Pereira Camara wrote:

Martin Friebe wrote:

All the code
 Widestring := RtlFunction;
 Utf8string := RtlFunction;
will run, it may just perform badly.


Yes and no.

Let's assume the platforms windows and unix having UnicodeString 
(UTF-16) and UTF8String as native types respectively.

You choose to use UnicodeString type in your app.

Using the rtlstring approach you get:

Under Windows: the native string type of the platform is the same as the
one you are using, so no conversion takes place. Good.
Under Unix: the native string type of the platform is NOT the same as the
one you are using, so ONE conversion takes place. Bad.
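
The two cases can be sketched as follows. This is only an illustration under
stated assumptions: RTLString is declared here as a plain alias for the
platform's native type, RtlFunction is a stand-in for any RTL routine, and
WideString plays the role of the UTF-16 type. The conversions are written
out explicitly with UTF8Encode/UTF8Decode so that they are visible; with an
implicit assignment the compiler would insert them silently.

```pascal
program RtlStringConversions;
{$mode objfpc}{$H+}

type
  // Assumption for this sketch: RTLString aliases the native string type.
{$IFDEF WINDOWS}
  RTLString = WideString;    // UTF-16
{$ELSE}
  RTLString = UTF8String;    // UTF-8
{$ENDIF}

// Hypothetical RTL routine returning the native string type.
function RtlFunction: RTLString;
begin
  Result := 'hello';   // ASCII only, to keep the example encoding-neutral
end;

var
  w: WideString;
  u: UTF8String;
begin
{$IFDEF WINDOWS}
  w := RtlFunction();              // same type: no conversion
  u := UTF8Encode(RtlFunction());  // one conversion, UTF-16 -> UTF-8
{$ELSE}
  w := UTF8Decode(RtlFunction());  // one conversion, UTF-8 -> UTF-16
  u := RtlFunction();              // same type: no conversion
{$ENDIF}
  WriteLn(Length(w), ' ', Length(u));
end.
```

Whichever type the application picks, it pays for at most one conversion
per call, and only on the platform whose native type differs.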


Now let's assume that fpc team decided to use a fixed unicode encoding 
for the RTL. Let's say a UnicodeString RTL.

You choose to use UnicodeString type in your app.

I never suggested the RTL to be in a fixed encoding. I fully agree that 
this would be far worse.


I suggested having an RTL that has overloaded functions for each string 
type.

Of course, that sounds easier than it will in fact be.

Florian pointed out a few issues: overloading by result is not possible 
(yet?), and code duplication would be a maintenance hell.
But those limits can be overcome, though maybe not in full for the first 
Unicode fpc release.


I can see that in order to get at least something (and in a way forward 
compatible) to all the waiting users of fpc, the RTLString solution is a 
good solution (or compromise: full function, limited optimization).


The functions that can be overloaded with what fpc already has, could be 
written for the various types. Maybe a template system for plain 
functions (like generics for objects) could be found? So code would not 
be duplicated.
Maybe fpc could be extended to allow overloading by result? (Surely that 
has other uses too?)
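
A rough sketch of what is possible with overloading as it exists today. All
routine names here (CountCodePoints, GetName) are invented for the example
and are not RTL routines. The first pair is overloaded by parameter type and
shows the near-duplicate bodies a template system would avoid. The second
pair works around the missing overload-by-result with an out parameter.

```pascal
program OverloadPerStringType;
{$mode objfpc}{$H+}

// Hypothetical routine, overloaded by parameter type: one body per encoding.
function CountCodePoints(const s: UTF8String): Integer; overload;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(s) do
    if (Ord(s[i]) and $C0) <> $80 then       // skip UTF-8 continuation bytes
      Inc(Result);
end;

function CountCodePoints(const s: WideString): Integer; overload;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(s) do
    if (Ord(s[i]) and $FC00) <> $DC00 then   // skip UTF-16 low surrogates
      Inc(Result);
end;

// Overloading by result is not available, so the "return value" becomes an
// out parameter; the declared type of the argument selects the overload.
procedure GetName(out s: UTF8String); overload;
begin
  s := 'fpc';
end;

procedure GetName(out s: WideString); overload;
begin
  s := 'fpc';
end;

var
  u: UTF8String;
  w: WideString;
begin
  GetName(u);   // picks the UTF8String version
  GetName(w);   // picks the WideString version
  WriteLn(CountCodePoints(u), ' ', CountCodePoints(w));
end.
```

The out-parameter form is clumsier at the call site than a function result,
which is the cost overloading by result would remove.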

It's just a suggestion.

Best Regards
Martin


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Felipe Monteiro de Carvalho
On Mon, Dec 1, 2008 at 7:33 PM, Martin Friebe [EMAIL PROTECTED] wrote:
 I suggested to have a rtl, that has overloaded functions for each string
 type.
 of course that sounds easier than in fact it will be.

This is about the same as having all string routines in 3 flavours:
RTLString, utf-8 and utf-16

The utf-8 and utf-16 flavours could be done by assigning the rtlstring to
the appropriate type.

I think this is probably what we will end up with, because users of a
particular encoding will build convenience routines for their favorite
RTL routines.

-- 
Felipe Monteiro de Carvalho


Re: [fpc-devel] Unicode and UTF8String

2008-12-01 Thread Florian Klaempfl
Felipe Monteiro de Carvalho schrieb:
 On Mon, Dec 1, 2008 at 7:24 PM, Florian Klaempfl [EMAIL PROTECTED] wrote:
 So how did people work for years with ansistring?
 
 A ansistring used in the way proposed by FPC is extremely inconvenient
 for any GUI application which will be run in different parts of the
 globe. 

I meant more that a lot of people simply ignored in their code that
ansistrings could also be multibyte, even without considering UTF-8.
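
A small sketch of the kind of code that breaks. UTF-8 content is assumed
here only because its byte values are easy to write down; the same thing
happens with any multibyte ANSI code page.

```pascal
program MultibyteAnsi;
{$mode objfpc}{$H+}

uses
  SysUtils;

var
  s, head: AnsiString;
begin
  // 'é' followed by 'x', assuming the ansistring holds UTF-8:
  // 'é' takes two bytes ($C3 $A9), 'x' takes one.
  SetLength(s, 3);
  s[1] := #$C3;
  s[2] := #$A9;
  s[3] := 'x';

  WriteLn('Length(s) = ', Length(s));   // counts bytes, not characters

  // Code that assumes one byte per character cuts the 'é' in half.
  head := Copy(s, 1, 1);
  WriteLn('first "char" is byte $', IntToHex(Ord(head[1]), 2));
end.
```

The string holds two characters but Length reports 3, and Copy(s, 1, 1)
returns the lone lead byte $C3 rather than a complete character.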

