Re: [Lazarus] String vs WideString

2017-08-17 Thread Luca Olivetti via Lazarus

On 17/08/17 at 01:34, Graeme Geldenhuys via Lazarus wrote:

> On 2017-08-16 19:26, Luca Olivetti via Lazarus wrote:
>
>> I mean, TBytes is just an "array of char".
>
> NO!  Char can now mean a 1-byte char or a 2-byte char (I don't know how


Sorry, I meant "array of byte". The point is it doesn't have all the 
features of a string.


Bye
--
Luca Olivetti
Wetron Automation Technology http://www.wetron.es/
Tel. +34 93 5883004 (Ext.3010)  Fax +34 93 5883007
--
___
Lazarus mailing list
Lazarus@lists.lazarus-ide.org
https://lists.lazarus-ide.org/listinfo/lazarus


Re: [Lazarus] String vs WideString

2017-08-17 Thread Ondrej Pokorny via Lazarus

On 17.08.2017 16:34, Graeme Geldenhuys via Lazarus wrote:

> On 2017-08-17 13:40, Marcos Douglas B. Santos via Lazarus wrote:
>
>> Sorry, but every single warning is a... warning... that needs to be
>> resolved.
>
> I feel exactly the same. :-)  It took me ages to figure out how to
> change my code so I could get rid of the "variable not initialized"
> warning whenever you used FillChar().


And what do you use? The Default() intrinsic function?

Ondrej


Re: [Lazarus] String vs WideString

2017-08-17 Thread Graeme Geldenhuys via Lazarus

On 2017-08-17 13:40, Marcos Douglas B. Santos via Lazarus wrote:

> Sorry, but every single warning is a... warning... that needs to be
> resolved.

I feel exactly the same. :-)  It took me ages to figure out how to
change my code so I could get rid of the "variable not initialized"
warning whenever you used FillChar().


Regards,
  Graeme

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/

My public PGP key:  http://tinyurl.com/graeme-pgp


Re: [Lazarus] String vs WideString

2017-08-17 Thread Sven Barth via Lazarus
On 17.08.2017 12:17, "Bart via Lazarus" <lazarus@lists.lazarus-ide.org> wrote:
>
> On 8/17/17, Sven Barth via Lazarus wrote:
>
> >> really? delphi came from TP/BP... i was (still am, actually) using
> >> dynamic arrays in TP6 ;)
> >
> > Dynamic arrays in the form of "array of Type" were only introduced in
> > Delphi 3 if I remember correctly. Anything before that needed manual
> > memory management.
>
> I had D3 Pro, and this did definitely NOT support dynamic arrays.
> (Even String still was ShortString.)
> All arrays had to be fixed range.
> The often used construct to bypass this limitation was: Array[0..0] of
> TSomeType and have Range checking off.

Then it was Delphi 4 ^^'

Regards,
Sven


Re: [Lazarus] String vs WideString

2017-08-17 Thread Sven Barth via Lazarus
On 17.08.2017 14:32, "Michael Schnell via Lazarus" <lazarus@lists.lazarus-ide.org> wrote:
>
> On 17.08.2017 12:09, Bart via Lazarus wrote:
>>
>> "Variables of the ordinal type Char are used to store ASCII characters."
>>
> According to this wording, using Windows with ANSI character set would be
> a no-go.

Bart quoted from the TP help. And TP was written for DOS. There wasn't any
Unicode or ANSI around yet...

Regards
Sven


Re: [Lazarus] String vs WideString

2017-08-17 Thread Marcos Douglas B. Santos via Lazarus
On Wed, Aug 16, 2017 at 12:38 PM, Juha Manninen via Lazarus wrote:
> On Wed, Aug 16, 2017 at 5:48 PM, Marcos Douglas B. Santos via Lazarus wrote:
>> Are you saying that I need to do this?
>> (following the first example on this thread)
>
> No, if the parameter is WideString, not a pointer PWideChar, you can
> just call it like you did. Suppress the warning as Mattias told if it
> bothers you. You can also make a helper function so the conversion
> happens in one place.
> Yes, for OLE you need WideString.

"Suppress the warning as Mattias told if it bothers you"

Of course it bothers me.
Sorry, but every single warning is a... warning... that needs to be resolved.
If this is not a problem (or a possible future problem), the compiler
should not give us a warning, right?

Best regards,
Marcos Douglas


Re: [Lazarus] String vs WideString

2017-08-17 Thread Michael Schnell via Lazarus

On 17.08.2017 12:41, Tony Whyman via Lazarus wrote:

> Finally: "In UTF-16, code points greater or equal to 2^16 are encoded
> using /two/ 16-bit code units.

2¹⁵ ???
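
For reference, the boundary in the quoted sentence can be checked mechanically. The following is a minimal sketch in Python rather than Pascal (a choice made here for brevity, not something from the thread); it relies only on Python's built-in `utf-16-le` codec, and the helper name `utf16_units` is made up for this illustration. It counts the 16-bit code units needed on either side of 2^15 and 2^16.

```python
def utf16_units(code_point: int) -> int:
    """Number of 16-bit code units UTF-16 needs for a single code point."""
    return len(chr(code_point).encode("utf-16-le")) // 2

# Values just below and at 2^15 and 2^16.
for cp in (0x7FFF, 0x8000, 0xFFFF, 0x10000):
    print(f"U+{cp:04X}: {utf16_units(cp)} code unit(s)")
```

Under these assumptions the count only changes at U+10000, i.e. at 2^16, which is what the quoted sentence states.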
-Michael


Re: [Lazarus] String vs WideString

2017-08-17 Thread Michael Schnell via Lazarus



On 17.08.2017 12:41, Tony Whyman via Lazarus wrote:

> UCS-2 differs from UTF-16 by being a constant length encoding and only
> capable of encoding characters of BMP, it is supported by many programs."

Rather obviously Embarcadero primarily had UCS-2 in mind when they created
the "Unicode aware" Delphi. While it does in fact support full Unicode,
keeping MyChar:=MyString[i] in place invites "unaware" programmers to
presume UCS-2 coded text.
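
What goes wrong with element-wise indexing of UTF-16 text can be sketched outside Pascal. The Python snippet below is only an analogy: it unpacks a string into 16-bit code units, which is assumed here to resemble what a UTF-16 string type stores, and the helper name `utf16_code_units` is invented for the example.

```python
import struct

def utf16_code_units(text: str) -> list:
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))

# 'a', one emoji outside the BMP (U+1F600), 'b'
units = utf16_code_units("a\U0001F600b")
print(len(units))                    # code units, not code points
print([hex(u) for u in units])

# Indexing element 1 yields a high surrogate, which is only half a character.
print(0xD800 <= units[1] <= 0xDBFF)
```

Code that assumes one element per character works for BMP-only text and breaks quietly on anything else.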


-Michael



Re: [Lazarus] String vs WideString

2017-08-17 Thread Michael Schnell via Lazarus

On 17.08.2017 12:09, Bart via Lazarus wrote:

> "Variables of the ordinal type Char are used to store ASCII characters."


According to this wording, using Windows with ANSI character set would 
be a no-go.


-Michael



Re: [Lazarus] String vs WideString

2017-08-17 Thread Tony Whyman via Lazarus

On 16/08/17 11:05, Juha Manninen via Lazarus wrote:

>> 2. Clean up the char type.
>> ...
>> Why shouldn't there be a single char type that intuitively represents
>> a single character regardless of how many bytes are used to represent it.
>
> What do you mean by "a single character"?
> A "character" in Unicode can mean about 7 different things. Which one
> is your pick?
> This question is for everybody in this thread who used the word "character".

Are you making my points for me? If such a basic term as "character"
means 7 different things then something is badly amiss. It should be
fairly obvious that in this context, character = printable symbol -
whilst for practical reasons allowing for format control characters such
as an "end of line" and "end of string".


I believe that you need to go back to the idea that you have both an 
abstract representation of a character with a constant semantic, 
separate from the actual encoding and for which there may be many 
different and valid encodings. For example, using a somewhat dated 
comparison, a lower case Latin alphabet letter 'a' should always have a
constant semantic, but in ASCII is encoded as decimal 97, while in
EBCDIC is encoded as decimal 129. Even though they have different binary
values, they represent the same abstract character.
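
The two numbers can be reproduced with a short Python sketch (Python is used here only as a convenient calculator). It assumes `cp037`, one common EBCDIC code page that ships with Python, as the EBCDIC variant; other EBCDIC code pages exist.

```python
# cp037 is assumed as the EBCDIC variant; this choice is not from the thread.
ascii_byte = "a".encode("ascii")[0]
ebcdic_byte = "a".encode("cp037")[0]
print(ascii_byte, ebcdic_byte)

# Decoding each byte with its own encoding gives back the same abstract character.
assert bytes([ascii_byte]).decode("ascii") == bytes([ebcdic_byte]).decode("cp037") == "a"
```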


I want a 'char' type in Pascal to represent a character such as a lower
case 'a' regardless of the encoding used. Indeed, for a program to be
properly portable, the programmer should not have to care about the actual
encoding - only that it is a lower case 'a'.


Hence my proposal that a character type should include an implicit or 
explicit attribute that records the encoding scheme used - which could 
vary from ASCII to UTF-32.


You can then go on to define a text string as an array of characters 
with the same encoding scheme.
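
One possible reading of this proposal, sketched in Python purely for illustration: the type `TaggedText`, its methods and the `UNIT_SIZE` table are all hypothetical names, not an existing FPC or Lazarus API. The sketch stores raw bytes together with an encoding tag and compares strings by abstract content.

```python
from dataclasses import dataclass

# Code-unit sizes in bytes for a few encodings; illustrative only.
UNIT_SIZE = {"ascii": 1, "utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}

@dataclass(frozen=True)
class TaggedText:
    """Hypothetical string type: raw bytes plus the encoding that interprets them."""
    data: bytes
    encoding: str

    @classmethod
    def from_str(cls, text: str, encoding: str) -> "TaggedText":
        return cls(text.encode(encoding), encoding)

    def unit(self, index: int) -> bytes:
        """Return the index-th code unit; the unit size is looked up at run time."""
        size = UNIT_SIZE[self.encoding]
        start = index * size
        if index < 0 or start + size > len(self.data):
            raise IndexError(index)
        return self.data[start:start + size]

    def same_text(self, other: "TaggedText") -> bool:
        """Compare abstract content, independent of each side's encoding."""
        return self.data.decode(self.encoding) == other.data.decode(other.encoding)

a8 = TaggedText.from_str("abc", "utf-8")
a16 = TaggedText.from_str("abc", "utf-16-le")
print(a8.unit(1), a16.unit(1), a8.same_text(a16))
```

Note that `unit` has to consult the encoding on every access; that run-time lookup is the cost such a dynamic type would pay for indexing.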



>> Yes, in a world where we have to live with UTF8, UTF16, UTF32, legacy code
>> pages and Chinese variations on UTF8, that means that dynamic attributes
>> have to be included in the type. But isn't that the only way to have
>> consistent and intuitive character handling?
>
> What do you mean? Chinese don't have a variation of UTF8.
> UTF8 is global unambiguous encoding standard, part of Unicode.


I was referring to GB 18030 and that it has one, two and four byte code 
points.
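
The three lengths can be observed with Python's built-in `gb18030` codec. This is a small sketch only; the sample characters are chosen here for illustration and are not from the thread.

```python
samples = {
    "A": "ASCII letter",
    "\u4e2d": "common CJK ideograph (U+4E2D)",
    "\U0001F600": "emoji outside the BMP (U+1F600)",
}
for ch, label in samples.items():
    encoded = ch.encode("gb18030")
    print(f"{label}: {len(encoded)} byte(s), hex {encoded.hex()}")
```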


> The fundamental problem is that you want to hide the complexity of
> Unicode by some magic String type of a compiler.
> It is not possible. Unicode remains complex but the complexity is NOT
> in encodings!
> No, a codepoint's encoding is the easy part. For example I was easily
> able to create a unit to support encoding agnostic code. See unit
> LazUnicode in package LazUtils.
> The complexity is elsewhere:
> - "Character" composed of codepoints in precomposed and decomposed
>   (normalized) forms.
> - Compare and sort text based on locale.
> - Uppercase / Lowercase rules based on locale.
> - Glyphs
> - Graphemes
> - etc.
>
> I must admit I don't understand well those complex parts.
> I do understand codeunits and codepoints, and I understand they are
> the easy part.
>
> Juha

The point I believe that you are missing is to consider that a character 
is an abstract symbol with a semantic independent of how it is encoded. 
Collation sequences are independent of encoding and should remain the 
same regardless of how a character set is encoded.
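
Two items from Juha's list quoted above, precomposed versus decomposed forms and case rules, can be made concrete with a short Python sketch using the standard `unicodedata` module. It is an illustration in another language, not a statement about what FPC or LazUtils do.

```python
import unicodedata

precomposed = "\u00e9"     # 'é' as a single code point
decomposed = "e\u0301"     # 'e' followed by COMBINING ACUTE ACCENT

# The same visible character, but different code point sequences.
print(precomposed == decomposed)
print(len(precomposed), len(decomposed))

# Normalization converts between the two forms.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == decomposed)

# Case mapping is not one-to-one either: German sharp s uppercases to two letters.
print("\u00df".upper())
```

None of this depends on whether the text is held as UTF-8, UTF-16 or UTF-32; the differences exist at the code point level.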



Re: [Lazarus] String vs WideString

2017-08-17 Thread Tony Whyman via Lazarus

On 16/08/17 11:05, Juha Manninen via Lazarus wrote:

> On Mon, Aug 14, 2017 at 4:21 PM, Tony Whyman via Lazarus wrote:
>
>> UTF-16/Unicode can only store 65,536 characters while the Unicode standard
>> (that covers UTF8 as well) defines 136,755 characters.
>> UTF-16/Unicode's main advantage seems to be for rapid indexing of large
>> strings.
>
> That shows complete ignorance from your side about Unicode.
> You consider UTF-16 as a fixed-width encoding.  :(
> Unfortunately many other programmers had the same wrong idea or they
> were just lazy. The result anyway is a lot of broken UTF-16 code out
> there.

You do like to use the word "ignorance" don't you. You can if you want 
take the view that all the "other programmers" that got the wrong idea 
are "stupid monkeys that don't know any better"  or, alternatively, that 
they just wanted a nice cup of tea rather than the not quite tea drink 
that was served up.


Wikipedia sums the problem up nicely: "The early 2-byte encoding was 
usually called "Unicode", but is now called "UCS-2". UCS-2 differs from 
UTF-16 by being a constant length encoding and only capable of encoding 
characters of BMP, it is supported by many programs."


This is where the problem starts. The definition of "Unicode" was
changed (foolishly in my opinion) after it had been accepted by the
community and the result is confusion. Hence my first point about not 
even using it. In using "UTF16/Unicode" I was attempting to convey the 
common use of the term which is to see UTF-16 as what is now defined as 
UCS-2. This is because hardly anyone I know uses UCS-2 and instead says 
"Unicode". Perhaps I just spend too much time amongst the ignorant.


Wikipedia also makes the wonderful point that "The UTF-16 encoding 
scheme was developed as a compromise to resolve this impasse in version 
2.0". The impasse having resulted from "4 bytes per character wasted a 
lot of disk space and memory, and because some manufacturers were 
already heavily invested in 2-byte-per-character technology".


Finally: "In UTF-16, code points greater or equal to 2^16 are encoded 
using /two/ 16-bit code units. The standards organizations chose the 
largest block available of un-allocated 16-bit code points to use as 
these code units (since most existing UCS-2 data did not use these code 
points and would be valid UTF-16). Unlike UTF-8 they did not provide a 
means to encode these code points".
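
The surrogate mechanism described in the quote follows a fixed arithmetic rule. Below is a minimal sketch in Python (not Pascal); the function name `to_surrogate_pair` is made up for the example, and the result is cross-checked against Python's built-in UTF-16 encoder.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))

# Cross-check against the built-in encoder (big-endian, no byte order mark).
expected = chr(0x1F600).encode("utf-16-be")
assert expected == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

The ranges 0xD800..0xDBFF and 0xDC00..0xDFFF are the reserved block of code points the quote refers to.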


Which is from where I get my own view that UTF-16, as defined by the 
standards, is pointless. If you keep it to a UCS-2 (like) subset then 
you can get rapid indexing of character arrays. But as soon as you 
introduce the possibility of some characters being encoded as two 16-bit 
units then you lose rapid indexing and I can see no advantage over UTF-8 
- plus you get all the fun of worrying about byte order.
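
The byte order point is easy to see in a quick Python sketch (illustrative only, using the built-in codecs): the same two BMP characters produce two different UTF-16 byte sequences but a single UTF-8 one.

```python
text = "A\u20ac"   # 'A' and the euro sign, both inside the BMP

print(text.encode("utf-16-le").hex())   # little-endian code units
print(text.encode("utf-16-be").hex())   # big-endian code units
print(text.encode("utf-8").hex())       # one byte sequence, no byte-order variants

# A byte order mark lets a decoder tell the two UTF-16 layouts apart.
bom_le = b"\xff\xfe" + text.encode("utf-16-le")
bom_be = b"\xfe\xff" + text.encode("utf-16-be")
assert bom_le.decode("utf-16") == bom_be.decode("utf-16") == text
```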


Indeed, I believe those lazy programmers that you referred to are
actually making a conscious decision to prefer to work with a 16-bit
code point only UTF-16 subset (i.e. the Basic Multilingual Plane)
precisely so that they can do rapid indexing. As soon as you bring in 2
x 16-bit code unit code points, you lose that benefit - and perhaps you
should be using UTF-32.
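
The indexing trade-off can be sketched as follows. This is a Python illustration with invented helper names; it assumes well-formed UTF-16 input and is not how any particular RTL implements it. Finding the n-th code point in UTF-16 needs a scan, while UTF-32 allows a direct index.

```python
import struct

def utf16_units(text):
    raw = text.encode("utf-16-le")
    return struct.unpack(f"<{len(raw) // 2}H", raw)

def utf32_units(text):
    raw = text.encode("utf-32-le")
    return struct.unpack(f"<{len(raw) // 4}I", raw)

def nth_code_point_utf16(units, n):
    """Find the n-th code point by scanning; a surrogate pair counts as one."""
    i = 0
    for _ in range(n):
        i += 2 if 0xD800 <= units[i] <= 0xDBFF else 1
    first = units[i]
    if 0xD800 <= first <= 0xDBFF:
        return 0x10000 + ((first - 0xD800) << 10) + (units[i + 1] - 0xDC00)
    return first

text = "a\U0001F600b"
u16, u32 = utf16_units(text), utf32_units(text)
print(hex(u16[2]))                          # naive index lands on a low surrogate
print(hex(nth_code_point_utf16(u16, 2)))    # linear scan finds 'b'
print(hex(u32[2]))                          # UTF-32: direct index finds 'b'
```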


IMHO, Linux has got it right by using UTF-8 as the standard for 
character encoding and one of Lazarus's USPs is that it follows that 
lead - even for Windows. I can see why a program that does intensive 
text scanning will use a UTF-16 constrained to the BMP (i.e. 16-bit 
only), but not why anyone would prefer an unconstrained UTF-16 over UTF-8.




Re: [Lazarus] String vs WideString

2017-08-17 Thread Bart via Lazarus
On 8/17/17, Sven Barth via Lazarus wrote:

>> really? delphi came from TP/BP... i was (still am, actually) using
>> dynamic arrays in TP6 ;)
>
> Dynamic arrays in the form of "array of Type" were only introduced in
> Delphi 3 if I remember correctly. Anything before that needed manual memory
> management.

I had D3 Pro, and this did definitely NOT support dynamic arrays.
(Even String still was ShortString.)
All arrays had to be fixed range.
The often used construct to bypass this limitation was: Array[0..0] of
TSomeType and have Range checking off.

Bart


Re: [Lazarus] String vs WideString

2017-08-17 Thread Bart via Lazarus
On 8/17/17, Luca Olivetti via Lazarus  wrote:


> I started using strings as communication buffers since delphi 2. There
> weren't even dynamic arrays then...

From the Turbo Pascal Help:

"A string type variable is a sequence of characters ..."

And then when you click on "characters":

"Char type
 ---
Variables of the ordinal type Char are used to store ASCII characters."

None of this suggests that string is a good type for storing arbitrary
byte sequences.

You misused an implementation detail of the type (Ansi)String.
And now you blame fpc.

You should have used a sane type for your buffer from the start.
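
The same separation between text and byte buffers exists in other languages. As a loose analogy in Python (not Pascal, and not a description of how FPC handles AnsiString code pages), a byte type keeps arbitrary data intact while a text round trip may reject or alter it. The payload below is a made-up binary frame.

```python
payload = bytes([0x02, 0xFF, 0x80, 0x03])   # hypothetical binary frame, not text

# A byte buffer keeps every value exactly as received.
buffer = bytearray(payload)
assert bytes(buffer) == payload

# Treating the same bytes as UTF-8 text fails outright...
try:
    payload.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)

# ...and a lossy decode/encode round trip silently changes the data.
round_trip = payload.decode("utf-8", errors="replace").encode("utf-8")
print(round_trip != payload)
```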

Bart


Re: [Lazarus] dynamic string proposal

2017-08-17 Thread Sven Barth via Lazarus
On 17.08.2017 11:11, "Michael Schnell via Lazarus" <lazarus@lists.lazarus-ide.org> wrote:
>
> Maybe, Sven could answer to this mail in the other thread...
>

I provided an example in my answer to Tony Whyman in the same subbranch of
the thread.

Regards,
Sven


Re: [Lazarus] String vs WideString

2017-08-17 Thread Sven Barth via Lazarus
On 17.08.2017 11:21, "Michael Schnell via Lazarus" <lazarus@lists.lazarus-ide.org> wrote:
>
> On 16.08.2017 22:40, Sven Barth via Lazarus wrote:
>>
>> Trunk supports Insert() and Delete() on dynamic arrays, Concat() and +
>> are on the near term ToDo list.
>
> Supposedly "pos", as well. But that does not really help if we don't have
> a TStringList workalike, and supposedly several more library functions.
>
> That is why I feel empowering the string paradigm for such use would be
> more appropriate. (See the thread "dynamic string proposal").

Why do you want to stuff everything and the kitchen sink into TStrings?
There are much more suitable and less specialized container types available
for this (to name a few: TFPGList, TList<>, etc.).

Regards,
Sven


Re: [Lazarus] String vs WideString

2017-08-17 Thread Michael Schnell via Lazarus

On 16.08.2017 22:40, Sven Barth via Lazarus wrote:

> Trunk supports Insert() and Delete() on dynamic arrays, Concat() and +
> are on the near term ToDo list.


Supposedly "pos", as well. But that does not really help if we don't 
have a TStringList workalike, and supposedly several more library 
functions.


That is why I feel empowering the string paradigm for such use would be 
more appropriate. (See the thread "dynamic string proposal").


-Michael



Re: [Lazarus] String vs WideString

2017-08-17 Thread Michael Schnell via Lazarus

On 16.08.2017 20:26, Luca Olivetti via Lazarus wrote:

> Call me lazy but I don't want to reinvent the wheel and re-implement
> from scratch the functionality that a plain ansistring provides and
> TBytes to this day doesn't.

So please continue in the thread "dynamic string proposal".

Exactly this is part of what is discussed there.

-Michael



Re: [Lazarus] dynamic string proposal

2017-08-17 Thread Michael Schnell via Lazarus

Maybe Sven could answer this mail in the other thread...

On 14.08.2017 18:47, Sven Barth via Lazarus wrote:

> The main problem of such a dynamic type would be the inability to do
> fast indexing as the compiler would need to insert runtime checks for
> the size of a character.



What "indexing" are you thinking of?
Could you give an example where such a difference is supposed to become
important?


(As you know I wrote a paper where I claimed the contrary. I'd like to
revise it if necessary.)


-Michael