Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-30 Thread Rolf Grunsky via Lazarus

On 10/30/2016 06:15 PM, Lars wrote:

> Off topic, off list, but..
>
> > --
> > TRUTH in her dress finds facts too tight.
> > In fiction she moves with ease.
> > Stray Birds by Rabindranath Tagore
>
> What does this mean?
> Something about this:
> http://www.the-niceguy.com/articles/Nutballs.html
>
> The only thing I can think of is a woman wearing a dress, that is tight,
> and is incapable of processing facts. I double take on complicated English
> quotations.



Yah, it's way off topic but I'll bite just this once.

It's poetry. It's not to be taken literally. If your native language is 
not English, it may give you some problems.


For the record, Rabindranath Tagore was an Indian writer. The quote is 
from his book "Stray Birds".


What this couplet is saying is that sometimes it is easier to use a 
fictional account to express something true. As I've said, it is poetry; 
these are metaphors, figures of speech. I find much of Tagore's language 
very beautiful. We've chosen another Tagore "Stray Bird" as the epitaph 
on our grave marker.


Here are a couple more from "Stray Birds":

The mind, sharp but not broad, sticks at every point but does not move.

and

A mind all logic is like a knife all blade.
It makes the hand bleed that uses it.

There are many more...
--
   TRUTH in her dress finds facts too tight.
   In fiction she moves with ease.
   Stray Birds by Rabindranath Tagore
--
___
Lazarus mailing list
Lazarus@lists.lazarus-ide.org
http://lists.lazarus-ide.org/listinfo/lazarus


Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-24 Thread Rolf Grunsky via Lazarus

On 10/22/2016 06:25 AM, Juha Manninen via Lazarus wrote:

> On Sat, Oct 22, 2016 at 4:12 AM, Martin Frb via Lazarus
>  wrote:
> > Which ones does it not support?
> > When I added it to SynEdit it was complete. It had all the combinings that
> > the utf8 standard had back then. (at least that I could find in the
> > documentation)
> >
> > Of course if a new combining range is added, it will not contain it. If that
> > is needed one needs an external (OS or otherwise) library, that can/will be
> > updated on those occasions.
> >
> > Mind "combining codepoints" have nothing to do with how many codepoints will
> > be represented by one glyph.
>
> Ok, I was confusing the Unicode terms again.
> I guess the biggest complexity is in glyphs and ligatures. I still
> don't understand their details.
> However for a program that must care about Unicode, like a text layout
> app, the rules for combining codepoints and glyphs are equally
> important. Codepoints for one glyph should never be split or copied
> separately. Isn't it so?
> SynEdit is a text layout app, too.
> In that sense the function IsCombining is not enough for practical
> purposes. A comprehensive library function should take care of glyphs
> (+ other rules), too.
>
> I looked at Bero's PUCU and the other links:
> http://forum.lazarus.freepascal.org/index.php/topic,33064.msg214342.html#msg214342
> but it went over my head. I must study the issue more later.
>
> * A reality check! *
> Despite problems and incompleteness of our Unicode support, it is
> actually better than most other solutions out there.
> Ok, most programming tools support Unicode somehow but people use them wrong.
> A good example is our forum SMF software. It deals with text layout
> and definitely should handle Unicode but it does not.
> Not even single Codepoints beyond BMP which should be the most easy
> case! No combining rules needed or anything.
> Try to add this text to a forum post:  (I hope the mail SW can deal with it...)
>   "Have 🍷 for FPC 💓 Lazarus."
>
> Now the fact is that code made with FPC / Lazarus using the LazUnicode
> functions and enumerators supports Unicode already much better than
> most code out there!
>
> Juha



I think that there is a degree of confusion about the use of ligatures. 
Ligatures (at least in English) are typographical elements, not language 
elements. Not all typefaces support them and the code for a ligature 
should never appear in the source text. It is the function of the 
display software to combine adjacent characters and display the 
appropriate ligature if and only if the font that is used supports them.


A proportional typeface may display the character sequence 'fl' by using 
the appropriate ligature glyph. A monospaced typeface would display the 
same sequence as two characters, as would any typeface that did not 
include the ligature glyphs.


Ligatures improve the appearance of text but are strictly a display 
function and shouldn't actually appear in the text itself. This may not 
be true for other writing systems and other languages but is certainly 
true for English and perhaps other western European languages as well.




Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-24 Thread Michael Schnell via Lazarus

On 24.10.2016 15:09, Mattias Gaertner via Lazarus wrote:
> These functions exist.

This is of course great (although the lack of documentation presumably 
makes them hard to use).


In fact I am not asking, but the question is part of the OP's problem. 
And here I wanted to point out the ambiguity of the term "identical" in 
that regard.


-Michael



Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-24 Thread Mattias Gaertner via Lazarus
On Mon, 24 Oct 2016 14:35:28 +0200
Michael Schnell  wrote:

> [...] but even trying to find out whether a very short piece of information is
> identical is not decently possible.
>[...]
> I meant to point out exactly this ambiguity:
> 
> identically coded vs. identically looking (e.g. combining codepoints), 
> vs identical presumed letters if looking differently (ligatures), ...

About "identically coded":
That is "decently possible" - simple string/byte comparison.

About "identically looking":
I guess you mean composed vs decomposed form. That is a matter of converting
between normalization forms. There are functions to normalize, but the
information is scattered and it would be nice if someone would write a page.

About "ligatures":
I guess you mean "collation". Same problem. Needs better documentation.

Basically you are asking for various compare and normalization
functions. These functions exist.

Mattias


Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-24 Thread Michael Schnell via Lazarus

On 24.10.2016 13:34, Mattias Gaertner via Lazarus wrote:
> That depends on what you mean with "identical".

You are absolutely right. Very sorry for being critical while being 
vague myself (again typing faster than thinking) ;) .


I meant to point out exactly this ambiguity:

identically coded vs. identically looking (e.g. combining codepoints), 
vs identical presumed letters if looking differently (ligatures), ...


-Michael




Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-24 Thread Mattias Gaertner via Lazarus
On Mon, 24 Oct 2016 12:53:31 +0200
Michael Schnell via Lazarus  wrote:

> On 23.10.2016 11:31, Jürgen Hestermann via Lazarus wrote:
> >
> > But Unicode should have cared.
> > It was made for its use on computers.  
> I don't think so.
> 
> I suppose it was defined with printing out digital documents in
> mind, but not working with them.

Nonsense. The various normal forms aren't needed for printing, but for
"working with them". Same for the various encodings like UTF-8 and
UTF-16.
Think about other typesetting systems with diacritics, like TeX.
That is made for printing documents, not for working with them.


> At least this is what the outcome suggests: printing works just fine, but
> even trying to find out whether a very short piece of information is
> identical is not decently possible.

That depends on what you mean with "identical".
I guess you mean the topic "collation". It would be nice if someone
with some knowledge about that topic could start a wiki page or
fpdoc topic to list the common functions for them.

Mattias


Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-24 Thread Michael Schnell via Lazarus

On 23.10.2016 11:31, Jürgen Hestermann via Lazarus wrote:


> But Unicode should have cared.
> It was made for its use on computers.

I don't think so.

I suppose it was defined with printing out digital documents in 
mind, but not working with them.


At least this is what the outcome suggests: printing works just fine, but 
even trying to find out whether a very short piece of information is 
identical is not decently possible.


-Michael


Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-24 Thread Michael Schnell via Lazarus

On 21.10.2016 12:05, Gabor Boros via Lazarus wrote:


> On 2016-10-21 10:25, Juha Manninen via Lazarus wrote:
>
> > * Please read the wiki page ...
>
> I read, I read but it contains a buggy example... ;-)
>
> I need a quick and a rock solid solution.

AFAIK, the only decent advice is never to use the numbers from Pos() / 
Length() / Copy() / Delete() for anything other than these functions. 
Don't try to interpret these numbers in any way. Never use the term 
"Character".


-Michael


Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-23 Thread Mattias Gaertner via Lazarus
On Sun, 23 Oct 2016 11:31:00 +0200
Jürgen Hestermann via Lazarus  wrote:

> On 2016-10-22 at 22:38, Mattias Gaertner via Lazarus wrote:
>  > Languages don't care about programmers.  
> 
> True.
> But Unicode should have cared.
> It was made for its use on computers.
> Pressing each and every language peculiarity
> into Unicode was a mistake and
> made Unicode so hard to use.

No one forces you to consider every "language peculiarity".


Mattias


Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-23 Thread Jürgen Hestermann via Lazarus

On 2016-10-22 at 22:38, Mattias Gaertner via Lazarus wrote:
> Languages don't care about programmers.

True.
But Unicode should have cared.
It was made for its use on computers.
Pressing each and every language peculiarity
into Unicode was a mistake and
made Unicode so hard to use.



Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-22 Thread Mattias Gaertner via Lazarus
On Sat, 22 Oct 2016 13:25:30 +0300
Juha Manninen via Lazarus  wrote:

>[...]
> I guess the biggest complexity is in glyphs and ligatures. I still
> don't understand their details.

There is nothing to understand. Some languages have irregular letters.
Same as English has irregular verbs. You don't "understand" them, you
simply learn them.
As a programmer you don't need to learn them, but you should be aware
that many languages can't be mapped to simple arrays of characters.


> However for a program that must care about Unicode, like a text layout
> app, the rules for combining codepoints and glyphs are equally
> important. Codepoints for one glyph should never be split or copied
> separately. Isn't it so?

"Never" is wrong here. For example some editors allow selecting the single
letters of a ligature. Also when comparing words you may want to ignore the
diacritical signs using the decomposed form of Unicode.
But afaik you are right that most programs never have an issue with
ligatures.
Btw, we need a wiki page about collation.


>[...]
> Despite problems and incompleteness of our Unicode support, it is
> actually better than most other solutions out there.
> Ok, most programming tools support Unicode somehow but people use them wrong.
> A good example is our forum SMF software. It deals with text layout
> and definitely should handle Unicode but it does not.
> Not even single Codepoints beyond BMP which should be the most easy
> case! No combining rules needed or anything.

Yes, that is basic Unicode encoding. No ligatures, no bidi. I agree
that this is the minimum for supporting Unicode.
Synedit goes much further.
And the native widgets often have pretty good support for the language
of the user. So the LCL controls using native widgets have
automatically good Unicode support.

Mattias


Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-22 Thread Mattias Gaertner via Lazarus
On Sat, 22 Oct 2016 12:13:04 +0200
Jürgen Hestermann via Lazarus  wrote:

> On 2016-10-22 at 10:53, Mattias Gaertner via Lazarus wrote:
>  > Maybe you mean ligatures? Many languages have them, even German:
>  > https://en.wikipedia.org/wiki/Typographic_ligature  
> 
> I thought that ligatures are just a matter of the font
> but not the unicode representation?
> When I write a text which contains the two letters "fi"
> they should be two separate characters in my unicode string
> no matter whether they will be printed as a ligature on the printer or
> screen.
> So ligatures should not influence string encoding in FPC.
> Or am I missing something here?

Ligatures are a group of different issues.
The "fi" ligature is a "stylistic ligature", aka just a font
issue and as such is always represented by the two Unicode codepoints.

The wiki page describes various other types of ligatures, where the
Unicode representation can vary.

Languages don't care about programmers.


Mattias


Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-22 Thread Juha Manninen via Lazarus
On Sat, Oct 22, 2016 at 1:13 PM, Jürgen Hestermann via Lazarus
 wrote:
> So ligatures should not influence string encoding in FPC.
> Or am I missing something here?

I guess it matters for a text layout software. It should not separate
the two characters forming a ligature.
I admit I don't know the issue well; I am still figuring out the details myself.

Juha


Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-22 Thread Juha Manninen via Lazarus
On Sat, Oct 22, 2016 at 4:12 AM, Martin Frb via Lazarus
 wrote:
> Which ones does it not support?
> When I added it to SynEdit it was complete. It had all the combinings that
> the utf8 standard had back then. (at least that I could find in the
> documentation)
>
> Of course if a new combining range is added, it will not contain it. If that
> is needed one needs an external (OS or otherwise) library, that can/will be
> updated on those occasions.
>
> Mind "combining codepoints" have nothing to do with how many codepoints will
> be represented by one glyph.

Ok, I was confusing the Unicode terms again.
I guess the biggest complexity is in glyphs and ligatures. I still
don't understand their details.
However for a program that must care about Unicode, like a text layout
app, the rules for combining codepoints and glyphs are equally
important. Codepoints for one glyph should never be split or copied
separately. Isn't it so?
SynEdit is a text layout app, too.
In that sense the function IsCombining is not enough for practical
purposes. A comprehensive library function should take care of glyphs
(+ other rules), too.

I looked at Bero's PUCU and the other links:
 
http://forum.lazarus.freepascal.org/index.php/topic,33064.msg214342.html#msg214342
but it went over my head. I must study the issue more later.

* A reality check! *
Despite problems and incompleteness of our Unicode support, it is
actually better than most other solutions out there.
Ok, most programming tools support Unicode somehow but people use them wrong.
A good example is our forum SMF software. It deals with text layout
and definitely should handle Unicode but it does not.
Not even single Codepoints beyond BMP which should be the most easy
case! No combining rules needed or anything.
Try to add this text to a forum post:  (I hope the mail SW can deal with it...)
  "Have 🍷 for FPC 💓 Lazarus."
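
To make the "beyond BMP" case concrete, here is a minimal sketch (not taken
from LazUtils; the program name is made up for illustration) that decodes the
wine glass, U+1F377, from its four UTF-8 bytes and derives the UTF-16
surrogate pair by plain arithmetic. It assumes valid input and uses only the
RTL. The bytes are written with Chr() so the source file encoding plays no role.

```pascal
program BeyondBMP;
{$mode objfpc}{$H+}
uses SysUtils;

var
  S: String;
  CP, V: Cardinal;
begin
  // U+1F377 (wine glass) encoded as UTF-8: F0 9F 8D B7.
  S := Chr($F0) + Chr($9F) + Chr($8D) + Chr($B7);

  // Decode the 4-byte form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
  CP := (Cardinal(Ord(S[1]) and $07) shl 18) or
        (Cardinal(Ord(S[2]) and $3F) shl 12) or
        (Cardinal(Ord(S[3]) and $3F) shl 6) or
         Cardinal(Ord(S[4]) and $3F);

  WriteLn('UTF-8 code units (bytes): ', Length(S));
  WriteLn('codepoint: U+', IntToHex(Int64(CP), 4));

  // UTF-16 needs a surrogate pair for anything above U+FFFF.
  V := CP - $10000;
  WriteLn('UTF-16 code units: ',
          IntToHex(Int64($D800 + (V shr 10)), 4), ' ',
          IntToHex(Int64($DC00 + (V and $3FF)), 4));
end.
```

One codepoint, four bytes in UTF-8, two words in UTF-16: software that
assumes one code unit per codepoint breaks in either encoding.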

Now the fact is that code made with FPC / Lazarus using the LazUnicode
functions and enumerators supports Unicode already much better than
most code out there!

Juha


Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-22 Thread Jürgen Hestermann via Lazarus

On 2016-10-22 at 10:53, Mattias Gaertner via Lazarus wrote:
> Maybe you mean ligatures? Many languages have them, even German:
> https://en.wikipedia.org/wiki/Typographic_ligature

I thought that ligatures are just a matter of the font
but not the unicode representation?
When I write a text which contains the two letters "fi"
they should be two separate characters in my unicode string
no matter whether they will be printed as a ligature  on the printer or screen.
So ligatures should not influence string encoding in FPC.
Or am I missing something here?


Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-22 Thread Mattias Gaertner via Lazarus
On Sat, 22 Oct 2016 02:12:34 +0100
Martin Frb via Lazarus  wrote:

>[...]
> It is my understanding (but I do not know for sure) that in some 
> languages (such as Arabic) certain letter combinations form a single 
> glyph (afaik/google see https://en.wikipedia.org/wiki/Hamzah combined 
> with a letter). Though maybe it is considered 2 glyph? I do not know 
> Arabic at all.

Maybe you mean ligatures? Many languages have them, even German:
https://en.wikipedia.org/wiki/Typographic_ligature

Scary: It may depend on the font what letters are combined to a
ligature. Even English can have them.


Mattias
 


Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-21 Thread Martin Frb via Lazarus

On 21/10/2016 22:16, Juha Manninen via Lazarus wrote:

> UTF-16. It does not support all the complex rules of combining
> CodePoints, but it apparently works well for accented characters in
> western languages.



Which ones does it not support?
When I added it to SynEdit it was complete. It had all the combinings 
that the utf8 standard had back then. (at least that I could find in the 
documentation)


Of course if a new combining range is added, it will not contain it. If 
that is needed one needs an external (OS or otherwise) library, that 
can/will be updated on those occasions.


Mind "combining codepoints" have nothing to do with how many codepoints 
will be represented by one glyph.


"â" is one character. But it can be a single codepoint (in UTF-16 one 
code unit, i.e. one word; in UTF-8 several code units, i.e. bytes), or 
2 codepoints ("a" + combining "^").

"fi" are 2 chars. But they may be 2 glyphs or 1 glyph (a ligature).

It is my understanding (but I do not know for sure) that in some 
languages (such as Arabic) certain letter combinations form a single 
glyph (afaik/google see https://en.wikipedia.org/wiki/Hamzah combined 
with a letter). Though maybe it is considered 2 glyphs? I do not know 
Arabic at all.
Also in some scripts glyphs are displayed in an order different from 
their occurrence in the text.
All of this however has nothing to do with combining codepoints, or what 
counts as a character.




Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-21 Thread Juha Manninen via Lazarus
On Fri, Oct 21, 2016 at 2:26 PM, Juha Manninen
 wrote:
> No, neither FPC nor Lazarus have library code to deal with [combined 
> CodePoints] yet.
> The goal is to have an enumerator for user perceived characters, just
> like LazUnicode unit has for encoding agnostic CodePoints.

Sorry, that was not accurate.
Unit LazUnicode already has TUnicodeCharacterEnumerator which is able
to iterate combined accented Unicode characters.
It calls either function UTF8IsCombining or UTF16IsCombining depending
on the default encoding in use. Yes, Delphi and UTF-16 are supported.
The code was basically copied from SynEdit and then ported also to
UTF-16. It does not support all the complex rules of combining
CodePoints, but it apparently works well for accented characters in
western languages.

This:
 operator Enumerator(A: String): TUnicodeCharacterEnumerator;
would enable it for the for-in loop, but it is commented out now. The
current for-in loop enumerator works with CodePoints.

There is a test project in components/lazutils/test/LazUnicodeTest.lpi.
It includes combining CodePoints, too. Please take a look if you are interested.

Juha


Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-21 Thread Juha Manninen via Lazarus
On Fri, Oct 21, 2016 at 5:08 PM, Jürgen Hestermann via Lazarus
 wrote:
> And again we are at the point where you need to understand what goes on
> under the hood... ;-)

Yes but that is true with any programming.
I am truly happy that we have Unicode instead of the old system
codepages. I remember text full of question marks earlier a lot but
not any more. Things are getting better...
I don't even know how the codepages worked when one text had many
languages. I don't even care now because we have Unicode. :)


On Fri, Oct 21, 2016 at 5:15 PM, Jürgen Hestermann via Lazarus
 wrote:
> The problem is, that Unicode has a code point for "á" but
> also allows to compose this character by having an "a"
> and an "´" printed over each other.
> I will never understand why this was allowed because
> I thought that Unicode was introduced to overcome such
> issues by defining a huge number of code points directly.
>
> Nevertheless, if you have such a situation then you cannot
> search for a byte sequence as there are 2 possible representations
> of the same character.

That is all true although Gabor's problem was not caused by it.
His LCL app used the default UTF-8 strings but the console program
used Windows codepage.
Adding to the confusion, Windows console codepage is different from
its system codepage (if I have understood right). This is another
reason to use the default UTF-8 system, it handles it all behind the
scenes.
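
The 1-versus-2 result can be reproduced without any console or widget, as a
rough sketch. It assumes the console delivered the key press in an OEM
codepage such as 437, 850 or 852, where Alt+160 (byte $A0) is "á"; the thread
does not say which codepage was actually active, so that part is a guess. The
variable names are made up for illustration.

```pascal
program OneOrTwoBytes;
{$mode objfpc}{$H+}

var
  FromConsole, FromEdit: String;
begin
  // Assumption: OEM codepage 437/850/852, where byte $A0 is "á" (Alt+160).
  FromConsole := Chr($A0);
  // The same letter, U+00E1, as the LCL hands it over in UTF-8: C3 A1.
  FromEdit := Chr($C3) + Chr($A1);

  // Length() counts code units (bytes here), not letters.
  WriteLn('console string: ', Length(FromConsole), ' byte(s)');
  WriteLn('edit box string: ', Length(FromEdit), ' byte(s)');
end.
```

The letter is the same in both cases; only the encoding of the bytes differs,
and Length() reports the bytes.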

> I have given up on taking care about such composed characters
> and assume that all Unicode strings are normalized.

I have understood the composed version (many codepoints / character)
is the recommended normalized one.
We must support it properly in future.
The combining rules are extremely complex. Benjamin Rosseaux (BeRo in
forum) has code for it. There was some other code, too. I must dive
into it sometime in future.

In fact we have simple code for combined accented characters in
LazUnicode unit, despite of what I wrote earlier in this thread.
It was basically copied from SynEdit. I will write another post...

Juha


Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-21 Thread Jürgen Hestermann via Lazarus

On 2016-10-21 at 13:23, Gabor Boros via Lazarus wrote:
> I will know if somebody describes what the difference is between an á and
> an á character at two points of my program.

The problem is, that Unicode has a code point for "á" but
also allows to compose this character by having an "a"
and an "´" printed over each other.
I will never understand why this was allowed because
I thought that Unicode was introduced to overcome such
issues by defining a huge number of code points directly.

Nevertheless, if you have such a situation then you cannot
search for a byte sequence as there are 2 possible representations
of the same character.

I have given up on taking care about such composed characters
and assume that all Unicode strings are normalized.
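
As an illustration of the two representations, the following sketch builds
"á" both ways from explicit UTF-8 bytes and shows that a byte comparison and
a byte search treat them as different strings. It uses only RTL routines and
does no normalization; the program and variable names are made up.

```pascal
program TwoForms;
{$mode objfpc}{$H+}

var
  Composed, Decomposed: String;
begin
  Composed   := Chr($C3) + Chr($A1);        // U+00E1, precomposed "á"
  Decomposed := 'a' + Chr($CC) + Chr($81);  // "a" + U+0301 combining acute

  WriteLn('bytes: ', Length(Composed), ' vs ', Length(Decomposed));
  // Byte-wise comparison: the strings differ although they look the same.
  WriteLn('equal as bytes: ', Composed = Decomposed);
  // Searching for one form inside the other finds nothing (Pos returns 0).
  WriteLn('Pos: ', Pos(Composed, Decomposed));
end.
```

Bringing both strings into the same normalization form before comparing
would make them equal; that step is deliberately left out here.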



Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-21 Thread Jürgen Hestermann via Lazarus

On 2016-10-21 at 14:59, Juha Manninen via Lazarus wrote:
> On Fri, Oct 21, 2016 at 3:24 PM, Gabor Boros via Lazarus
>  wrote:
>> Why the below example better than a for loop with UTF8Length and UTF8Copy
>> for go through the string?
> Because it is MUCH faster. It scales linearly, O(n).
> Calling UTF8Length() and UTF8Copy() inside the loop makes it
> polynomial O(n^2) or worse depending on how many UTF8...() calls you
> have there.

And again we are at the point where you need to understand what goes on under 
the hood... ;-)




Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-21 Thread Juha Manninen via Lazarus
On Fri, Oct 21, 2016 at 3:24 PM, Gabor Boros via Lazarus
 wrote:
> Why the below example better than a for loop with UTF8Length and UTF8Copy
> for go through the string?

Because it is MUCH faster. It scales linearly, O(n).
Calling UTF8Length() and UTF8Copy() inside the loop makes it
polynomial O(n^2) or worse depending on how many UTF8...() calls you
have there.

Yes, we have seen complaints that UTF-8 is unusable because you must
use the slow UTF8Length() and UTF8Copy(), and UTF-16 is better because
you can use fixed width S[i] indexing.
That is obviously based on misunderstanding of both encodings.

Hint: if you need to iterate CodePoints, you can also use the
enumerator from LazUnicode unit. It uses the same concept as the
example in wiki page. It allows this code:

  for ch in s do
    writeln('ch=',ch);

and the same code even works in Delphi with UTF-16. Cool, ha!?
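
For readers without LazUtils at hand, the single-pass idea can be sketched
with the RTL alone. CodePointSize below is a hypothetical helper written for
this note, not the wiki's or LazUnicode's code, and it assumes valid UTF-8.
Each codepoint is visited once, so the loop is O(n), whereas
UTF8Copy(s,i,1) has to rescan the string from the start for every i.

```pascal
program IterateCodePoints;
{$mode objfpc}{$H+}

// Hypothetical helper: number of bytes in the UTF-8 sequence that
// starts with lead byte B. Assumes the input is valid UTF-8.
function CodePointSize(B: Byte): Integer;
begin
  if B < $80 then
    Result := 1                       // 0xxxxxxx: ASCII
  else if (B and $E0) = $C0 then
    Result := 2                       // 110xxxxx
  else if (B and $F0) = $E0 then
    Result := 3                       // 1110xxxx
  else if (B and $F8) = $F0 then
    Result := 4                       // 11110xxx
  else
    Result := 1;                      // stray continuation byte: step over it
end;

var
  S: String;
  I, N, Count: Integer;
begin
  // "aáb" with á (U+00E1) given as its UTF-8 bytes C3 A1.
  S := 'a' + Chr($C3) + Chr($A1) + 'b';
  I := 1;
  Count := 0;
  while I <= Length(S) do
  begin
    N := CodePointSize(Ord(S[I]));
    WriteLn('codepoint at byte ', I, ', ', N, ' byte(s)');
    Inc(Count);
    Inc(I, N);                        // jump straight to the next lead byte
  end;
  WriteLn(Count, ' codepoints in ', Length(S), ' bytes');
end.
```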

Juha
-- 
___
Lazarus mailing list
Lazarus@lists.lazarus-ide.org
http://lists.lazarus-ide.org/listinfo/lazarus


Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-21 Thread Gabor Boros via Lazarus

On 2016-10-21 12:38, Juha Manninen via Lazarus wrote:

> > for i:=1 to UTF8Length(s) do Write(UTF8Copy(s,i,1))
>
> No, it is not a good solution!
> I predict that most of your code can still use byte indexing. At some
> point you will get a Heureka-moment like "hey, I don't need the
> codepoint index when doing this task!".


Juha, thank you for your patience and sorry if I am a complete idiot 
but... :-)


Why is the example below better than a for loop with UTF8Length and 
UTF8Copy for going through the string?


I hope my "Heureka-moment" is coming shortly! :D

http://wiki.freepascal.org/UTF8_strings_and_characters#Iterating_over_string_analysing_individual_codepoints

Gabor


Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-21 Thread Juha Manninen via Lazarus
On Fri, Oct 21, 2016 at 2:13 PM, Gabor Boros via Lazarus
 wrote:
> Same FPC same Lazarus. Why is there a difference in the result?

You still did not read the wiki page:
 http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus
Console programs are mentioned in many places. This is under "Usage in Lazarus":
 "For console programs (no LCL) a dependency for LazUtils must be
added manually.
  LCL applications already have it through the LCL dependency."

Juha


Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-21 Thread Juha Manninen via Lazarus
On Fri, Oct 21, 2016 at 12:51 PM, Lars via Lazarus
 wrote:
> Indeed this is a serious problem these days, unicode.. which is almost a 
> virus.
> In GoLang they use something called "Runes" to try and solve the problem.

I had to search about what "runes" in GoLang mean. I found:
---
"Code point" is a bit of a mouthful, so Go introduces a shorter term
for the concept: rune. The term appears in the libraries and source
code, and means exactly the same as "code point", with one interesting
addition.
The Go language defines the word rune as an alias for the type int32,
so programs can be clear when an integer value represents a code
point.
---
So it is a new name for CodePoint. Great. It does not sound very
useful to me. I hope they don't do something as stupid as Python 3
does, converting all string data internally to UTF-32.


> Off topic but I wonder if Lazarus/fpc uses anything
> similar to golang's rune approach or has looked into it.

Yes but we call it "CodePoint" like rest of the world does.
CodePoints are the easy part of Unicode, regardless of encoding!
Look at the examples here:
 http://wiki.freepascal.org/UTF8_strings_and_characters
They can handle pretty much any use case dealing with CodePoints. It
is not difficult. It is easy.

Your worries about complexity of Unicode are valid but the reason is
combining CodePoints into user perceived characters. The rules are
complex, there is normalization and its associated problems etc.
No, neither FPC nor Lazarus have library code to deal with that yet.
The goal is to have an enumerator for user perceived characters, just
like LazUnicode unit has for encoding agnostic CodePoints.

Juha


Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-21 Thread Gabor Boros via Lazarus

On 2016-10-21 12:48, Jürgen Hestermann via Lazarus wrote:

> If you really need the character position/length, then
> you have to use UTF8Length/UTF8Copy/etc.
> But as said: It is only needed in special circumstances.
> Still you have to know when to use what.


I will know if somebody describes what the difference is between an á and 
an á character at two points of my program. I answered Juha's reply with 
two short examples. I don't understand why an á character/string is 
different when it comes, for example, from a ReadLn and from Edit1.Text.


Gabor


Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-21 Thread Gabor Boros via Lazarus

On 2016-10-21 12:38, Juha Manninen via Lazarus wrote:

> > I do not want to think of where Length, Copy, Delete is good and where UTF8*
> > needed.
>
> Well, you must think when coding. There is no shortcut. :)
>
> BTW, if you are worried about Delphi compatibility there is now unit
> LazUnicode available.


Delphi compatibility is not needed for me, but I am a silly coder and don't 
understand why the wiki says you can use Length, Copy, Delete with UTF8. 
See two examples below. The first is a Lazarus project with a simple 
edit box: if I press á (Alt+160) the form caption shows 2. The second 
example is a console project (also in Lazarus): if I press á (Alt+160) 
then Enter, I see 1 as the result. Same FPC, same Lazarus. Why is there a 
difference in the result?


1.
procedure TForm1.Edit1Change(Sender: TObject);
begin
  Caption:=IntToStr(Length(Edit1.Text));
end;

2.
var
  s:string;

begin
  ReadLn(s);
  Write(Length(s));
  ReadLn;
end.
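
One possible explanation, offered as an assumption rather than something
stated at this point in the thread: the edit box hands back UTF-8, where
á takes two bytes, while the console ReadLn returns the text in a
single-byte console code page, where Alt+160 is byte $A0 (á in code pages
437, 850 and 852). Length counts bytes in both cases. A minimal sketch
with the bytes spelled out explicitly:

```pascal
program ByteLengthOfAccentedA;
{$mode objfpc}{$H+}
var
  Utf8A, OemA: string;
begin
  Utf8A := #$C3#$A1;  // á encoded as UTF-8 (assumed content of Edit1.Text)
  OemA := #$A0;       // á in code page 437/850/852 (assumed console input)
  WriteLn(Length(Utf8A));  // 2
  WriteLn(Length(OemA));   // 1
end.
```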

Gabor


Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-21 Thread Jürgen Hestermann via Lazarus

On 2016-10-21 at 12:05, Gabor Boros via Lazarus wrote:
> On 2016-10-21 at 10:25, Juha Manninen via Lazarus wrote:
>> * Please read the wiki page ...
> I read, I read but if contains buggy example... ;-)

Yes, this can be very frustrating...
Documentation is one of the major drawbacks of Free Pascal/Lazarus.


> I need a quick and a rock solid solution. Is it good solution if replace all
> Length, Copy, Delete with UTF8Length, UTF8Copy, UTF8Delete and read the strings
> through with this for i:=1 to UTF8Length(s) do Write(UTF8Copy(s,i,1))?
> I do not want to think of where Length, Copy, Delete is good and where UTF8*
> needed.

I think if you want to use Unicode (which is IMO unavoidable today)
then UTF8 is a good choice (see http://utf8everywhere.org )
and then you have to cope with the encoding anyway.
Byte and character position are not related anymore,
neither in UTF-8 nor in UTF-16.
Only UTF-32 provides this but wastes a lot of memory.

But in many cases you do not need the character position.
To find a substring, you only need the byte position.
You can then delete this character from the byte position
and insert another one. Of course, you need to delete as many
bytes as the character consists of.
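
A minimal sketch of that byte-position approach, assuming valid UTF-8
input. ReplaceFirst is a hypothetical helper written for this example,
not an FPC or LazUtils routine; it uses only Pos, Delete, Insert and
Length from the RTL. Searching by bytes is safe in UTF-8 because lead
bytes and continuation bytes use separate value ranges, so a valid
substring can only match at a codepoint boundary:

```pascal
program ReplaceByBytePos;
{$mode objfpc}{$H+}

// Hypothetical helper: replaces the first occurrence of OldPart in Source,
// working purely with byte positions and byte lengths.
function ReplaceFirst(const Source, OldPart, NewPart: string): string;
var
  BytePos: SizeInt;
begin
  Result := Source;
  BytePos := Pos(OldPart, Result);  // byte index, not character index
  if BytePos > 0 then
  begin
    // Remove exactly as many bytes as the old substring occupies.
    Delete(Result, BytePos, Length(OldPart));
    Insert(NewPart, Result, BytePos);
  end;
end;

var
  S: string;
begin
  // "éáó" as UTF-8 bytes; replace "á" (2 bytes) with "a" (1 byte).
  S := ReplaceFirst(#$C3#$A9#$C3#$A1#$C3#$B3, #$C3#$A1, 'a');
  WriteLn(Length(S));  // 5 bytes: é (2) + a (1) + ó (2)
end.
```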

If you really need the character position/length, then
you have to use UTF8Length/UTF8Copy/etc.
But as said: It is only needed in special circumstances.
Still you have to know when to use what.


Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-21 Thread Juha Manninen via Lazarus
On Fri, Oct 21, 2016 at 1:05 PM, Gabor Boros via Lazarus
 wrote:
> On 2016-10-21 at 10:25, Juha Manninen via Lazarus wrote:
> I read, I read but if contains buggy example... ;-)

Mattias fixed the bug.

> I need a quick and a rock solid solution. Is it good solution if replace all
> Length, Copy, Delete with UTF8Length, UTF8Copy, UTF8Delete and read the
> strings through with this for i:=1 to UTF8Length(s) do
> Write(UTF8Copy(s,i,1))?

No, it is not a good solution!
I predict that most of your code can still use byte indexing. At some
point you will get a Eureka moment like "hey, I don't need the
codepoint index when doing this task!".
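
One reason the quoted loop is a poor fit: each UTF8Copy(s,i,1) call has
to scan from the start of the string to locate codepoint i, so the whole
loop does quadratic work. A single pass over the bytes is enough. The
sketch below assumes valid UTF-8; CodepointSize is a hypothetical helper
written for this example (LazUTF8 ships its own routine for this, which
is not used here so the program builds with plain FPC):

```pascal
program WalkCodepoints;
{$mode objfpc}{$H+}

// Hypothetical helper: number of bytes in the codepoint that starts with
// lead byte B, assuming valid UTF-8.
function CodepointSize(B: Byte): Integer;
begin
  if B < $80 then
    Result := 1                      // 0xxxxxxx: ASCII
  else if (B and $E0) = $C0 then
    Result := 2                      // 110xxxxx
  else if (B and $F0) = $E0 then
    Result := 3                      // 1110xxxx
  else if (B and $F8) = $F0 then
    Result := 4                      // 11110xxx
  else
    Result := 1;                     // not a lead byte: step over one byte
end;

var
  S: string;
  i, Size: Integer;
begin
  S := 'x'#$C3#$A1#$E2#$82#$AC;      // "x", "á" (U+00E1), "€" (U+20AC)
  i := 1;
  while i <= Length(S) do
  begin
    Size := CodepointSize(Ord(S[i]));
    WriteLn('codepoint at byte ', i, ' uses ', Size, ' byte(s)');
    Inc(i, Size);                    // jump straight to the next codepoint
  end;
end.
```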

> I do not want to think of where Length, Copy, Delete is good and where UTF8*
> needed.

Well, you must think when coding. There is no shortcut. :)

BTW, if you are worried about Delphi compatibility there is now unit
LazUnicode available.
See:
 
http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#CodePoint_functions_for_encoding_agnostic_code

Juha


Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-21 Thread Gabor Boros via Lazarus

On 2016-10-21 at 10:25, Juha Manninen via Lazarus wrote:

> * Please read the wiki page ...


I read it, I read it, but if it contains a buggy example... ;-)

I need a quick and rock-solid solution. Is it a good solution if I replace
all Length, Copy, Delete with UTF8Length, UTF8Copy, UTF8Delete and read
through the strings with this: for i:=1 to UTF8Length(s) do
Write(UTF8Copy(s,i,1))?


I do not want to think about where Length, Copy, Delete are good and where
UTF8* is needed.


Gabor




Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-21 Thread Lars via Lazarus
On Fri, October 21, 2016 1:03 am, Gabor Boros via Lazarus wrote:
> Hi All,
>
>
> In the past I used Length, Pos, Delete, for i:=1 to Length(s) do s[i]...
> and realized yesterday these practices are wrong. But I do not know what
> the right practice.

Indeed this is a serious problem these days, Unicode... which is almost a
virus. In Go they use something called "runes" to try to solve the
problem. Off topic, but I wonder if Lazarus/FPC uses anything
similar to Go's rune approach or has looked into it.

IMO Unicode runs into something like Gödel's incompleteness problem. You can
never actually prove that a Unicode program will work properly nor prove
that it won't have bugs, because Unicode creates infinite gotchas and
is always evolving to include more characters that you didn't know
about before.

It makes code inelegant compared to the plain English 255-character
systems of the 1970s.

There is an interesting article/video about it on suckless, and even this
guy scares me when he talks about Unicode even though he is trying to fix
the problems:
"UTF-8 everywhere? Writing Unicode compliant software that sucks less,
Laslo Hunhold"

But it of course is not specific to Lazarus. Sorry for slightly off topic.


Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-21 Thread Mattias Gaertner via Lazarus
On Fri, 21 Oct 2016 11:29:36 +0200
Gabor Boros via Lazarus  wrote:

> On 2016-10-21 at 10:24, Juha Manninen via Lazarus wrote:
> > A "character" in Unicode is an ambiguous term.
> > Often the good old byte (codeunit) access is very useful.
> > See:
> >  http://wiki.freepascal.org/UTF8_strings_and_characters  
> 
> I started with the wiki pages, but two pages about UTF8 in English are
> too much for me and you pointed to a third... :-)
> 
> On the above link, under "Searching a substring", I read "Due to the special 
> nature of UTF8 you can simply use the normal string functions for 
> searching a sub-string.". But the example Where procedure returns this 
> for me: "The substring "á" is in the text "éáó" at byte position 3 and 
> at character position 1". Which is incorrect because á is the 2nd character.

Thanks for the hint. I fixed it.

Mattias



Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-21 Thread Gabor Boros via Lazarus

On 2016-10-21 at 10:24, Juha Manninen via Lazarus wrote:

> A "character" in Unicode is an ambiguous term.
> Often the good old byte (codeunit) access is very useful.
> See:
>  http://wiki.freepascal.org/UTF8_strings_and_characters


I started with the wiki pages, but two pages about UTF8 in English are
too much for me and you pointed to a third... :-)


On the above link, under "Searching a substring", I read "Due to the special
nature of UTF8 you can simply use the normal string functions for
searching a sub-string.". But the example Where procedure returns this
for me: "The substring "á" is in the text "éáó" at byte position 3 and
at character position 1". Which is incorrect because á is the 2nd character.
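
A minimal sketch of the distinction between the two positions, assuming
valid UTF-8 input. CodepointsBefore is a hypothetical helper written for
this example, not a routine from FPC or LazUtils, and the program is not
the wiki's Where procedure. Pos gives the byte position; the codepoint
position follows from counting the bytes in front of the match that are
not continuation bytes:

```pascal
program BytePosVsCodepointPos;
{$mode objfpc}{$H+}

// Hypothetical helper: counts the codepoints that start within the first
// ByteCount bytes of S by skipping continuation bytes (10xxxxxx).
function CodepointsBefore(const S: string; ByteCount: SizeInt): SizeInt;
var
  i: SizeInt;
begin
  Result := 0;
  for i := 1 to ByteCount do
    if (Ord(S[i]) and $C0) <> $80 then
      Inc(Result);
end;

var
  Haystack, Needle: string;
  BytePos: SizeInt;
begin
  Haystack := #$C3#$A9#$C3#$A1#$C3#$B3;  // "éáó" as UTF-8 bytes
  Needle := #$C3#$A1;                    // "á"
  BytePos := Pos(Needle, Haystack);      // plain byte search
  if BytePos > 0 then
    WriteLn('byte position ', BytePos, ', codepoint position ',
      CodepointsBefore(Haystack, BytePos - 1) + 1);
end.
```

With these inputs the program should report byte position 3 and
codepoint position 2, which is the result expected above.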


Gabor


Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-21 Thread Juha Manninen via Lazarus
* Please read the wiki page ...


Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?

2016-10-21 Thread Juha Manninen via Lazarus
On Fri, Oct 21, 2016 at 10:03 AM, Gabor Boros via Lazarus
 wrote:
> UTF8* is good to me but a compiler directive is easier to use, just don't
> know why not working properly.

Please read the wiki page you found. It is explained there.
 http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#String_Literals


> What is the proper way to read through (character after character) the
> string if use UTF8* procedures?

A "character" in Unicode is an ambiguous term.
Often the good old byte (codeunit) access is very useful.
See:
 http://wiki.freepascal.org/UTF8_strings_and_characters

Juha