Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On 10/30/2016 06:15 PM, Lars wrote:
> Off topic, off list, but..
>
> > --
> > TRUTH in her dress finds facts too tight.
> > In fiction she moves with ease.
> > Stray Birds by Rabindranath Tagore
>
> What does this mean? Something about this:
> http://www.the-niceguy.com/articles/Nutballs.html
>
> The only thing I can think of is a woman wearing a dress, that is tight, and is incapable of processing facts. I double take on complicated English quotations.

Yah, it's way off topic but I'll bite just this once. It's poetry. It's not to be taken literally. If your native language is not English, it may give you some problems.

For the record, Rabindranath Tagore was an Indian writer. The quote is from his book "Stray Birds". What this couplet is saying is that sometimes it is easier to use a fictional account to express something true. As I've said, it is poetry; these are metaphors, figures of speech.

I find much of Tagore's language very beautiful. We've chosen another Tagore "Stray Bird" as the epitaph on our grave marker. Here are a couple more from "Stray Birds":

The mind, sharp but not broad, sticks at every point but does not move.

and

A mind all logic is like a knife all blade. It makes the hand bleed that uses it.

There are many more...

--
TRUTH in her dress finds facts too tight.
In fiction she moves with ease.
Stray Birds by Rabindranath Tagore

--
___
Lazarus mailing list
Lazarus@lists.lazarus-ide.org
http://lists.lazarus-ide.org/listinfo/lazarus
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On 10/22/2016 06:25 AM, Juha Manninen via Lazarus wrote:
> On Sat, Oct 22, 2016 at 4:12 AM, Martin Frb via Lazarus wrote:
> > Which ones does it not support?
> > When I added it to SynEdit it was complete. It had all the combinings that the Unicode standard had back then. (at least that I could find in the documentation)
> >
> > Of course if a new combining range is added, it will not contain it. If that is needed one needs an external (OS or otherwise) library, that can/will be updated on those occasions.
> >
> > Mind "combining codepoints" have nothing to do with how many codepoints will be represented by one glyph.
>
> Ok, I was confusing the Unicode terms again.
> I guess the biggest complexity is in glyphs and ligatures. I still don't understand their details.
> However for a program that must care about Unicode, like a text layout app, the rules for combining codepoints and glyphs are equally important. Codepoints for one glyph should never be split or copied separately. Isn't it so?
> SynEdit is a text layout app, too.
> In that sense the function IsCombining is not enough for practical purposes. A comprehensive library function should take care of glyphs (+ other rules), too.
> I looked at Bero's PUCU and the other links:
> http://forum.lazarus.freepascal.org/index.php/topic,33064.msg214342.html#msg214342
> but it went over my head. I must study the issue more later.
>
> * A reality check! *
> Despite problems and incompleteness of our Unicode support, it is actually better than most other solutions out there.
> Ok, most programming tools support Unicode somehow but people use them wrong.
> A good example is our forum SMF software. It deals with text layout and definitely should handle Unicode but it does not.
> Not even single Codepoints beyond BMP which should be the easiest case! No combining rules needed or anything.
> Try to add this text to a forum post: (I hope the mail SW can deal with it...)
> "Have 🍷 for FPC 💓 Lazarus."
> Now the fact is that code made with FPC / Lazarus using the LazUnicode functions and enumerators supports Unicode already much better than most code out there!
>
> Juha

I think that there is a degree of confusion about the use of ligatures. Ligatures (at least in English) are typographical elements, not language elements. Not all typefaces support them and the code for a ligature should never appear in the source text. It is the function of the display software to combine adjacent characters and display the appropriate ligature if and only if the font that is used supports them.

A proportional typeface may display the character sequence 'fl' by using the appropriate ligature glyph. A monospaced typeface would display the same sequence as two characters, as would any typeface that did not include the ligature glyphs. Ligatures improve the appearance of text but are strictly a display function and shouldn't actually appear in the text itself.

This may not be true for other writing systems and other languages but is certainly true for English and perhaps other western European languages as well.

--
TRUTH in her dress finds facts too tight.
In fiction she moves with ease.
Stray Birds by Rabindranath Tagore
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On 24.10.2016 15:09, Mattias Gaertner via Lazarus wrote:
> These functions exist.

This of course is great (though the lack of documentation presumably makes them hard to use).

In fact I am not asking for myself; the question is part of the OP's problem. And here I wanted to point out the ambiguity of the term "identical" in that regard.

-Michael
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On Mon, 24 Oct 2016 14:35:28 +0200 Michael Schnell wrote:
>[...] but even trying to find out whether two very short pieces of information are identical is not decently possible.
>[...]
> I meant to point out exactly this ambiguity:
>
> identically coded vs. identically looking (e.g. combining codepoints),
> vs. presumed identical letters that look different (ligatures), ...

About "identically coded": That is "decently possible" - simple string/byte comparison.

About "identically looking": I guess you mean composed vs decomposed form. That is a matter of converting between normal forms. There are functions to normalize, but the information is scattered and it would be nice if someone would write a page.

About "ligatures": I guess you mean "collation". Same problem. Needs better documentation.

Basically you are asking for various compare and normalization functions. These functions exist.

Mattias
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On 24.10.2016 13:34, Mattias Gaertner via Lazarus wrote:
> That depends on what you mean with "identical".

You are absolutely right. Very sorry for being critical while being vague myself (again typing faster than thinking) ;) .

I meant to point out exactly this ambiguity:

identically coded vs. identically looking (e.g. combining codepoints), vs. presumed identical letters that look different (ligatures), ...

-Michael
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On Mon, 24 Oct 2016 12:53:31 +0200 Michael Schnell via Lazarus wrote:
> On 23.10.2016 11:31, Jürgen Hestermann via Lazarus wrote:
> >
> > But Unicode should have cared.
> > It was made for its use on computers.
> I don't think so.
>
> I suppose it was defined with printing out digital documents in mind, but not working with them.

Nonsense. The various normal forms aren't needed for printing, but for "working with them". Same for the various encodings like UTF-8 and UTF-16.

Think about other typesetting systems with diacritics, like TeX. That is made for printing documents, not for working with them.

> At least this is what the outcome suggests: printing works just fine, but even trying to find out whether two very short pieces of information are identical is not decently possible.

That depends on what you mean with "identical". I guess you mean the topic "collation".

It would be nice if someone with some knowledge about that topic could start a wiki page or fpdoc topic to list the common functions for them.

Mattias
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On 23.10.2016 11:31, Jürgen Hestermann via Lazarus wrote:
> But Unicode should have cared.
> It was made for its use on computers.

I don't think so.

I suppose it was defined with printing out digital documents in mind, but not working with them.

At least this is what the outcome suggests: printing works just fine, but even trying to find out whether two very short pieces of information are identical is not decently possible.

-Michael
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On 21.10.2016 12:05, Gabor Boros via Lazarus wrote:
> On 2016-10-21 at 10:25, Juha Manninen via Lazarus wrote:
> > * Please read the wiki page ...
>
> I read, I read but it contains a buggy example... ;-)
>
> I need a quick and rock solid solution.

AFAIK, the only decent advice is never to use the numbers from Pos() / Length() / Copy() / Delete() for anything other than passing them to these same functions. Don't try to interpret these numbers in any way.

Never use the term "Character".

-Michael
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On Sun, 23 Oct 2016 11:31:00 +0200 Jürgen Hestermann via Lazarus wrote:
> On 2016-10-22 at 22:38, Mattias Gaertner via Lazarus wrote:
> > Languages don't care about programmers.
>
> True.
> But Unicode should have cared.
> It was made for its use on computers.
> Pressing each and every language peculiarity
> into Unicode was a mistake and
> made Unicode so hard to use.

No one forces you to consider every "language peculiarity".

Mattias
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On 2016-10-22 at 22:38, Mattias Gaertner via Lazarus wrote:
> Languages don't care about programmers.

True.
But Unicode should have cared.
It was made for its use on computers.
Pressing each and every language peculiarity into Unicode was a mistake and made Unicode so hard to use.
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On Sat, 22 Oct 2016 13:25:30 +0300 Juha Manninen via Lazarus wrote:
>[...]
> I guess the biggest complexity is in glyphs and ligatures. I still
> don't understand their details.

There is nothing to understand. Some languages have irregular letters, the same way English has irregular verbs. You don't "understand" them, you simply learn them. As a programmer you don't need to learn them, but you should be aware that many languages can't be mapped to simple arrays of characters.

> However for a program that must care about Unicode, like a text layout
> app, the rules for combining codepoints and glyphs are equally
> important. Codepoints for one glyph should never be split or copied
> separately. Isn't it so?

"Never" is wrong here. For example, some editors allow selecting the single letters of a ligature. Also, when comparing words you may want to ignore the diacritical signs by using the decomposed form of Unicode. But afaik you are right that most programs never have an issue with ligatures.

Btw, we need a wiki page about collation.

>[...]
> Despite problems and incompleteness of our Unicode support, it is
> actually better than most other solutions out there.
> Ok, most programming tools support Unicode somehow but people use them wrong.
> A good example is our forum SMF software. It deals with text layout
> and definitely should handle Unicode but it does not.
> Not even single Codepoints beyond BMP which should be the easiest
> case! No combining rules needed or anything.

Yes, that is basic Unicode encoding. No ligatures, no bidi. I agree that this is the minimum for supporting Unicode.

SynEdit goes much further. And the native widgets often have pretty good support for the language of the user, so the LCL controls using native widgets automatically have good Unicode support.

Mattias
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On Sat, 22 Oct 2016 12:13:04 +0200 Jürgen Hestermann via Lazarus wrote:
> On 2016-10-22 at 10:53, Mattias Gaertner via Lazarus wrote:
> > Maybe you mean ligatures? Many languages have them, even German:
> > https://en.wikipedia.org/wiki/Typographic_ligature
>
> I thought that ligatures are just a matter of the font
> but not the unicode representation?
> When I write a text which contains the two letters "fi"
> they should be two separate characters in my unicode string
> no matter whether they will be printed as a ligature on the printer or
> screen.
> So ligatures should not influence string encoding in FPC.
> Or am I missing something here?

Ligatures are a group of different issues.

The "fi" ligature is a "stylistic ligature", aka just a font issue, and as such is always represented by the two Unicode codepoints.

The wiki page describes various other types of ligatures, where the Unicode representation can vary.

Languages don't care about programmers.

Mattias
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On Sat, Oct 22, 2016 at 1:13 PM, Jürgen Hestermann via Lazarus wrote:
> So ligatures should not influence string encoding in FPC.
> Or am I missing something here?

I guess it matters for text layout software. It should not separate the two characters forming a ligature.
I admit I don't know the issue well; I am still figuring out the details myself.

Juha
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On Sat, Oct 22, 2016 at 4:12 AM, Martin Frb via Lazarus wrote:
> Which ones does it not support?
> When I added it to SynEdit it was complete. It had all the combinings that
> the Unicode standard had back then. (at least that I could find in the
> documentation)
>
> Of course if a new combining range is added, it will not contain it. If that
> is needed one needs an external (OS or otherwise) library, that can/will be
> updated on those occasions.
>
> Mind "combining codepoints" have nothing to do with how many codepoints will
> be represented by one glyph.

Ok, I was confusing the Unicode terms again.
I guess the biggest complexity is in glyphs and ligatures. I still don't understand their details.
However for a program that must care about Unicode, like a text layout app, the rules for combining codepoints and glyphs are equally important. Codepoints for one glyph should never be split or copied separately. Isn't it so?
SynEdit is a text layout app, too.
In that sense the function IsCombining is not enough for practical purposes. A comprehensive library function should take care of glyphs (+ other rules), too.
I looked at Bero's PUCU and the other links:
http://forum.lazarus.freepascal.org/index.php/topic,33064.msg214342.html#msg214342
but it went over my head. I must study the issue more later.

* A reality check! *
Despite problems and incompleteness of our Unicode support, it is actually better than most other solutions out there.
Ok, most programming tools support Unicode somehow but people use them wrong.
A good example is our forum SMF software. It deals with text layout and definitely should handle Unicode but it does not.
Not even single Codepoints beyond BMP which should be the easiest case! No combining rules needed or anything.
Try to add this text to a forum post: (I hope the mail SW can deal with it...)
"Have 🍷 for FPC 💓 Lazarus."
Now the fact is that code made with FPC / Lazarus using the LazUnicode functions and enumerators supports Unicode already much better than most code out there!

Juha
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On 2016-10-22 at 10:53, Mattias Gaertner via Lazarus wrote:
> Maybe you mean ligatures? Many languages have them, even German:
> https://en.wikipedia.org/wiki/Typographic_ligature

I thought that ligatures are just a matter of the font but not the unicode representation?
When I write a text which contains the two letters "fi" they should be two separate characters in my unicode string no matter whether they will be printed as a ligature on the printer or screen.
So ligatures should not influence string encoding in FPC.
Or am I missing something here?
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On Sat, 22 Oct 2016 02:12:34 +0100 Martin Frb via Lazarus wrote:
>[...]
> It is my understanding (but I do not know for sure) that in some
> languages (such as Arabic) certain letter combinations form a single
> glyph (afaik/google see https://en.wikipedia.org/wiki/Hamzah combined
> with a letter). Though maybe it is considered 2 glyphs? I do not know
> Arabic at all.

Maybe you mean ligatures? Many languages have them, even German:
https://en.wikipedia.org/wiki/Typographic_ligature

Scary: It may depend on the font which letters are combined into a ligature. Even English can have them.

Mattias
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On 21/10/2016 22:16, Juha Manninen via Lazarus wrote:
> UTF-16. It does not support all the complex rules of combining
> CodePoints, but it apparently works well for accented characters in
> western languages.

Which ones does it not support?
When I added it to SynEdit it was complete. It had all the combinings that the Unicode standard had back then. (at least that I could find in the documentation)

Of course if a new combining range is added, it will not contain it. If that is needed one needs an external (OS or otherwise) library, that can/will be updated on those occasions.

Mind "combining codepoints" have nothing to do with how many codepoints will be represented by one glyph.

"â" is one character. But it can be a single codepoint (in utf16 one code unit, i.e. one word; in utf8 several code units, i.e. bytes), or 2 codepoints ("a" + combining "^").

"fi" are 2 chars. But they may be 2 glyphs or 1 glyph (a ligature).
It is my understanding (but I do not know for sure) that in some languages (such as Arabic) certain letter combinations form a single glyph (afaik/google see https://en.wikipedia.org/wiki/Hamzah combined with a letter). Though maybe it is considered 2 glyphs? I do not know Arabic at all.
Also, in some scripts glyphs are displayed in an order different from their occurrence in the text.

All of this however has nothing to do with combining codepoints, or with what counts as a character.
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On Fri, Oct 21, 2016 at 2:26 PM, Juha Manninen wrote:
> No, neither FPC nor Lazarus have library code to deal with [combined
> CodePoints] yet.
> The goal is to have an enumerator for user perceived characters, just
> like LazUnicode unit has for encoding agnostic CodePoints.

Sorry, that was not accurate.
Unit LazUnicode already has TUnicodeCharacterEnumerator which is able to iterate combined accented Unicode characters.
It calls either function UTF8IsCombining or UTF16IsCombining depending on the default encoding in use. Yes, Delphi and UTF-16 are supported.
The code was basically copied from SynEdit and then ported also to UTF-16. It does not support all the complex rules of combining CodePoints, but it apparently works well for accented characters in western languages.

This:
  operator Enumerator(A: String): TUnicodeCharacterEnumerator;
would enable it for the for-in loop, but it is commented out now. The current for-in loop enumerator works with CodePoints.

There is a test project in components/lazutils/test/LazUnicodeTest.lpi. It includes combining CodePoints, too.
Please take a look if you are interested.

Juha
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On Fri, Oct 21, 2016 at 5:08 PM, Jürgen Hestermann via Lazarus wrote:
> And again we are at the point where you need to understand what goes on
> under the hood... ;-)

Yes, but that is true of any programming.
I am truly happy that we have Unicode instead of the old system codepages. I remember often seeing text full of question marks in the past, but not any more. Things are getting better...
I don't even know how the codepages worked when one text had many languages. I don't even care now because we have Unicode. :)

On Fri, Oct 21, 2016 at 5:15 PM, Jürgen Hestermann via Lazarus wrote:
> The problem is, that Unicode has a code point for "á" but
> also allows to compose this character by having an "a"
> and an "´" printed over each other.
> I will never understand why this was allowed because
> I thought that Unicode was introduced to overcome such
> issues by defining a huge number of code points directly.
>
> Nevertheless, if you have such a situation then you cannot
> search for a byte sequence as there are 2 possible representations
> of the same character.

That is all true, although Gabor's problem was not caused by it. His LCL app used the default UTF-8 strings but the console program used the Windows codepage.
Adding to the confusion, the Windows console codepage is different from its system codepage (if I have understood right).
This is another reason to use the default UTF-8 system: it handles it all behind the scenes.

> I have given up on taking care about such composed characters
> and assume that all Unicode strings are normalized.

I have understood the composed version (many codepoints / character) is the recommended normalized one. We must support it properly in future.
The combining rules are extremely complex. Benjamin Rosseaux (BeRo in forum) has code for it. There was some other code, too. I must dive into it sometime in future.
In fact we have simple code for combined accented characters in the LazUnicode unit, despite what I wrote earlier in this thread.
It was basically copied from SynEdit. I will write another post...

Juha
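
Editor's illustration of the codepage point above (a sketch, not part of the original message). It assumes the console in Gabor's test used one of the OEM code pages 437, 850 or 852, where the byte entered with Alt+160 ($A0) is "á"; the thread does not say which code page was active. The bytes are written as numeric constants so the result does not depend on the encoding of the source file.

```pascal
program SameLetterTwoEncodings;
{$mode objfpc}{$H+}
var
  asUtf8, asOem: string;
begin
  asUtf8 := #$C3#$A1;  // U+00E1 encoded as UTF-8: two bytes (what an LCL control holds)
  asOem  := #$A0;      // assumed console input: "á" in OEM code page 437/850/852, one byte
  // Length() counts bytes here, so the same letter reports different lengths.
  WriteLn('UTF-8 length: ', Length(asUtf8));
  WriteLn('OEM length:   ', Length(asOem));
end.
```

Length() is counting storage units in both cases; the letter is the same, only the encoding differs, which matches the 2 versus 1 that Gabor reported.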
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On 2016-10-21 at 13:23, Gabor Boros via Lazarus wrote:
> I will know if somebody describes what the difference is between an á and an á character at two points of my program.

The problem is that Unicode has a code point for "á" but also allows composing this character by having an "a" and an "´" printed over each other.
I will never understand why this was allowed because I thought that Unicode was introduced to overcome such issues by defining a huge number of code points directly.

Nevertheless, if you have such a situation then you cannot search for a byte sequence as there are 2 possible representations of the same character.

I have given up on taking care about such composed characters and assume that all Unicode strings are normalized.
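
Editor's illustration of the two representations described above (a minimal sketch, not part of the original message). Both strings display as "á", yet they hold different bytes, so a plain byte comparison or byte search treats them as different text. The variable names are made up for the example.

```pascal
program TwoForms;
{$mode objfpc}{$H+}
var
  precomposed, decomposed: string;
begin
  precomposed := #$C3#$A1;        // U+00E1 LATIN SMALL LETTER A WITH ACUTE
  decomposed  := 'a' + #$CC#$81;  // U+0061 followed by U+0301 COMBINING ACUTE ACCENT
  WriteLn(Length(precomposed));          // 2 (bytes)
  WriteLn(Length(decomposed));           // 3 (bytes)
  WriteLn(precomposed = decomposed);     // FALSE: byte-wise comparison sees two different strings
end.
```

Normalizing both strings to the same form before comparing is what removes the difference; this sketch only shows why it is needed.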
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On 2016-10-21 at 14:59, Juha Manninen via Lazarus wrote:
> On Fri, Oct 21, 2016 at 3:24 PM, Gabor Boros via Lazarus wrote:
>> Why is the example below better than a for loop with UTF8Length and UTF8Copy
>> for going through the string?
> Because it is MUCH faster. It scales linearly, O(n).
> Calling UTF8Length() and UTF8Copy() inside the loop makes it
> quadratic, O(n^2), or worse depending on how many UTF8...() calls you
> have there.

And again we are at the point where you need to understand what goes on under the hood... ;-)
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On Fri, Oct 21, 2016 at 3:24 PM, Gabor Boros via Lazarus wrote:
> Why is the example below better than a for loop with UTF8Length and UTF8Copy
> for going through the string?

Because it is MUCH faster. It scales linearly, O(n).
Calling UTF8Length() and UTF8Copy() inside the loop makes it quadratic, O(n^2), or worse depending on how many UTF8...() calls you have there, because each call has to scan the string from the start again.
Yes, we have seen complaints that UTF-8 is unusable because you must use the slow UTF8Length() and UTF8Copy(), and that UTF-16 is better because you can use fixed width S[i] indexing. That is obviously based on a misunderstanding of both encodings.

Hint: if you need to iterate CodePoints, you can also use the enumerator from the LazUnicode unit. It uses the same concept as the example in the wiki page.
It allows this code:
  for ch in s do
    writeln('ch=',ch);
and the same code even works in Delphi with UTF-16. Cool, ha!?

Juha
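
Editor's illustration of the single-pass idea (a sketch under stated assumptions, not the wiki's or LazUTF8's code). The helper SeqLen is hypothetical and written here only to keep the program free of dependencies; it assumes the input is valid UTF-8. The loop reads each lead byte once and jumps to the next one, which is why the cost grows linearly with the string length.

```pascal
program LinearWalk;
{$mode objfpc}{$H+}

// Hypothetical helper (not a LazUTF8 routine): number of bytes in the UTF-8
// sequence whose lead byte is b. Assumes valid UTF-8 input.
function SeqLen(b: Byte): Integer;
begin
  if b < $80 then Result := 1
  else if (b and $E0) = $C0 then Result := 2
  else if (b and $F0) = $E0 then Result := 3
  else if (b and $F8) = $F0 then Result := 4
  else Result := 1;  // stray continuation or invalid byte: step over it
end;

var
  s: string;
  i, n: Integer;
begin
  // 'a', U+00E1 and U+1F377 written as raw UTF-8 bytes, so the example does
  // not depend on the encoding of the source file.
  s := 'a' + #$C3#$A1 + #$F0#$9F#$8D#$B7;
  i := 1;
  while i <= Length(s) do
  begin
    n := SeqLen(Ord(s[i]));
    WriteLn('codepoint at byte ', i, ' uses ', n, ' byte(s)');
    Inc(i, n);  // jump straight to the next lead byte: one pass over the string
  end;
end.
```

A loop of the form "for k := 1 to UTF8Length(s) do UTF8Copy(s, k, 1)" has to count forward from the start of the string to find codepoint k on every iteration, which is where the quadratic cost comes from.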
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On 2016-10-21 at 12:38, Juha Manninen via Lazarus wrote:
>> for i:=1 to UTF8Length(s) do Write(UTF8Copy(s,i,1))
>
> No, it is not a good solution!
> I predict that most of your code can still use byte indexing. At some point
> you will get a Heureka-moment like "hey, I don't need the codepoint
> index when doing this task!".

Juha, thank you for your patience and sorry if I am a complete idiot but... :-)

Why is the example below better than a for loop with UTF8Length and UTF8Copy for going through the string? I hope my "Heureka-moment" is coming shortly! :D

http://wiki.freepascal.org/UTF8_strings_and_characters#Iterating_over_string_analysing_individual_codepoints

Gabor
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On Fri, Oct 21, 2016 at 2:13 PM, Gabor Boros via Lazarus wrote:
> Same FPC, same Lazarus. Why is there a difference in the result?

You still did not read the wiki page:
http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus
Console programs are mentioned in many places.
This is under "Usage in Lazarus":
"For console programs (no LCL) a dependency for LazUtils must be added manually. LCL applications already have it through the LCL dependency."

Juha
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On Fri, Oct 21, 2016 at 12:51 PM, Lars via Lazarus wrote:
> Indeed this is a serious problem these days, unicode.. which is almost a
> virus.
> In GoLang they use something called "Runes" to try and solve the problem.

I had to search about what "runes" in GoLang mean. I found:
---
"Code point" is a bit of a mouthful, so Go introduces a shorter term for the concept: rune. The term appears in the libraries and source code, and means exactly the same as "code point", with one interesting addition.
The Go language defines the word rune as an alias for the type int32, so programs can be clear when an integer value represents a code point.
---

So it is a new name for CodePoint. Great. It does not sound very useful to me.
I hope they don't do something as stupid as Python 3 does, converting all string data internally to UTF-32.

> Off topic but I wonder if Lazarus/fpc uses something anything
> similar to golang's rune's approach or looked into it.

Yes, but we call it "CodePoint" like the rest of the world does.
CodePoints are the easy part of Unicode, regardless of encoding!
Look at the examples here:
http://wiki.freepascal.org/UTF8_strings_and_characters
They can handle pretty much any use case dealing with CodePoints. It is not difficult. It is easy.

Your worries about the complexity of Unicode are valid, but the reason is the combining of CodePoints into user perceived characters. The rules are complex, there is normalization and its associated problems etc.
No, neither FPC nor Lazarus have library code to deal with that yet.
The goal is to have an enumerator for user perceived characters, just like LazUnicode unit has for encoding agnostic CodePoints.

Juha
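
Editor's illustration of "a codepoint is just an integer", the idea that Go calls a rune (a sketch, not part of the original message). NextCodepoint is a hypothetical helper written for this example; it is not a LazUTF8 or LazUnicode routine and it assumes the string holds valid UTF-8. It decodes one codepoint into a Cardinal and advances the byte index past it.

```pascal
program DecodeCodepoints;
{$mode objfpc}{$H+}
uses
  SysUtils;

// Hypothetical helper: decodes the codepoint starting at byte index i of s
// and advances i past it. Assumes s holds valid UTF-8.
function NextCodepoint(const s: string; var i: Integer): Cardinal;
var
  b: Byte;
  extra: Integer;
begin
  b := Ord(s[i]);
  if b < $80 then begin Result := b; extra := 0; end
  else if (b and $E0) = $C0 then begin Result := b and $1F; extra := 1; end
  else if (b and $F0) = $E0 then begin Result := b and $0F; extra := 2; end
  else begin Result := b and $07; extra := 3; end;
  Inc(i);
  // Each continuation byte contributes its low six bits.
  while (extra > 0) and (i <= Length(s)) do
  begin
    Result := (Result shl 6) or (Ord(s[i]) and $3F);
    Inc(i);
    Dec(extra);
  end;
end;

var
  s: string;
  i: Integer;
  cp: Cardinal;
begin
  s := 'A' + #$C3#$A1 + #$F0#$9F#$8D#$B7;  // U+0041, U+00E1, U+1F377 as UTF-8 bytes
  i := 1;
  while i <= Length(s) do
  begin
    cp := NextCodepoint(s, i);
    WriteLn('U+', IntToHex(Int64(cp), 4));
  end;
end.
```

The last codepoint lies beyond the BMP, the case Juha mentions elsewhere in the thread as the minimum any Unicode-aware program should handle.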
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On 2016-10-21 at 12:48, Jürgen Hestermann via Lazarus wrote:
> If you really need the character position/length,
> then you have to use UTF8Length/UTF8Copy/etc.
> But as said: It is only needed in special circumstances.
> Still you have to know when to use what.

I will know if somebody describes what the difference is between an á and an á character at two points of my program. I answered Juha's reply with two short examples. I don't understand why an á character/string from, for example, a ReadLn is different from one from Edit1.Text.

Gabor
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On 2016-10-21 at 12:38, Juha Manninen via Lazarus wrote:
>> I do not want to think of where Length, Copy, Delete is good and where UTF8*
>> needed.
>
> Well, you must think when coding. There is no shortcut. :)
>
> BTW, if you are worried about Delphi compatibility there is now unit
> LazUnicode available.

Delphi compatibility is not needed for me, but I am a silly coder and don't understand why the wiki says you can use Length, Copy, Delete with UTF8. See the two examples below. The first is a Lazarus project with a simple edit box; if I press á (Alt+160) the form caption shows 2. The second example is a console project (also in Lazarus); if I press á (Alt+160) and then Enter I see 1 as the result. Same FPC, same Lazarus. Why is there a difference in the result?

1.

procedure TForm1.Edit1Change(Sender: TObject);
begin
  Caption:=IntToStr(Length(Edit1.Text));
end;

2.

var
  s:string;
begin
  ReadLn(s);
  Write(Length(s));
  ReadLn;
end.

Gabor
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On 2016-10-21 at 12:05, Gabor Boros via Lazarus wrote:
> On 2016-10-21 at 10:25, Juha Manninen via Lazarus wrote:
>> * Please read the wiki page ...
> I read, I read but it contains a buggy example... ;-)

Yes, this can be very frustrating...
Documentation is one of the major drawbacks of Free Pascal/Lazarus.

> I need a quick and rock solid solution. Is it a good solution to replace all Length, Copy, Delete with UTF8Length, UTF8Copy, UTF8Delete and read the strings through with this for i:=1 to UTF8Length(s) do Write(UTF8Copy(s,i,1))?
> I do not want to think of where Length, Copy, Delete is good and where UTF8* needed.

I think if you want to use unicode (which is IMO unavoidable today) then UTF8 is a good choice (see http://utf8everywhere.org ) and then you have to cope with the encoding anyway.
Byte and character position are not related anymore, neither in UTF-8 nor in UTF-16.
Only UTF-32 provides this but wastes a lot of memory.

But in many cases you do not need the character position.
To find a substring, you only need the byte position.
You can then delete this character at that byte position and insert another one.
Of course, you need to delete as many bytes as the character consists of.

If you really need the character position/length, then you have to use UTF8Length/UTF8Copy/etc.
But as said: It is only needed in special circumstances.
Still you have to know when to use what.
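
Editor's illustration of the byte-position approach described above (a minimal sketch, not part of the original message). It uses only the standard Pos, Delete and Insert routines with byte indices. This is safe on valid UTF-8 because a complete UTF-8 sequence can never match in the middle of another character; the sample text is made up.

```pascal
program ReplaceByBytes;
{$mode objfpc}{$H+}
var
  s, needle, repl: string;
  p: SizeInt;
begin
  s      := 'caf' + #$C3#$A9 + ' au lait';  // "café au lait" as UTF-8 bytes
  needle := #$C3#$A9;                       // U+00E9, two bytes
  repl   := 'e';
  p := Pos(needle, s);                      // a byte index, not a codepoint index
  if p > 0 then
  begin
    Delete(s, p, Length(needle));           // delete as many bytes as the character has
    Insert(repl, s, p);                     // insert the replacement at the same byte index
  end;
  WriteLn(s);                               // cafe au lait
end.
```

The index returned by Pos is only ever handed back to Delete and Insert, never interpreted as a character count, which is the discipline recommended elsewhere in this thread.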
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On Fri, Oct 21, 2016 at 1:05 PM, Gabor Boros via Lazarus wrote:
> On 2016-10-21 at 10:25, Juha Manninen via Lazarus wrote:
> I read, I read, but it contains a buggy example... ;-)

Mattias fixed the bug.

> I need a quick and rock-solid solution. Is it a good solution if I replace all Length, Copy, Delete with UTF8Length, UTF8Copy, UTF8Delete and read through the strings with this: for i:=1 to UTF8Length(s) do Write(UTF8Copy(s,i,1))?

No, it is not a good solution! I predict that most of your code can still use byte indexing. At some point you will get a Eureka moment like "hey, I don't need the codepoint index when doing this task!".

> I do not want to think about where Length, Copy, Delete are good and where UTF8* is needed.

Well, you must think when coding. There is no shortcut. :)
BTW, if you are worried about Delphi compatibility there is now unit LazUnicode available. See:
http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#CodePoint_functions_for_encoding_agnostic_code

Juha
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On 2016-10-21 at 10:25, Juha Manninen via Lazarus wrote:
> * Please read the wiki page ...

I read, I read, but it contains a buggy example... ;-)

I need a quick and rock-solid solution. Is it a good solution if I replace all Length, Copy, Delete with UTF8Length, UTF8Copy, UTF8Delete and read through the strings with this: for i:=1 to UTF8Length(s) do Write(UTF8Copy(s,i,1))?

I do not want to think about where Length, Copy, Delete are good and where UTF8* is needed.

Gabor
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On Fri, October 21, 2016 1:03 am, Gabor Boros via Lazarus wrote:
> Hi All,
>
> In the past I used Length, Pos, Delete, for i:=1 to Length(s) do s[i]...
> and realized yesterday these practices are wrong. But I do not know what
> the right practice is.

Indeed, this is a serious problem these days: Unicode, which is almost a virus. Go uses something called "runes" to try to solve the problem. Off topic, but I wonder whether Lazarus/FPC uses anything similar to Go's rune approach or has looked into it.

IMO Unicode runs into something like Gödel's incompleteness problem. You can never actually prove that a Unicode program will work properly, nor prove that it won't have bugs, because Unicode creates infinite gotchas and is always evolving to include more characters that you didn't know about before. It makes code inelegant compared to the plain English 255-character systems of the 1970s.

There is an interesting article/video about it on suckless, and even this guy scares me when he talks about Unicode, even though he is trying to fix the problems: "UTF-8 everywhere? Writing Unicode compliant software that sucks less, Laslo Hunhold"

But of course it is not specific to Lazarus. Sorry for being slightly off topic.
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On Fri, 21 Oct 2016 11:29:36 +0200 Gabor Boros via Lazarus wrote:
> On 2016-10-21 at 10:24, Juha Manninen via Lazarus wrote:
> > A "character" in Unicode is an ambiguous term.
> > Often the good old byte (codeunit) access is very useful.
> > See:
> > http://wiki.freepascal.org/UTF8_strings_and_characters
>
> I started with the wiki pages, but two pages about UTF8 in English are too much
> for me, and you pointed to a third... :-)
>
> On the above link, at "Searching a substring", I read "Due to the special
> nature of UTF8 you can simply use the normal string functions for
> searching a sub-string.". But the example procedure Where returns this
> for me: "The substring "á" is in the text "éáó" at byte position 3 and
> at character position 1". Which is incorrect because á is the 2nd character.

Thanks for the hint. I fixed it.

Mattias
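For reference, the expected result for "á" in "éáó" (byte position 3, character position 2) can be checked with a small stand-alone program. This is only a sketch, not the wiki's Where procedure: CodepointsUpTo is a hypothetical helper written for this illustration, it assumes well-formed UTF-8, and it counts codepoints rather than user-perceived characters. In a Lazarus project the UTF8* functions mentioned in this thread would normally be used instead.

```pascal
program BytePosToCodepointPos;
{$mode objfpc}{$H+}

const
  // UTF-8 bytes written out so the source file encoding does not matter.
  E_ACUTE = #$C3#$A9;  // é
  A_ACUTE = #$C3#$A1;  // á
  O_ACUTE = #$C3#$B3;  // ó

// Hypothetical helper: number of codepoints that start within the first
// ByteCount bytes of s. Continuation bytes (10xxxxxx) are not counted.
function CodepointsUpTo(const s: string; ByteCount: SizeInt): SizeInt;
var
  i: SizeInt;
begin
  Result := 0;
  if ByteCount > Length(s) then
    ByteCount := Length(s);
  for i := 1 to ByteCount do
    if (Ord(s[i]) and $C0) <> $80 then
      Inc(Result);
end;

var
  s: string;
  bytePos: SizeInt;
begin
  s := E_ACUTE + A_ACUTE + O_ACUTE;                             // "éáó"
  bytePos := Pos(A_ACUTE, s);                                   // plain byte search
  WriteLn('byte position: ', bytePos);                          // 3
  WriteLn('codepoint position: ', CodepointsUpTo(s, bytePos));  // 2
end.
```

The search itself stays a plain byte search; the codepoint position is derived afterwards, and only if something actually needs it.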
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On 2016-10-21 at 10:24, Juha Manninen via Lazarus wrote:
> A "character" in Unicode is an ambiguous term.
> Often the good old byte (codeunit) access is very useful.
> See:
> http://wiki.freepascal.org/UTF8_strings_and_characters

I started with the wiki pages, but two pages about UTF8 in English are too much for me, and you pointed to a third... :-)

On the above link, at "Searching a substring", I read "Due to the special nature of UTF8 you can simply use the normal string functions for searching a sub-string.". But the example procedure Where returns this for me: "The substring "á" is in the text "éáó" at byte position 3 and at character position 1". Which is incorrect because á is the 2nd character.

Gabor
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
* Please read the wiki page ...
Re: [Lazarus] How to use strings properly with fixes_1_6 and FPC 3.0.0?
On Fri, Oct 21, 2016 at 10:03 AM, Gabor Boros via Lazarus wrote:
> UTF8* is fine for me, but a compiler directive is easier to use; I just don't
> know why it is not working properly.

Please read the wiki page you found. It is explained there.
http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#String_Literals

> What is the proper way to read through (character after character) the
> string if I use the UTF8* procedures?

A "character" in Unicode is an ambiguous term. Often the good old byte (codeunit) access is very useful. See:
http://wiki.freepascal.org/UTF8_strings_and_characters

Juha
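One way to read through a UTF-8 string codepoint by codepoint with plain byte access is sketched below. SeqLen is a hypothetical helper for this illustration, not an LCL or RTL function, and it assumes well-formed UTF-8. It walks codepoints only; a user-perceived "character" built from several codepoints (for example a base letter plus a combining mark) still shows up as several steps.

```pascal
program WalkCodepoints;
{$mode objfpc}{$H+}

// Hypothetical helper: byte length of the UTF-8 sequence whose lead byte is b.
// Assumes well-formed input; any unexpected byte is treated as length 1.
function SeqLen(b: Byte): Integer;
begin
  if b < $80 then Result := 1
  else if (b and $E0) = $C0 then Result := 2
  else if (b and $F0) = $E0 then Result := 3
  else if (b and $F8) = $F0 then Result := 4
  else Result := 1;
end;

var
  s, cp: string;
  i, count: SizeInt;
begin
  // "a", "á" (C3 A1) and U+1F377 (F0 9F 8D B7), written as explicit bytes.
  s := 'a' + #$C3#$A1 + #$F0#$9F#$8D#$B7;
  i := 1;
  count := 0;
  while i <= Length(s) do
  begin
    cp := Copy(s, i, SeqLen(Ord(s[i])));  // all bytes of one codepoint
    Inc(count);
    WriteLn('codepoint ', count, ': byte offset ', i, ', ', Length(cp), ' byte(s)');
    Inc(i, Length(cp));                   // jump to the next lead byte
  end;
end.
```

Each step is a single forward move through the bytes. A loop like for i:=1 to UTF8Length(s) do Write(UTF8Copy(s,i,1)), by contrast, has to locate the i-th codepoint again on every iteration.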