Re: The Case Against Autodecode
On 6/5/2016 1:05 AM, deadalnix wrote:
> TIL: books are read by computers.
I should introduce you to a fabulous technology called OCR. :-)
Re: The Case Against Autodecode
On 6/5/2016 1:07 AM, deadalnix wrote:
> On Saturday, 4 June 2016 at 03:03:16 UTC, Walter Bright wrote:
>> Oh rubbish. Let go of the idea that choosing bad fonts should drive Unicode codepoint decisions.
> Interestingly enough, I've mentioned earlier here that only people from the US would believe that documents with mixed languages aren't commonplace. I wasn't expecting to be proven right that fast.
You'd be in error. I've been casually working on my grandfather's thesis, trying to make a web version of it, and it is mixed German, French, and English. I've also made a digital version of an old history book that mixes English, old English, German, French, Greek, old Greek, and Egyptian hieroglyphs (available on Amazons in your neighborhood!). I've also lived in Germany for 3 years, though that was before computers took over the world.
Re: The Case Against Autodecode
On Saturday, 4 June 2016 at 08:12:47 UTC, Walter Bright wrote:
> On 6/3/2016 11:17 PM, H. S. Teoh via Digitalmars-d wrote:
>> On Fri, Jun 03, 2016 at 08:03:16PM -0700, Walter Bright via Digitalmars-d wrote:
>>> It works for books.
>> Because books don't allow their readers to change the font.
> Unicode is not the font.
>> This madness already exists *without* Unicode. If you have a page with a single glyph 'm' printed on it and show it to an English speaker, he will say it's lowercase M. Show it to a Russian speaker, and he will say it's lowercase Т. So which letter is it, M or Т?
> It's not a problem that Unicode can solve. As you said, the meaning is in the context. Unicode has no context, and tries to solve something it cannot. ('m' doesn't always mean m in English, either. It depends on the context.) Ya know, if Unicode actually solved these problems, you'd have a case. But it doesn't, and so you don't :-)
>> If you're going to represent both languages, you cannot get away from needing to represent letters abstractly, rather than visually.
> Books do visually just fine!
>> So should O and 0 share the same glyph or not? They're visually the same thing,
> No, they're not. Not even on old typewriters where every key was expensive. Even without the slash, the O tends to be fatter than the 0.
>> The very fact that we distinguish between O and 0, independently of what Unicode did/does, is already proof enough that going by visual representation is inadequate.
> Except that you right now are using a font where they are different enough that you have no trouble at all distinguishing them without bothering to look it up. And so am I.
>> In other words toUpper and toLower do not belong in the standard library. Great.
> Unicode and the standard library are two different things.

Even if a character in different languages shares a glyph or looks identical, it makes sense to duplicate it with different code points/units/whatever.
Simple functions like isCyrillicLetter() can then do a simple less-than/greater-than comparison instead of having a lookup table to check different numeric representations scattered throughout the Unicode table. Functions like toUpper and toLower become easier to write as well (for SOME languages, anyhow): it's simply myletter +/- numlettersinalphabet. Redundancy here is very helpful.

Maybe instead of Unicode they should have called it Babel... :)

"The Lord said, “If as one people speaking the same language they have begun to do this, then nothing they plan to do will be impossible for them. Come, let us go down and confuse their language so they will not understand each other.”"

-Jon
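The arithmetic Jon describes can be sketched in Python (the thread's language is D, but the idea is the same); `is_basic_cyrillic` and `cyrillic_to_upper` are hypothetical helper names, and the fixed offset of 32 only works for the basic Russian block U+0410-U+044F, not for Cyrillic as a whole (ё/Ё, for instance, sit outside it):

```python
# Because Unicode gives each script a mostly contiguous block,
# "is this a Cyrillic letter?" can be a simple range comparison,
# and for some alphabets case conversion is a fixed offset.
# This is NOT a general toUpper, as the thread itself points out.

def is_basic_cyrillic(ch: str) -> bool:
    # U+0410 'А' .. U+044F 'я' covers the basic Russian alphabet
    # (ё/Ё live outside this range at U+0451/U+0401).
    return "\u0410" <= ch <= "\u044f"

def cyrillic_to_upper(ch: str) -> str:
    # lowercase а..я sits exactly 32 code points above uppercase А..Я
    if "\u0430" <= ch <= "\u044f":
        return chr(ord(ch) - 32)
    return ch

print(is_basic_cyrillic("д"))   # True
print(is_basic_cyrillic("d"))   # False
print(cyrillic_to_upper("д"))   # Д
```

The same range-check trick is exactly what ASCII-era `isupper`/`tolower` did for Latin, which is presumably what Jon has in mind.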
Re: The Case Against Autodecode
On Friday, June 03, 2016 15:38:38 Walter Bright via Digitalmars-d wrote:
> On 6/3/2016 2:10 PM, Jonathan M Davis via Digitalmars-d wrote:
>> Actually, I would argue that the moment that Unicode is concerned with what the character actually looks like rather than what character it logically is that it's gone outside of its charter. The way that characters actually look is far too dependent on fonts, and aside from display code, code does not care one whit what the character looks like.
> What I meant was pretty clear. Font is an artistic style that does not change context nor semantic meaning. If a font choice changes the meaning then it is not a font.

Well, maybe I misunderstood what was being argued, but it seemed like you've been arguing that two characters should be considered the same just because they look similar, whereas H. S. Teoh is arguing that two characters can be logically distinct while still looking similar and that they should be treated as distinct in Unicode because they're logically distinct. And if that's what's being argued, then I agree with H. S. Teoh. I expect - at least ideally - for Unicode to contain identifiers for characters that are distinct from whatever their visual representation might be. Stuff like fonts then worries about how to display them, and hopefully doesn't do stupid stuff like make a capital I look like a lowercase l (though they often do, unfortunately). But if two characters in different scripts - be they Latin and Cyrillic or whatever - happen to often look the same but would be considered two different characters by humans, then I would expect Unicode to consider them to be different, whereas if no one would reasonably consider them to be anything but exactly the same character, then there should only be one character in Unicode.
However, if we really have crazy stuff where subtly different visual representations of the letter g are considered to be one character in English and two in Russian, then maybe those should be three different characters in Unicode so that the English text can clearly be operating on g, whereas the Russian text is doing whatever it does with its two characters that happen to look like g. I don't know. That sort of thing just gets ugly.

But I definitely think that Unicode characters should be made up of what the logical characters are and leave the visual representation up to the fonts and the like. Now, how to deal with uppercase vs lowercase and all of that sort of stuff is a completely separate issue IMHO, and that comes down to how the characters are somehow logically associated with one another, and it's going to be very locale-specific such that it's not really part of the core of Unicode's charter IMHO (though I'm not sure that it's bad if there's a set of locale rules that go along with Unicode for those looking to correctly apply such rules - they just have nothing to do with code points and graphemes and how they're represented in code).

- Jonathan M Davis
Re: The Case Against Autodecode
On Saturday, 4 June 2016 at 03:03:16 UTC, Walter Bright wrote:
> Oh rubbish. Let go of the idea that choosing bad fonts should drive Unicode codepoint decisions.
Interestingly enough, I've mentioned earlier here that only people from the US would believe that documents with mixed languages aren't commonplace. I wasn't expecting to be proven right that fast.
Re: The Case Against Autodecode
On Friday, 3 June 2016 at 18:43:07 UTC, Walter Bright wrote:
> On 6/3/2016 9:28 AM, H. S. Teoh via Digitalmars-d wrote:
>> Eventually you have no choice but to encode by logical meaning rather than by appearance, since there are many lookalikes between different languages that actually mean something completely different, and often behave completely differently.
> It's almost as if printed documents and books have never existed!
TIL: books are read by computers.
Re: The Case Against Autodecode
On Friday, 3 June 2016 at 12:04:39 UTC, Chris wrote:
> I do exactly this. Validate and normalize.
And once you've done this, autodecoding is useless because the same character has the same representation anyway.
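The validate-and-normalize step can be sketched in Python with the standard `unicodedata` module (a sketch, not the D code the thread is about): after NFC normalization, the precomposed and combining-mark spellings of é compare equal, which is the poster's point that autodecoding then adds nothing.

```python
import unicodedata

precomposed = "caf\u00e9"    # é as one code point, U+00E9
decomposed  = "cafe\u0301"   # e + combining acute accent, U+0301

# As raw code point sequences, the two spellings differ.
print(precomposed == decomposed)   # False

# After normalizing both to NFC, there is one canonical form.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)              # True
```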
Re: The Case Against Autodecode
On 03/06/2016 20:12, Dmitry Olshansky wrote:
> On 02-Jun-2016 23:27, Walter Bright wrote:
>> I wonder what rationale there is for Unicode to have two different sequences of codepoints be treated as the same. It's madness.
> Yeah, Unicode was not meant to be easy, it seems. Or this is whatever happens with evolutionary design that started with "everything is a 16-bit character".

Typing as someone who has spent some time creating typefaces: having two representations makes sense, and it didn't start with Unicode, it started with movable type. It is much easier for a font designer to create the two-codepoint versions of characters for most instances, i.e. make the base letters and the diacritics once. Then what I often do is make single-codepoint versions of the ones I'm likely to use, but only if they need more tweaking than the kerning options of the font format allow. I'll omit the history lesson on how this was similar in the case of movable type.

Keyboards for different languages mean that a character that is a single keystroke in one case is two together or in sequence in another. This means that Unicode not only represents completed strings, but also those that are mid-composition. The ordering that it uses to ensure that graphemes have a single canonical representation is based on the order that those multi-key characters are entered. I wouldn't call it elegant, but it's not inelegant either.

Trying to represent all sufficiently similar glyphs with the same codepoint would lead to a layout problem. How would you order them so that strings of any language can be sorted by their local sorting rules, without having to special-case algorithms?

Also consider ligatures, such as those for "ff", "fi", "ffi", "fl", "ffl" and many, many more. Typographers create these glyphs whenever available kerning tools do a poor job of combining them from the individual glyphs. From the point of view of meaning they should still be represented as individual codepoints, but for display (electronic or print) that sequence needs to be replaced with the single codepoint for the ligature.

I think that in order to understand the decisions of the Unicode committee, one has to consider that they are trying to unify the concerns of representing written information from two sides. One side prioritises storage and manipulation, while the other considers aesthetics and design workflow more important. My experience of using Unicode from both sides gives me a different appreciation for the difficulties of reconciling the two.

A...

P.S. Then they started adding emojis, and I lost all faith in humanity ;)
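The "single canonical representation" ordering described above can be observed with Python's `unicodedata` (a sketch in Python rather than D): Unicode assigns each combining mark a combining class, and normalization sorts marks of different classes into one canonical order, so both typing orders compare equal.

```python
import unicodedata

# q + COMBINING DOT ABOVE (U+0307) + COMBINING DOT BELOW (U+0323),
# entered in two different orders.
dot_above_first = "q\u0307\u0323"
dot_below_first = "q\u0323\u0307"

# As raw sequences they differ...
print(dot_above_first == dot_below_first)   # False

# ...but normalization reorders the marks by combining class
# (dot below, class 220, sorts before dot above, class 230),
# giving one canonical representation.
print(unicodedata.normalize("NFD", dot_above_first)
      == unicodedata.normalize("NFD", dot_below_first))   # True
```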
Re: The Case Against Autodecode
On 6/3/2016 11:17 PM, H. S. Teoh via Digitalmars-d wrote:
> On Fri, Jun 03, 2016 at 08:03:16PM -0700, Walter Bright via Digitalmars-d wrote:
>> It works for books.
> Because books don't allow their readers to change the font.
Unicode is not the font.
> This madness already exists *without* Unicode. If you have a page with a single glyph 'm' printed on it and show it to an English speaker, he will say it's lowercase M. Show it to a Russian speaker, and he will say it's lowercase Т. So which letter is it, M or Т?
It's not a problem that Unicode can solve. As you said, the meaning is in the context. Unicode has no context, and tries to solve something it cannot. ('m' doesn't always mean m in English, either. It depends on the context.) Ya know, if Unicode actually solved these problems, you'd have a case. But it doesn't, and so you don't :-)
> If you're going to represent both languages, you cannot get away from needing to represent letters abstractly, rather than visually.
Books do visually just fine!
> So should O and 0 share the same glyph or not? They're visually the same thing,
No, they're not. Not even on old typewriters where every key was expensive. Even without the slash, the O tends to be fatter than the 0.
> The very fact that we distinguish between O and 0, independently of what Unicode did/does, is already proof enough that going by visual representation is inadequate.
Except that you right now are using a font where they are different enough that you have no trouble at all distinguishing them without bothering to look it up. And so am I.
> In other words toUpper and toLower do not belong in the standard library. Great.
Unicode and the standard library are two different things.
Re: The Case Against Autodecode
One also has to take into consideration that Unicode is the way it is because it was not invented in a vacuum. It had to take the existing landscape into account and find compromises allowing its adoption. Even if they had invented the perfect encoding, NO ONE WOULD HAVE USED IT, as it would have fubar'd everything existing. As it was invented, it allowed a (relatively) smooth transition. Here are some points that made it even possible for Unicode to be adopted at all:
- 16 bits: while that choice was a bit shortsighted, 16 bits is a good compromise between compactness and richness (the BMP suffices to express nearly all living languages).
- Using more or less the same arrangement of codepoints as in the different codepages. This allowed legacy documents with simple scripts to be transformed (as a matter of fact, I wrote a script to repair misencoded Greek documents; it consisted mainly of unich = ch>0x80 ? ch+0x2D0 : ch;).
- UTF-8: this was the stroke of genius, the encoding that allowed mixing it all without requiring awful acrobatics (Joakim is completely out to lunch on that one; shift encodings without self-synchronisation are hellish, which is why the Chinese and Japanese adopted Unicode without hesitation -- they had enough experience with their legacy encodings).
- Leaving time for the transition.
So all the points that people here criticize were in fact the reason why Unicode could even become the standard it is now.
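The Greek repair one-liner quoted above relies on the Unicode Greek block mirroring the ISO-8859-7 codepage layout at a fixed offset of 0x2D0 (0xC1 'Α' in the codepage maps to U+0391). A Python sketch of the same idea (the real script surely handled the non-letter positions below the alphabet more carefully):

```python
# Repair a byte from a misencoded ISO-8859-7 Greek document by
# translating it into the Unicode Greek block at a constant offset,
# exactly as the quoted one-liner does.

def repair_greek_byte(b: int) -> str:
    # 0xC1 'Α' in ISO-8859-7 corresponds to U+0391 'Α': offset 0x2D0
    return chr(b + 0x2D0) if b > 0x80 else chr(b)

misencoded = bytes([0xC1, 0xEB, 0xF6, 0xE1])   # Α, λ, φ, α in ISO-8859-7
print("".join(repair_greek_byte(b) for b in misencoded))   # Αλφα
```

In modern Python the same repair is just `misencoded.decode("iso-8859-7")`, but the offset version shows why the codepage-compatible arrangement made such scripts trivial.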
Re: The Case Against Autodecode
On Friday, 3 June 2016 at 20:53:32 UTC, H. S. Teoh wrote:
> Even the Greek sigma has two forms depending on whether it's at the end of a word or not -- so should it be two code points or one? If you say two, then you'd have a problem with how to search for sigma in Greek text, and you'd have to search for either medial sigma or final sigma. But if you say one, then you'd have a problem with having two different letterforms for a single codepoint.

In Unicode there are 2 different codepoints for lowercase sigma, ς U+03C2 and σ U+03C3, but only one uppercase sigma, Σ U+03A3. Codepoint U+03A2 is undefined. So your objection is not hypothetical; it is actually an issue for uppercase() and lowercase() functions. Another difficulty, besides the dotless and dotted i of Turkic: the double letters used in Latin transcription of Cyrillic text in east and south Europe, dž, lj, nj and dz, which have an uppercase form (DŽ, LJ, NJ, DZ) and a titlecase form (Dž, Lj, Nj, Dz).

> Besides, that still doesn't solve the problem of what "i".uppercase() should return. In most languages, it should return "I", but in Turkish it should not. And if we really went the route of encoding Cyrillic letters the same as their Latin lookalikes, we'd have a problem with what "m".uppercase() should return, because now it depends on which font is in effect (if it's a Cyrillic cursive font, the correct answer is "Т", if it's a Latin font, the correct answer is "M" -- the other combinations: who knows). That sounds far worse than what we have today.

As an anecdote, I can tell the story of the accession of Romania and Bulgaria to the European Union in 2007. The issue was that several letters used by Romanian and Bulgarian had been forgotten by the Unicode consortium (Ș U+0218, ș U+0219, Ț U+021A, ț U+021B, and 2 Cyrillic letters that I do not remember). The Romanians used as a replacement Ş, ş, Ţ and ţ (U+015E, U+015F, U+0162 and U+0163), which look a little bit alike. When the Commission finally managed to force Microsoft to correct the fonts to include them, we could start to correct the data. The transition was finished in 2012 and was only possible because no other language we deal with uses the "wrong" codepoints (except Turkish, but fortunately we only have a handful of Turkish documents in our DBs). So: 5 years of ad hoc processing for the substitution of 4 codepoints.

BTW: using combining diacritics was out of the question at that time simply because Microsoft Word didn't support them, and many documents we encountered still only used codepages (one also has to remember that in a big institution like the EC, the IT is always several years behind the open market, which means that when a product is at release X, the Institution might still be using a release from 5 years earlier).
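The cleanup described above amounts to a one-for-one codepoint substitution, mapping the cedilla stand-ins to the comma-below letters Romanian actually uses. A Python sketch (the function name is hypothetical), with the caveat the post itself raises about Turkish:

```python
# Map the cedilla replacements (Ş ş Ţ ţ) to the Romanian
# comma-below letters (Ș ș Ț ț) that were added to Unicode later.
CEDILLA_TO_COMMA = {
    "\u015e": "\u0218",  # Ş -> Ș
    "\u015f": "\u0219",  # ş -> ș
    "\u0162": "\u021a",  # Ţ -> Ț
    "\u0163": "\u021b",  # ţ -> ț
}

def fix_romanian(text: str) -> str:
    # Only safe once you know the text is Romanian, not Turkish --
    # Turkish legitimately uses Ş/ş, which is exactly why the
    # transition took years of per-document care.
    return text.translate(str.maketrans(CEDILLA_TO_COMMA))

print(fix_romanian("Timi\u015foara"))   # Timișoara
```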
Re: The Case Against Autodecode
On Fri, Jun 03, 2016 at 08:03:16PM -0700, Walter Bright via Digitalmars-d wrote:
> On 6/3/2016 6:08 PM, H. S. Teoh via Digitalmars-d wrote:
>> It's not a hard concept, except that these different letters have lookalike forms with completely unrelated letters. Again:
>>
>> - Lowercase Latin m looks visually the same as lowercase Cyrillic Т in cursive form. In some font renderings the two are IDENTICAL glyphs, in spite of being completely different, unrelated letters. However, in non-cursive form, Cyrillic lowercase т is visually distinct.
>>
>> - Similarly, lowercase Cyrillic П in cursive font looks like lowercase Latin n, and in some fonts they are identical glyphs. Again, completely unrelated letters, yet they have the SAME VISUAL REPRESENTATION. However, in non-cursive font, lowercase Cyrillic П is п, which is visually distinct from Latin n.
>>
>> - These aren't the only ones, either. Other Cyrillic false friends include cursive Д, which in some fonts looks like lowercase Latin g. But in non-cursive font, it's д.
>>
>> Just given the above, it should be clear that going by visual representation is NOT enough to disambiguate between these different letters.
> It works for books.

Because books don't allow their readers to change the font.

> Unicode invented a problem, and came up with a thoroughly wretched "solution" that we'll be stuck with for generations. One of those bad solutions is have the reader not know what a glyph actually is without pulling back the cover to read the codepoint. It's madness.

This madness already exists *without* Unicode. If you have a page with a single glyph 'm' printed on it and show it to an English speaker, he will say it's lowercase M. Show it to a Russian speaker, and he will say it's lowercase Т. So which letter is it, M or Т?

The fundamental problem is that writing systems for different languages interpret the same letter forms differently. In English, lowercase g has at least two different forms that we recognize as the same letter. However, to a Cyrillic reader the two forms are distinct, because one of them looks like a Cyrillic letter but the other one looks foreign. So should g be encoded as a single point or two different points? In a similar vein, to a Cyrillic reader the glyphs т and m represent the same letter, but to an English reader they are clearly two different things. If you're going to represent both languages, you cannot get away from needing to represent letters abstractly, rather than visually.

>> By your argument, since lowercase Cyrillic Т is, visually, just m, it should be encoded the same way as lowercase Latin m. But this is untenable, because the letterform changes with a different font. So you end up with the unworkable idea of a font-dependent encoding.
> Oh rubbish. Let go of the idea that choosing bad fonts should drive Unicode codepoint decisions.

It's not a bad font. It's standard practice to print Cyrillic cursive letters with different glyphs. Russian readers can read both without any problem. The same letter is represented by different glyphs, and therefore the abstract letter is a more fundamental unit of meaning than the glyph itself.

>> Or, to use an example closer to home, uppercase Latin O and the digit 0 are visually identical. Should they be encoded as a single code point or two? Worse, in some fonts, the digit 0 is rendered like Ø (to differentiate it from uppercase O). Does that mean that it should be encoded the same way as the Danish letter Ø? Obviously not, but according to your "visual representation" idea, the answer should be yes.
> Don't confuse fonts with code points. It'd be adequate if Unicode defined a canonical glyph for each code point, and let the font makers do what they wish.

So should O and 0 share the same glyph or not? They're visually the same thing, even though some fonts render them differently. What should be the canonical shape of O vs. 0? If they are the same shape, then by your argument they must be the same code point, regardless of what font makers do to disambiguate them. Good luck writing a parser that can't tell an identifier that begins with O from a number literal that begins with 0. The very fact that we distinguish between O and 0, independently of what Unicode did/does, is already proof enough that going by visual representation is inadequate.

>>> The notion of 'case' should not be part of Unicode, as that is semantic information that is beyond the scope of Unicode.
>> But what should "i".toUpper return?
> Not relevant to my point that Unicode shouldn't decide what "upper case" for all languages means, any more than Unicode should specify a font. Now when you argue that Unicode should make such decisions, note what a spectacularly hopeless job of it they've done.

In other words toUpper and toLower do not belong in the standard library. Great.
Re: The Case Against Autodecode
On 6/3/2016 6:08 PM, H. S. Teoh via Digitalmars-d wrote:
> It's not a hard concept, except that these different letters have lookalike forms with completely unrelated letters. Again:
>
> - Lowercase Latin m looks visually the same as lowercase Cyrillic Т in cursive form. In some font renderings the two are IDENTICAL glyphs, in spite of being completely different, unrelated letters. However, in non-cursive form, Cyrillic lowercase т is visually distinct.
>
> - Similarly, lowercase Cyrillic П in cursive font looks like lowercase Latin n, and in some fonts they are identical glyphs. Again, completely unrelated letters, yet they have the SAME VISUAL REPRESENTATION. However, in non-cursive font, lowercase Cyrillic П is п, which is visually distinct from Latin n.
>
> - These aren't the only ones, either. Other Cyrillic false friends include cursive Д, which in some fonts looks like lowercase Latin g. But in non-cursive font, it's д.
>
> Just given the above, it should be clear that going by visual representation is NOT enough to disambiguate between these different letters.

It works for books. Unicode invented a problem, and came up with a thoroughly wretched "solution" that we'll be stuck with for generations. One of those bad solutions is have the reader not know what a glyph actually is without pulling back the cover to read the codepoint. It's madness.

> By your argument, since lowercase Cyrillic Т is, visually, just m, it should be encoded the same way as lowercase Latin m. But this is untenable, because the letterform changes with a different font. So you end up with the unworkable idea of a font-dependent encoding.

Oh rubbish. Let go of the idea that choosing bad fonts should drive Unicode codepoint decisions.

> Or, to use an example closer to home, uppercase Latin O and the digit 0 are visually identical. Should they be encoded as a single code point or two? Worse, in some fonts, the digit 0 is rendered like Ø (to differentiate it from uppercase O). Does that mean that it should be encoded the same way as the Danish letter Ø? Obviously not, but according to your "visual representation" idea, the answer should be yes.

Don't confuse fonts with code points. It'd be adequate if Unicode defined a canonical glyph for each code point, and let the font makers do what they wish.

>> The notion of 'case' should not be part of Unicode, as that is semantic information that is beyond the scope of Unicode.
> But what should "i".toUpper return?

Not relevant to my point that Unicode shouldn't decide what "upper case" for all languages means, any more than Unicode should specify a font. Now when you argue that Unicode should make such decisions, note what a spectacularly hopeless job of it they've done.
Re: The Case Against Autodecode
On Saturday, 4 June 2016 at 02:46:31 UTC, Walter Bright wrote:
> On 6/3/2016 5:42 PM, ketmar wrote:
>> sometimes used Cyrillic font to represent English.
> Nobody here suggested using the wrong font, it's completely irrelevant.
you suggested that unicode designers should make similar-looking glyphs share the same code, and it reminds me of this little story. maybe i misunderstood you, though.
Re: The Case Against Autodecode
On 6/3/2016 5:42 PM, ketmar wrote:
> sometimes used Cyrillic font to represent English.
Nobody here suggested using the wrong font, it's completely irrelevant.
Re: The Case Against Autodecode
On Fri, Jun 03, 2016 at 03:35:18PM -0700, Walter Bright via Digitalmars-d wrote:
> On 6/3/2016 1:53 PM, H. S. Teoh via Digitalmars-d wrote:
[...]
>> 'Cos by that argument, serif and sans serif letters should have different encodings, because in languages like Hebrew, a tiny little serif could mean the difference between two completely different letters.
> If they are different letters, then they should have a different code point. I don't see why this is such a hard concept.
[...]

It's not a hard concept, except that these different letters have lookalike forms with completely unrelated letters. Again:

- Lowercase Latin m looks visually the same as lowercase Cyrillic Т in cursive form. In some font renderings the two are IDENTICAL glyphs, in spite of being completely different, unrelated letters. However, in non-cursive form, Cyrillic lowercase т is visually distinct.

- Similarly, lowercase Cyrillic П in cursive font looks like lowercase Latin n, and in some fonts they are identical glyphs. Again, completely unrelated letters, yet they have the SAME VISUAL REPRESENTATION. However, in non-cursive font, lowercase Cyrillic П is п, which is visually distinct from Latin n.

- These aren't the only ones, either. Other Cyrillic false friends include cursive Д, which in some fonts looks like lowercase Latin g. But in non-cursive font, it's д.

Just given the above, it should be clear that going by visual representation is NOT enough to disambiguate between these different letters. By your argument, since lowercase Cyrillic Т is, visually, just m, it should be encoded the same way as lowercase Latin m. But this is untenable, because the letterform changes with a different font. So you end up with the unworkable idea of a font-dependent encoding. Similarly, since lowercase Cyrillic П is n (in cursive font), we should encode it the same way as Latin lowercase n. But again, the letterform changes based on font. Your criterion of "same visual representation" does not work outside of English. What you imagine to be a simple, straightforward concept is far from being simple once you're dealing with the diverse languages and writing systems of the world.

Or, to use an example closer to home, uppercase Latin O and the digit 0 are visually identical. Should they be encoded as a single code point or two? Worse, in some fonts, the digit 0 is rendered like Ø (to differentiate it from uppercase O). Does that mean that it should be encoded the same way as the Danish letter Ø? Obviously not, but according to your "visual representation" idea, the answer should be yes. The bottom line is that uppercase O and the digit 0 represent different LOGICAL entities, in spite of their sharing the same visual representation. Eventually you have to resort to representing *logical* entities ("characters") rather than visual appearance, which is a property of the font, and has no place in a digital text encoding.

>> Besides, that still doesn't solve the problem of what "i".uppercase() should return. In most languages, it should return "I", but in Turkish it should not. And if we really went the route of encoding Cyrillic letters the same as their Latin lookalikes, we'd have a problem with what "m".uppercase() should return, because now it depends on which font is in effect (if it's a Cyrillic cursive font, the correct answer is "Т", if it's a Latin font, the correct answer is "M" -- the other combinations: who knows). That sounds far worse than what we have today.
> The notion of 'case' should not be part of Unicode, as that is semantic information that is beyond the scope of Unicode.

But what should "i".toUpper return? Or are you saying the standard library should not include such a basic function as a case-changing function?

T

--
Customer support: the art of getting your clients to pay for your own incompetence.
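For what it's worth, the lookalikes argued about in this exchange are distinct code points in Unicode as it actually exists, which is easy to check from any language; a quick Python sketch:

```python
import unicodedata

# Two strings that can render identically in many fonts,
# yet are logically different characters.
latin_a = "A"          # U+0041 LATIN CAPITAL LETTER A
cyrillic_a = "\u0410"  # U+0410 CYRILLIC CAPITAL LETTER A

print(latin_a == cyrillic_a)        # False: distinct code points
print(unicodedata.name(latin_a))    # LATIN CAPITAL LETTER A
print(unicodedata.name(cyrillic_a)) # CYRILLIC CAPITAL LETTER A

# Case mapping follows the logical letter, not the glyph:
print(cyrillic_a.lower())           # а (U+0430), not Latin a
```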
Re: The Case Against Autodecode
On Friday, 3 June 2016 at 18:43:07 UTC, Walter Bright wrote:
> It's almost as if printed documents and books have never existed!
some old xUSSR books that had some English text sometimes used a Cyrillic font to represent the English. it was awful, and barely readable. this was done to ease the work of compositors, and the result was unacceptable. do you feel a recognizable pattern here? ;-)
Re: The Case Against Autodecode
On Friday, 3 June 2016 at 22:38:38 UTC, Walter Bright wrote:
> If a font choice changes the meaning then it is not a font.
Nah, then it is an Awesome Font that is totally Web Scale! i wish i was making that up: http://fontawesome.io/ (i hate that thing). But, it is kinda legal: gotta love the Unicode private use area!
Re: The Case Against Autodecode
On 6/3/2016 2:10 PM, Jonathan M Davis via Digitalmars-d wrote:
> Actually, I would argue that the moment that Unicode is concerned with what the character actually looks like rather than what character it logically is that it's gone outside of its charter. The way that characters actually look is far too dependent on fonts, and aside from display code, code does not care one whit what the character looks like.
What I meant was pretty clear. Font is an artistic style that does not change context nor semantic meaning. If a font choice changes the meaning then it is not a font.
Re: The Case Against Autodecode
On 6/3/2016 1:53 PM, H. S. Teoh via Digitalmars-d wrote:
> But if we were to encode appearance instead of logical meaning, that would mean the *same* lowercase Cyrillic ь would have multiple, different encodings depending on which font was in use.

I don't see that consequence at all.

> That doesn't seem like the right solution either. Do we really want Unicode strings to encode font information too??

No.

> 'Cos by that argument, serif and sans serif letters should have different encodings, because in languages like Hebrew, a tiny little serif could mean the difference between two completely different letters.

If they are different letters, then they should have a different code point. I don't see why this is such a hard concept.

> And what of the Arabic and Indic scripts? They would need to encode the same letter multiple times, each being a variation of the physical form that changes depending on the surrounding context. Even the Greek sigma has two forms depending on whether it's at the end of a word or not -- so should it be two code points or one?

Two. Again, why is this hard to grasp? If there is meaning in having two different visual representations, then they are two codepoints. If the visual representation is the same, then it is one codepoint. If the difference is only due to font selection, then it is the same codepoint.

> Besides, that still doesn't solve the problem of what "i".uppercase() should return. In most languages, it should return "I", but in Turkish it should not. And if we really went the route of encoding Cyrillic letters the same as their Latin lookalikes, we'd have a problem with what "m".uppercase() should return, because now it depends on which font is in effect (if it's a Cyrillic cursive font, the correct answer is "Т", if it's a Latin font, the correct answer is "M" -- the other combinations: who knows). That sounds far worse than what we have today.

The notion of 'case' should not be part of Unicode, as that is semantic information that is beyond the scope of Unicode.
Re: The Case Against Autodecode
On Friday, June 03, 2016 03:08:43 Walter Bright via Digitalmars-d wrote: > On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote: > > At the time > > Unicode also had to grapple with tricky issues like what to do with > > lookalike characters that served different purposes or had different > > meanings, e.g., the mu sign in the math block vs. the real letter mu in > > the Greek block, or the Cyrillic A which looks and behaves exactly like > > the Latin A, yet the Cyrillic Р, which looks like the Latin P, does > > *not* mean the same thing (it's the equivalent of R), or the Cyrillic В > > whose lowercase is в not b, and also had a different sound, but > > lowercase Latin b looks very similar to Cyrillic ь, which serves a > > completely different purpose (the uppercase is Ь, not B, you see). > > I don't see that this is tricky at all. Adding additional semantic meaning > that does not exist in printed form was outside of the charter of Unicode. > Hence there is no justification for having two distinct characters with > identical glyphs. > > They should have put me in charge of Unicode. I'd have put a stop to much of > the madness :-) Actually, I would argue that the moment that Unicode is concerned with what the character actually looks like rather than what character it logically is that it's gone outside of its charter. The way that characters actually look is far too dependent on fonts, and aside from display code, code does not care one whit what the character looks like. For instance, take the capital letter I, the lowercase letter l, and the number one. In some fonts that are feeling cruel towards folks who actually want to read them, two of those characters - or even all three of them - look identical. But I think that you'll agree that those characters should be represented as distinct characters in Unicode regardless of what they happen to look like in a particular font. Now, take a Cyrillic letter that looks similar to a Latin letter.
If they're logically equivalent such that no code would ever want to distinguish between the two and such that no font would ever even consider representing them differently, then they're truly the same letter, and they should only have one Unicode representation. But if anyone would ever consider them to be logically distinct, then it makes no sense for them to be considered to be the same character by Unicode, because they don't have the same identity. And that distinction is quite clear if any font would ever consider representing the two characters differently, no matter how slight that difference might be. Really, what a character looks like has nothing to do with Unicode. The exact same Unicode is used regardless of how the text is displayed. Rather, what Unicode is doing is providing logical identifiers for characters so that code can operate on them, and display code can then do whatever it does to display those characters, whether they happen to look similar or not. I would think that the fact that non-display code does not care one whit about what a character looks like and that display code can have drastically different visual representations for the same character would make it clear that Unicode is concerned with having identifiers for logical characters and that that is distinct from any visual representation. - Jonathan M Davis
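Jonathan's distinction between appearance and identity is easy to check mechanically. A minimal Python sketch (Python rather than D, purely as illustration): the Latin and Cyrillic capital A render identically in most fonts, yet carry distinct code points and behave differently under case mapping.

```python
latin_a = "\u0041"     # LATIN CAPITAL LETTER A ("A")
cyrillic_a = "\u0410"  # CYRILLIC CAPITAL LETTER A ("А") -- same glyph in most fonts

# Same appearance, different identity:
print(latin_a == cyrillic_a)   # False

# And different behavior: each lowercases within its own alphabet.
print(hex(ord(latin_a.lower())))     # 0x61  (Latin 'a')
print(hex(ord(cyrillic_a.lower())))  # 0x430 (Cyrillic 'а')
```

Non-display code never looks at the glyph; it only ever sees these code points, which is exactly the "logical identifier" role described above.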
Re: The Case Against Autodecode
On Fri, Jun 03, 2016 at 11:43:07AM -0700, Walter Bright via Digitalmars-d wrote: > On 6/3/2016 9:28 AM, H. S. Teoh via Digitalmars-d wrote: > > Eventually you have no choice but to encode by logical meaning > > rather than by appearance, since there are many lookalikes between > > different languages that actually mean something completely > > different, and often behaves completely differently. > > It's almost as if printed documents and books have never existed! But if we were to encode appearance instead of logical meaning, that would mean the *same* lowercase Cyrillic ь would have multiple, different encodings depending on which font was in use. That doesn't seem like the right solution either. Do we really want Unicode strings to encode font information too?? 'Cos by that argument, serif and sans serif letters should have different encodings, because in languages like Hebrew, a tiny little serif could mean the difference between two completely different letters. And what of the Arabic and Indic scripts? They would need to encode the same letter multiple times, each being a variation of the physical form that changes depending on the surrounding context. Even the Greek sigma has two forms depending on whether it's at the end of a word or not -- so should it be two code points or one? If you say two, then you'd have a problem with how to search for sigma in Greek text, and you'd have to search for either medial sigma or final sigma. But if you say one, then you'd have a problem with having two different letterforms for a single codepoint. Besides, that still doesn't solve the problem of what "i".uppercase() should return. In most languages, it should return "I", but in Turkish it should not. 
And if we really went the route of encoding Cyrillic letters the same as their Latin lookalikes, we'd have a problem with what "m".uppercase() should return, because now it depends on which font is in effect (if it's a Cyrillic cursive font, the correct answer is "Т", if it's a Latin font, the correct answer is "M" -- the other combinations: who knows). That sounds far worse than what we have today. T -- Let's eat some disquits while we format the biskettes.
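For what it's worth, both of Teoh's examples are resolved today precisely because the lookalikes got separate code points. A small Python sketch (illustrative, not D): the two sigma forms are unified for searching via case folding, and the Latin/Cyrillic "m" lookalikes uppercase unambiguously with no font information at all.

```python
# Greek sigma: medial σ (U+03C3) and final ς (U+03C2) are two code points;
# searching "for sigma" is solved not in the encoding but by case folding,
# which maps both forms to the same letter.
print("\u03C2".casefold() == "\u03C3")   # True: ς folds to σ
print("\u03C3".casefold() == "\u03C3")   # True: σ folds to itself

# Because Latin m and Cyrillic т have distinct code points, uppercasing
# needs no font context:
print("m".upper())        # 'M'  (U+006D -> U+004D)
print("\u0442".upper())   # 'Т'  (U+0442 -> U+0422)
```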
Re: The Case Against Autodecode
On 6/3/2016 11:54 AM, Timon Gehr wrote: On 03.06.2016 20:41, Walter Bright wrote: How did people ever get by with printed books and documents? They can disambiguate the letters based on context well enough. Characters do not have semantic meaning. Their meaning is always inferred from the context. Unicode's troubles started the moment they stepped beyond their charter.
Re: The Case Against Autodecode
On 02-Jun-2016 23:27, Walter Bright wrote: On 6/2/2016 12:34 PM, deadalnix wrote: On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote: Pretty much everything. Consider s and s1 string variables with possibly different encodings (UTF8/UTF16). * s.all!(c => c == 'ö') works only with autodecoding. It always returns false without. False. Many characters can be represented by different sequences of codepoints. For instance, ê can be ê as one codepoint, or as 'e' followed by a combining circumflex. ö is one such character. There are 3 levels of Unicode support. What Andrei is talking about is Level 1. http://unicode.org/reports/tr18/tr18-5.1.html I wonder what rationale there is for Unicode to have two different sequences of codepoints be treated as the same. It's madness. Yeah, Unicode was not meant to be easy, it seems. Or this is whatever happens with evolutionary design that started with "everything is a 16-bit character". -- Dmitry Olshansky
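The "two sequences, one character" situation Walter calls madness is exactly what normalization papers over. The ö example can be sketched in Python's `unicodedata` (an illustration; the thread itself is about D):

```python
import unicodedata

precomposed = "\u00F6"   # ö as a single code point (NFC form)
combining = "o\u0308"    # 'o' followed by U+0308 COMBINING DIAERESIS (NFD form)

# The two sequences render identically but compare unequal as raw code points:
print(precomposed == combining)   # False
print(len(precomposed), len(combining))   # 1 2

# Only after normalizing both to one form do they compare equal:
print(unicodedata.normalize("NFC", combining) == precomposed)   # True
```

This is why `s.all!(c => c == 'ö')` at code-point level is still wrong in general: the needle and the haystack may be in different normalization forms.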
Re: The Case Against Autodecode
On Friday, 3 June 2016 at 18:41:36 UTC, Walter Bright wrote: How did people ever get by with printed books and documents? Printed books pick one font and one layout, then are read by people. They don't have to be represented in some format where end users can change the font and size, etc.
Re: The Case Against Autodecode
On 03.06.2016 20:41, Walter Bright wrote: On 6/3/2016 3:14 AM, Vladimir Panteleev wrote: That's not right either. Cyrillic letters can look slightly different from their latin lookalikes in some circumstances. I'm sure there are extremely good reasons for not using the latin lookalikes in the Cyrillic alphabets, because most (all?) 8-bit Cyrillic encodings use separate codes for the lookalikes. It's not restricted to Unicode. How did people ever get by with printed books and documents? They can disambiguate the letters based on context well enough.
Re: The Case Against Autodecode
On 6/3/2016 9:28 AM, H. S. Teoh via Digitalmars-d wrote: Eventually you have no choice but to encode by logical meaning rather than by appearance, since there are many lookalikes between different languages that actually mean something completely different, and often behave completely differently. It's almost as if printed documents and books have never existed!
Re: The Case Against Autodecode
On 6/3/2016 3:14 AM, Vladimir Panteleev wrote: That's not right either. Cyrillic letters can look slightly different from their latin lookalikes in some circumstances. I'm sure there are extremely good reasons for not using the latin lookalikes in the Cyrillic alphabets, because most (all?) 8-bit Cyrillic encodings use separate codes for the lookalikes. It's not restricted to Unicode. How did people ever get by with printed books and documents?
Re: The Case Against Autodecode
On 6/3/2016 3:10 AM, Vladimir Panteleev wrote: I don't think it would work (or at least, the analogy doesn't hold). It would mean that you can't add new precomposited characters, because that means that previously valid sequences are now invalid. So don't add new precomposited characters when a recognized existing sequence exists.
Re: The Case Against Autodecode
On Fri, Jun 03, 2016 at 10:14:15AM +, Vladimir Panteleev via Digitalmars-d wrote: > On Friday, 3 June 2016 at 10:08:43 UTC, Walter Bright wrote: > > On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote: > > > At the time Unicode also had to grapple with tricky issues like > > > what to do with lookalike characters that served different > > > purposes or had different meanings, e.g., the mu sign in the math > > > block vs. the real letter mu in the Greek block, or the Cyrillic A > > > which looks and behaves exactly like the Latin A, yet the Cyrillic > > > Р, which looks like the Latin P, does *not* mean the same thing > > > (it's the equivalent of R), or the Cyrillic В whose lowercase is в > > > not b, and also had a different sound, but lowercase Latin b looks > > > very similar to Cyrillic ь, which serves a completely different > > > purpose (the uppercase is Ь, not B, you see). > > > > I don't see that this is tricky at all. Adding additional semantic > > meaning that does not exist in printed form was outside of the > > charter of Unicode. Hence there is no justification for having two > > distinct characters with identical glyphs. > > That's not right either. Cyrillic letters can look slightly different > from their latin lookalikes in some circumstances. > > I'm sure there are extremely good reasons for not using the latin > lookalikes in the Cyrillic alphabets, because most (all?) 8-bit > Cyrillic encodings use separate codes for the lookalikes. It's not > restricted to Unicode. Yeah, lowercase Cyrillic П is п, which looks like lowercase Greek π in some fonts, but in cursive form it looks more like Latin lowercase n. It wouldn't make sense to encode Cyrillic п the same as Greek π or Latin lowercase n just by appearance, since logically it stands as its own character despite its various appearances. But it wouldn't make sense to encode it differently just because you're using a different font! 
Similarly, lowercase Cyrillic т in some cursive fonts looks like lowercase Latin m. I don't think it would make sense to encode lowercase т as Latin m just because of that. Eventually you have no choice but to encode by logical meaning rather than by appearance, since there are many lookalikes between different languages that actually mean something completely different, and often behave completely differently. T -- People say I'm indecisive, but I'm not sure about that. -- YHL, CONLANG
Re: The Case Against Autodecode
On 06/02/2016 05:37 PM, Andrei Alexandrescu wrote: On 6/2/16 5:35 PM, deadalnix wrote: On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu wrote: On 6/2/16 5:20 PM, deadalnix wrote: The good thing when you define works by whatever it does right now No, it works as it was designed. -- Andrei Nobody says it doesn't. Everybody says the design is crap. I think I like it more after this thread. -- Andrei Well there's a fantastic argument.
Re: The Case Against Autodecode
On Friday, 3 June 2016 at 11:46:50 UTC, Jonathan M Davis wrote: On Friday, June 03, 2016 10:10:18 Vladimir Panteleev via Digitalmars-d wrote: On Friday, 3 June 2016 at 10:05:11 UTC, Walter Bright wrote: > On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote: >> However, this >> meant that some precomposed characters were "redundant": >> they >> represented character + diacritic combinations that could >> equally well >> be expressed separately. Normalization was the inevitable >> consequence. > > It is not inevitable. Simply disallow the 2 codepoint > sequences - the single one has to be used instead. > > There is precedent. Some characters can be encoded with more > than one UTF-8 sequence, and the longer sequences were > declared invalid. Simple. > > I.e. have the normalization up front when the text is > created rather than everywhere else. I don't think it would work (or at least, the analogy doesn't hold). It would mean that you can't add new precomposited characters, because that means that previously valid sequences are now invalid. I would have argued that no composited characters should have ever existed regardless of what was done in previous encodings, since they're redundant, and you need the non-composited characters to avoid a combinatorial explosion of characters, so you can't have characters that just have a composited version and be consistent. However, the Unicode folks obviously didn't go that route. But given where we sit now, even though we're stuck with some composited characters, I'd argue that we should at least never add any new ones. But who knows what the Unicode folks are actually going to do. As it is, you probably should normalize strings in many cases where they enter the program, just like ideally, you'd validate them when they enter the program. 
But regardless, you have to deal with the fact that multiple normalization schemes exist and that there's no guarantee that you're even going to get valid Unicode, let alone Unicode that's normalized the way you want. - Jonathan M Davis I do exactly this. Validate and normalize.
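The "validate and normalize at the boundary" approach can be sketched as below. Python for illustration; `ingest` is a hypothetical name, not an API from the thread:

```python
import unicodedata

def ingest(raw: bytes) -> str:
    # Validate: reject byte sequences that are not well-formed UTF-8.
    text = raw.decode("utf-8")   # raises UnicodeDecodeError on invalid input
    # Normalize: pick one form (NFC here) at the program boundary, so the
    # rest of the program never compares mixed normalization forms.
    return unicodedata.normalize("NFC", text)

# Decomposed ö on the wire comes out as the single precomposed code point:
print(ingest("o\u0308".encode("utf-8")) == "\u00F6")   # True

# Invalid UTF-8 is rejected up front instead of poisoning later code:
try:
    ingest(b"\xff\xfe")
except UnicodeDecodeError:
    print("rejected invalid input")
```

The point is the placement: one validation and one normalization pass at entry, rather than every algorithm defending against both problems everywhere.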
Re: The Case Against Autodecode
On Friday, June 03, 2016 10:10:18 Vladimir Panteleev via Digitalmars-d wrote: > On Friday, 3 June 2016 at 10:05:11 UTC, Walter Bright wrote: > > On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote: > >> However, this > >> meant that some precomposed characters were "redundant": they > >> represented character + diacritic combinations that could > >> equally well > >> be expressed separately. Normalization was the inevitable > >> consequence. > > > > It is not inevitable. Simply disallow the 2 codepoint sequences > > - the single one has to be used instead. > > > > There is precedent. Some characters can be encoded with more > > than one UTF-8 sequence, and the longer sequences were declared > > invalid. Simple. > > > > I.e. have the normalization up front when the text is created > > rather than everywhere else. > > I don't think it would work (or at least, the analogy doesn't > hold). It would mean that you can't add new precomposited > characters, because that means that previously valid sequences > are now invalid. I would have argued that no composited characters should have ever existed regardless of what was done in previous encodings, since they're redundant, and you need the non-composited characters to avoid a combinatorial explosion of characters, so you can't have characters that just have a composited version and be consistent. However, the Unicode folks obviously didn't go that route. But given where we sit now, even though we're stuck with some composited characters, I'd argue that we should at least never add any new ones. But who knows what the Unicode folks are actually going to do. As it is, you probably should normalize strings in many cases where they enter the program, just like ideally, you'd validate them when they enter the program. 
But regardless, you have to deal with the fact that multiple normalization schemes exist and that there's no guarantee that you're even going to get valid Unicode, let alone Unicode that's normalized the way you want. - Jonathan M Davis
Re: The Case Against Autodecode
On Friday, 3 June 2016 at 10:08:43 UTC, Walter Bright wrote: On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote: At the time Unicode also had to grapple with tricky issues like what to do with lookalike characters that served different purposes or had different meanings, e.g., the mu sign in the math block vs. the real letter mu in the Greek block, or the Cyrillic A which looks and behaves exactly like the Latin A, yet the Cyrillic Р, which looks like the Latin P, does *not* mean the same thing (it's the equivalent of R), or the Cyrillic В whose lowercase is в not b, and also had a different sound, but lowercase Latin b looks very similar to Cyrillic ь, which serves a completely different purpose (the uppercase is Ь, not B, you see). I don't see that this is tricky at all. Adding additional semantic meaning that does not exist in printed form was outside of the charter of Unicode. Hence there is no justification for having two distinct characters with identical glyphs. That's not right either. Cyrillic letters can look slightly different from their latin lookalikes in some circumstances. I'm sure there are extremely good reasons for not using the latin lookalikes in the Cyrillic alphabets, because most (all?) 8-bit Cyrillic encodings use separate codes for the lookalikes. It's not restricted to Unicode.
Re: The Case Against Autodecode
On Friday, 3 June 2016 at 10:05:11 UTC, Walter Bright wrote: On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote: However, this meant that some precomposed characters were "redundant": they represented character + diacritic combinations that could equally well be expressed separately. Normalization was the inevitable consequence. It is not inevitable. Simply disallow the 2 codepoint sequences - the single one has to be used instead. There is precedent. Some characters can be encoded with more than one UTF-8 sequence, and the longer sequences were declared invalid. Simple. I.e. have the normalization up front when the text is created rather than everywhere else. I don't think it would work (or at least, the analogy doesn't hold). It would mean that you can't add new precomposited characters, because that means that previously valid sequences are now invalid.
Re: The Case Against Autodecode
On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote: At the time Unicode also had to grapple with tricky issues like what to do with lookalike characters that served different purposes or had different meanings, e.g., the mu sign in the math block vs. the real letter mu in the Greek block, or the Cyrillic A which looks and behaves exactly like the Latin A, yet the Cyrillic Р, which looks like the Latin P, does *not* mean the same thing (it's the equivalent of R), or the Cyrillic В whose lowercase is в not b, and also had a different sound, but lowercase Latin b looks very similar to Cyrillic ь, which serves a completely different purpose (the uppercase is Ь, not B, you see). I don't see that this is tricky at all. Adding additional semantic meaning that does not exist in printed form was outside of the charter of Unicode. Hence there is no justification for having two distinct characters with identical glyphs. They should have put me in charge of Unicode. I'd have put a stop to much of the madness :-)
Re: The Case Against Autodecode
On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote: However, this meant that some precomposed characters were "redundant": they represented character + diacritic combinations that could equally well be expressed separately. Normalization was the inevitable consequence. It is not inevitable. Simply disallow the 2 codepoint sequences - the single one has to be used instead. There is precedent. Some characters can be encoded with more than one UTF-8 sequence, and the longer sequences were declared invalid. Simple. I.e. have the normalization up front when the text is created rather than everywhere else.
Re: The Case Against Autodecode
On Thursday, June 02, 2016 15:05:44 Andrei Alexandrescu via Digitalmars-d wrote: > The intent of autodecoding was to make std.algorithm work meaningfully > with strings. As it's easy to see I just went through > std.algorithm.searching alphabetically and found issues literally with > every primitive in there. It's an easy exercise to go forth with the others. It comes down to the question of whether it's better to fail quickly when Unicode is handled incorrectly so that it's obvious that you're doing it wrong, or whether it's better for it to work in a large number of cases so that for a lot of code it "just works" but is still wrong in the general case, and it's a lot less obvious that that's the case, so many folks won't realize that they need to do more in order to have their string handling be Unicode-correct. With code units - especially UTF-8 - it becomes obvious very quickly that treating each element of the string/range as a character is wrong. With code points, you have to work far harder to find examples that are incorrect. So, it's not at all obvious (especially to the lay programmer) that the Unicode handling is incorrect and that their code is wrong - but their code will end up working a large percentage of the time in spite of it being wrong in the general case. So, yes, it's trivial to show how operating on ranges of code units as if they were characters gives incorrect results far more easily than operating on ranges of code points does. But operating on code points as if they were characters is still going to give incorrect results in the general case. Regardless of auto-decoding, the answer is that the programmer needs to understand the Unicode issues and use ranges of code units or code points where appropriate and use ranges of graphemes where appropriate.
It's just that if we default to handling code points, then a lot of code will be written which treats those as characters, and it will provide the correct result more often than it would if it treated code units as characters. In any case, I've probably posted too much in this thread already. It's clear that the first step to solving this problem is to improve Phobos so that it handles ranges of code units, code points, and graphemes correctly whether auto-decoding is involved or not, and only then can we consider the possibility of removing auto-decoding (and even then, the answer may still be that we're stuck, because we consider the resulting code breakage to be too great). But whether Phobos retains auto-decoding or not, the Unicode handling stuff in general is the same, and what we need to do to improve the situation is the same. So, clearly, I need to do a much better job of finding time to work on D so that I can create some PRs to help the situation. Unfortunately, it's far easier to find a few minutes here and there while waiting on other stuff to shoot off a post or two in the newsgroup than it is to find time to substantively work on code. :| - Jonathan M Davis
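The three levels Jonathan distinguishes -- code units, code points, graphemes -- can be made concrete with a single decomposed character (Python used for illustration):

```python
import unicodedata

s = unicodedata.normalize("NFD", "\u00F6")   # ö decomposed: 'o' + combining diaeresis

print(len(s.encode("utf-8")))   # 3 -- UTF-8 code units
print(len(s))                   # 2 -- code points
# ...but a reader sees one character (one grapheme cluster). Iterating by
# code units *or* by code points splits what the user perceives as a single
# letter, which is why neither level is "correct" in the general case.
```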
Re: The Case Against Autodecode
On Thu, Jun 02, 2016 at 05:19:48PM -0700, Walter Bright via Digitalmars-d wrote: > On 6/2/2016 3:27 PM, John Colvin wrote: > > > I wonder what rationale there is for Unicode to have two different > > > sequences of codepoints be treated as the same. It's madness. > > > > There are languages that make heavy use of diacritics, often several > > on a single "character". Hebrew is a good example. Should there be > > only one valid ordering of any given set of diacritics on any given > > character? > > I didn't say ordering, I said there should be no such thing as > "normalization" in Unicode, where two codepoints are considered to be > identical to some other codepoint. I think it was a combination of historical baggage and trying to accommodate unusual but still valid use cases. The historical baggage was that Unicode was trying to unify all of the various already-existing codepages out there, and many of those codepages already come with various precomposed characters. To maximize compatibility with existing codepages, Unicode tried to preserve as much of the original mappings as possible within each 256-point block, so these precomposed characters became part of the standard. However, there weren't enough of them -- some people demanded less common character + diacritic combinations, and some languages had writing so complex their characters had to be composed from more basic parts. The original Unicode range was 16-bit, so there wasn't enough room to fit all of the precomposed characters people demanded, plus there were other things people wanted, like multiple diacritics (e.g., in IPA). So the concept of combining diacritics was invented, in part to prevent combinatorial explosion from soaking up the available code point space, in part to allow for novel combinations of diacritics that somebody out there somewhere might want to represent.
However, this meant that some precomposed characters were "redundant": they represented character + diacritic combinations that could equally well be expressed separately. Normalization was the inevitable consequence. (Normalization, of course, also subsumes a few other things, such as collation, but this is one of the factors behind it.) (This is a greatly over-simplified description, of course. At the time Unicode also had to grapple with tricky issues like what to do with lookalike characters that served different purposes or had different meanings, e.g., the mu sign in the math block vs. the real letter mu in the Greek block, or the Cyrillic A which looks and behaves exactly like the Latin A, yet the Cyrillic Р, which looks like the Latin P, does *not* mean the same thing (it's the equivalent of R), or the Cyrillic В whose lowercase is в not b, and also had a different sound, but lowercase Latin b looks very similar to Cyrillic ь, which serves a completely different purpose (the uppercase is Ь, not B, you see). Then you have the wonderful Indic and Arabic cursive writings, where letterforms mutate depending on the surrounding context, which, if you were to include all variants as distinct code points, would occupy many more pages than they currently do. And also sticky issues like the oft-mentioned Turkish i, which is encoded as a Latin i but behaves differently w.r.t. upper/lowercasing when in Turkish locale -- some cases of this, IIRC, are unfixable bugs in Phobos because we currently do not handle locales. So you see, imagining that code points == the solution to Unicode string handling is a joke. Writing correct Unicode handling is *hard*.) As with all sufficiently complex software projects, Unicode represents a compromise between many contradictory factors -- writing systems in the world being the complex, not-very-consistent beasts they are -- so such "dirty" details are somewhat inevitable. 
T -- Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. -- Brian W. Kernighan
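Teoh's Turkish-i point is worth making concrete: no amount of code-point-level machinery fixes it, because the correct mapping depends on locale, which is information the code points alone cannot carry. A Python sketch (illustrative):

```python
# Locale-independent case mapping, as in Python's str methods, always maps
# the Latin 'i'/'I' pair to each other:
print("i".upper())   # 'I' -- wrong for Turkish text
print("I".lower())   # 'i' -- likewise wrong for Turkish

# Turkish expects 'i' -> 'İ' (U+0130, dotted capital I) and
# 'I' -> 'ı' (U+0131, dotless small i). Which mapping applies is a
# property of the locale, not of the code points being cased -- the
# "unfixable without locale handling" situation the post describes.
```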
Re: The Case Against Autodecode
On Thu, 2 Jun 2016 18:54:21 -0400, Andrei Alexandrescu wrote: > On 06/02/2016 06:10 PM, Marco Leise wrote: > > On Thu, 2 Jun 2016 15:05:44 -0400, > > Andrei Alexandrescu wrote: > > > >> On 06/02/2016 01:54 PM, Marc Schütz wrote: > >>> Which practical tasks are made possible (and work _correctly_) if you > >>> decode to code points, that don't already work with code units? > >> > >> Pretty much everything. > >> > >> s.all!(c => c == 'ö') > > > > Andrei, your ignorance is really starting to grind on > > everyone's nerves. > > Indeed there seem to be serious questions about my competence, basic > comprehension, and now knowledge. That's not my general impression, but something is different with this thread. > I understand it is tempting to assume that a disagreement is caused by > the other simply not understanding the matter. Even if that were true > it's not worth sacrificing civility over it. Civility has had us caught in a 36-page, tiresome debate with us mostly talking past each other. I was being impolite and can't say I regret it, because I prefer this answer over the rest of the thread. It's more informed, elaborate and conclusive. > > If after 350 posts you still don't see > > why this is incorrect: s.any!(c => c == 'o'), you must be > > actively skipping the informational content of this thread. > > Is it 'o' with an umlaut or without? > > At any rate, consider s of type string and x of type dchar. > The dchar type is defined as "a Unicode code point", or at > least my understanding that has been a reasonable definition > to operate with in the D language ever since its first > release. Also in the D language, the various string types > char[], wchar[] etc. with their respective qualified > versions are meant to hold Unicode strings with one of the > UTF8, UTF16, and UTF32 encodings.
> > Following these definitions, it stands to reason to infer that the call > s.find(c => c == x) means "find the code point x in string s and return > the balance of s positioned there". It's prima facie application of the > definitions of the entities involved. > > Is this the only possible or recommended meaning? Most likely not, viz. > the subtle cases in which a given grapheme is represented via either one > or multiple code points by means of combining characters. Is it the best > possible meaning? It's even difficult to define what "best" means > (fastest, covering most languages, etc). > > I'm not claiming that meaning is the only possible, the only > recommended, or the best possible. All I'm arguing is that it's not > retarded, and within a certain universe confined to operating at code > point level (which is reasonable per the definitions of the types > involved) it can be considered correct. > > If at any point in the reasoning above some rampant ignorance comes > about, please point it out. No, it's pretty close now. We can all agree that there is no "best" way, only different use cases. Just defining Phobos to work on code points gives the false illusion that it does the correct thing in all use cases - after all, D claims to support Unicode. But if you want to iterate on visual letters, it is incorrect, and it is otherwise slow when you work on ASCII-structured formats (JSON, XML, paths, Warp, ...). Then there is explaining the different default iteration schemes when using foreach vs. range API (no big deal, just not easily justified) and the cost of implementation when dealing with char[]/wchar[]. From this observation we concluded that decoding should be opt-in and that when we need it, it should be a conscious decision. Unicode is quite complex, and learning about the difference between code points and grapheme clusters when segmenting strings will benefit code quality. As for the question, do multi-code-point graphemes ever appear in the wild?
OS X is known to use NFD on its native file system and there is a hint on Wikipedia that some symbols from Thai or Hindi's Devanagari need them: https://en.wikipedia.org/wiki/UTF-8#Disadvantages Some form of Lithuanian seems to have a use for them, too: http://www.unicode.org/L2/L2012/12026r-n4191-lithuanian.pdf Aside from those there is nothing generally wrong about decomposed letters appearing in strings, even though the use of NFC is encouraged. > > […harsh tone removed…] in the end we have to assume you > > will make a decisive vote against any PR with the intent > > to remove auto-decoding from Phobos. > > This seems to assume I have some vesting in the position > that makes it independent of facts. That is not the case. I > do what I think is right to do, and you do what you think is > right to do. Your vote outweighs that of many others for better or worse. When a decision needs to be made and the community is divided, we need you or Walter or anyone who is invested in the matter to cast a ruling vote. However when several dozen people
Re: The Case Against Autodecode
On Thu, Jun 02, 2016 at 04:29:48PM -0400, Andrei Alexandrescu via Digitalmars-d wrote: > On 06/02/2016 04:22 PM, cym13 wrote: > > > > A:“We should decode to code points” > > B:“No, decoding to code points is a stupid idea.” > > A:“No it's not!” > > B:“Can you show a concrete example where it does something useful?” > > A:“Sure, look at that!” > > B:“This isn't working at all, look at all those counter-examples!” > > A:“It may not work for your examples but look how easy it is to > > find code points!” > > With autodecoding all of std.algorithm operates correctly on code points. > Without it all it does for strings is gibberish. -- Andrei With ASCII strings, all of std.algorithm operates correctly on ASCII bytes. So let's standardize on ASCII strings. What a vacuous argument! Basically you're saying "I define code points to be correct. Therefore, I conclude that decoding to code points is correct." Well, duh. Unfortunately such vacuous conclusions have no bearing in the real world of Unicode handling. T -- I am Ohm of Borg. Resistance is voltage over current.
Re: The Case Against Autodecode
On Thu, Jun 02, 2016 at 04:28:45PM -0400, Andrei Alexandrescu via Digitalmars-d wrote: > On 06/02/2016 04:17 PM, Timon Gehr wrote: > > I.e. you are saying that 'works' means 'operates on code points'. > > Affirmative. -- Andrei Again, a ridiculous position. I can use exactly the same line of argument for why we should just standardize on ASCII. All I have to do is to define "work" to mean "operates on an ASCII character", and then every ASCII algorithm "works" by definition, so nobody can argue with me. Unfortunately, everybody else's definition of "work" is different from mine, so the argument doesn't hold water. Similarly, you are the only one whose definition of "work" means "operates on code points". Basically nobody else here uses that definition, so while you may be right according to your own made-up tautological arguments, none of your conclusions actually have any bearing in the real world of Unicode handling. Give it up. It is beyond reasonable doubt that autodecoding is a liability. D should be moving away from autodecoding instead of clinging to historical mistakes in the face of overwhelming evidence. (And note, I said *auto*-decoding; decoding by itself obviously is very relevant. But it needs to be opt-in because of its performance and correctness implications. The user needs to be able to choose whether to decode, and how to decode.) T -- Freedom: (n.) Man's self-given right to be enslaved by his own depravity.
Re: The Case Against Autodecode
On Thu, Jun 02, 2016 at 04:38:28PM -0400, Andrei Alexandrescu via Digitalmars-d wrote: > On 06/02/2016 04:36 PM, tsbockman wrote: > > Your examples will pass or fail depending on how (and whether) the > > 'ö' grapheme is normalized. > > And that's fine. Want graphemes, .byGrapheme wags its tail in that > corner. Otherwise, you work on code points which is a completely > meaningful way to go about things. What's not meaningful is the random > results you get from operating on code units. > > > They only ever succeed because 'ö' happens to be one of the > > privileged graphemes that *can* be (but often isn't!) represented as > > a single code point. Many other graphemes have no such > > representation. > > Then there's no dchar for them so no problem to start with. > > s.find(c) > "Find code unit c in string s" [...] This is a ridiculous argument. We might as well say, "there's no single UTF-8 byte that can represent Ш, so that's no problem to start with" -- since we can just define it away by saying s.find(c) == "find byte c in string s", and thereby justify using ASCII as our standard string representation. The point is that dchar is NOT ENOUGH TO REPRESENT A SINGLE CHARACTER in the general case. It is adequate for a subset of characters -- just like ASCII is also adequate for a subset of characters. If you only need to work with ASCII, it suffices to work with ubyte[]. Similarly, if your work is restricted to only languages without combining diacritics, then a range of dchar suffices. But a range of dchar is NOT good enough in the general case, and arguing that it suffices only makes you look like a fool. Appealing to normalization doesn't change anything either, since only a subset of base character + diacritic combinations will normalize to a single code point. If the string has a base character + diacritic combination that doesn't have a precomposed code point, it will NOT fit in a dchar. (And keep in mind that the notion of diacritic is still very Euro-centric. 
In Korean, for example, a single character is composed of multiple parts, each of which occupies 1 code point. While some precomposed combinations do exist, they don't cover all of the possibilities, so normalization won't help you there.) T -- Frank disagreement binds closer than feigned agreement.
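Both claims in that post — a single Korean character spanning several code points, and base+diacritic combinations with no precomposed form — can be checked against stock Unicode data. A Python 3 sketch (stdlib unicodedata only):

```python
import unicodedata

# The precomposed Hangul syllable '한' (U+D55C) decomposes into three jamo,
# so in NFD one user-perceived character occupies three code points:
syl = "\ud55c"
assert len(unicodedata.normalize("NFD", syl)) == 3

# Normalization cannot always rescue a dchar-sized view either: 'x' plus
# U+0302 COMBINING CIRCUMFLEX has no precomposed code point, so even NFC
# leaves it as two code points -- it will never fit in a single dchar.
x_hat = "x\u0302"
assert len(unicodedata.normalize("NFC", x_hat)) == 2
```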
Re: The Case Against Autodecode
On Thursday, 2 June 2016 at 21:00:17 UTC, tsbockman wrote: However, this document is very old - from Unicode 3.0 and the year 2000: While there are no surrogate characters in Unicode 3.0 (outside of private use characters), future versions of Unicode will contain them... Perhaps level 1 has since been redefined? I found the latest (unofficial) draft version: http://www.unicode.org/reports/tr18/tr18-18.html Relevant changes: * Level 1 is to be redefined as working on code points, not code units: A fundamental requirement is that Unicode text be interpreted semantically by code point, not code units. * Level 2 (graphemes) is explicitly described as a "default level": This is still a default level—independent of country or language—but provides much better support for end-user expectations than the raw level 1... * All mention of level 2 being slow has been removed. The only reason given for making it toggle-able is for compatibility with level 1 algorithms: Level 2 support matches much more what user expectations are for sequences of Unicode characters. It is still locale-independent and easily implementable. However, for compatibility with Level 1, it is useful to have some sort of syntax that will turn Level 2 support on and off.
Re: The Case Against Autodecode
On 6/2/2016 3:27 PM, John Colvin wrote: I wonder what rationale there is for Unicode to have two different sequences of codepoints be treated as the same. It's madness. There are languages that make heavy use of diacritics, often several on a single "character". Hebrew is a good example. Should there be only one valid ordering of any given set of diacritics on any given character? I didn't say ordering, I said there should be no such thing as "normalization" in Unicode, where two codepoints are considered to be identical to some other codepoint.
Re: The Case Against Autodecode
On 6/2/2016 2:25 PM, deadalnix wrote: On Thursday, 2 June 2016 at 20:27:27 UTC, Walter Bright wrote: I wonder what rationale there is for Unicode to have two different sequences of codepoints be treated as the same. It's madness. To be able to convert back and forth from/to unicode in a lossless manner. Sorry, that makes no sense, as it is saying "they're the same, only different."
Re: The Case Against Autodecode
On 6/2/2016 4:29 PM, Jonathan M Davis via Digitalmars-d wrote: How do you suggest that we handle the normalization issue? Started a new thread for that one.
Re: The Case Against Autodecode
On Thursday, June 02, 2016 15:48:03 Walter Bright via Digitalmars-d wrote: > On 6/2/2016 3:23 PM, Andrei Alexandrescu wrote: > > On 06/02/2016 05:58 PM, Walter Bright wrote: > >> > * s.balancedParens('〈', '〉') works only with autodecoding. > >> > * s.canFind('ö') works only with autodecoding. It returns always > >> > >> false without. > >> > >> Can be made to work without autodecoding. > > > > By special casing? Perhaps. > > The argument to canFind() can be detected as not being a char, then decoded > into a sequence of char's, then forwarded to a substring search. How do you suggest that we handle the normalization issue? Should we just assume NFC like std.uni.normalize does and provide an optional template argument to indicate a different normalization (like normalize does)? Since without providing a way to deal with the normalization, we're not actually making the code fully correct, just faster. - Jonathan M Davis
Re: The Case Against Autodecode
On Thursday, June 02, 2016 22:27:16 John Colvin via Digitalmars-d wrote: > On Thursday, 2 June 2016 at 20:27:27 UTC, Walter Bright wrote: > > I wonder what rationale there is for Unicode to have two > > different sequences of codepoints be treated as the same. It's > > madness. > > There are languages that make heavy use of diacritics, often > several on a single "character". Hebrew is a good example. Should > there be only one valid ordering of any given set of diacritics > on any given character? It's an interesting idea, but it's not > how things are. Yeah. I'm inclined to think that the fact that there are multiple normalizations was a huge mistake on the part of the Unicode folks, but we're stuck dealing with it. And as horrible as it is for most cases, maybe it _does_ ultimately make sense because of certain use cases; I don't know. But bad idea or not, we're stuck. :( - Jonathan M Davis
Re: The Case Against Autodecode
On Thursday, June 02, 2016 18:23:19 Andrei Alexandrescu via Digitalmars-d wrote: > On 06/02/2016 05:58 PM, Walter Bright wrote: > > On 6/2/2016 1:27 PM, Andrei Alexandrescu wrote: > >> The lambda returns bool. -- Andrei > > > > Yes, I was wrong about that. But the point still stands with: > > > * s.balancedParens('〈', '〉') works only with autodecoding. > > > * s.canFind('ö') works only with autodecoding. It returns always > > > > false without. > > > > Can be made to work without autodecoding. > > By special casing? Perhaps. I seem to recall though that one major issue > with autodecoding was that it special-cases certain algorithms. So you'd > need to go through all of std.algorithm and make sure you can > special-case your way out of situations that work today. Yeah, I believe that you do have to do some special casing, though it would be special casing on ranges of code units in general and not strings specifically, and a lot of those functions are already special cased on string in an attempt be efficient. In particular, with a function like find or canFind, you'd take the needle and encode it to match the haystack it was passed so that you can do the comparisons via code units. So, you incur the encoding cost once when encoding the needle rather than incurring the decoding cost of each code point or grapheme as you iterate over the haystack. So, you end up with something that's correct and efficient. It's also much friendlier to code that only operates on ASCII. The one issue that I'm not quite sure how we'd handle in that case is normalization (which auto-decoding doesn't handle either), since you'd need to normalize the needle to match the haystack (which also assumes that the haystack was already normalized). Certainly, it's the sort of thing that makes it so that you kind of wish you were dealing with a string type that had the normalization built into it rather than either an array of code units or an arbitrary range of code units. 
But maybe we could assume the NFC normalization like std.uni.normalize does and provide an optional template argument for the normalization scheme. In any case, while it's not entirely straightforward, it is quite possible to write some algorithms in a way which works on arbitrary ranges of code units and deals with Unicode correctly without auto-decoding or requiring that the user convert it to a range of code points or graphemes in order to properly handle the full range of Unicode. And even if we keep auto-decoding, we pretty much need to fix it so that std.algorithm and friends are Unicode-aware in this manner so that ranges of code units work in general without requiring that you use byGrapheme. So, this sort of thing could have a large impact on RCStr, even if we keep auto-decoding for narrow strings. Other algorithms, however, can't be made to work automatically with Unicode - at least not with the current range paradigm. filter, for instance, really needs to operate on graphemes to filter on characters, but with a range of code units, that would mean operating on groups of code units as a single element, which you can't do with something like a range of char, since that essentially becomes a range of ranges. It has to be wrapped in a range that's going to provide graphemes - and of course, if you know that you're operating only on ASCII, then you wouldn't want to deal with graphemes anyway, so automatically converting to graphemes would be undesirable. So, for a function like filter, it really does have to be up to the programmer to indicate what level of Unicode they want to operate at. But if we don't make functions Unicode-aware where possible, then we're going to take a performance hit by essentially forcing everyone to use explicit ranges of code points or graphemes even when they should be unnecessary. So, I think that we're stuck with some level of special casing, but it would then be for ranges of code units and code points and not strings. 
So, it would work efficiently for stuff like RCStr, which the current scheme does not. I think that the reality of the matter is that regardless of whether we keep auto-decoding for narrow strings in place, we need to make Phobos operate on arbitrary ranges of code units and code points, since even stuff like RCStr won't work efficiently otherwise, and stuff like byCodeUnit won't be usable in as many cases otherwise, because if a generic function isn't Unicode-aware, then in many cases, byCodeUnit will be very wrong, just like byCodePoint would be wrong. So, as far as Phobos goes, I'm not sure that the question of auto-decoding matters much for what we need to do at this point. If we do what we need to do, then Phobos will work whether we have auto-decoding or not (working in a Unicode-aware manner where possible and forcing the user to decide the correct level of Unicode to work at where not), and then it just becomes a question of whether we can or should deprecate auto-decoding once all that's done. -
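The "encode the needle once, then search at the code-unit level" idea from that post is straightforward to sketch. Here is a hypothetical Python helper (the name `can_find` and the NFC default are my assumptions, mirroring what std.uni.normalize assumes), not the actual Phobos implementation:

```python
import unicodedata

def can_find(haystack: str, needle: str) -> bool:
    # Normalize both sides once (NFC assumed), then do the actual
    # search at the code-unit (UTF-8 byte) level -- the per-element
    # decoding cost disappears from the inner loop.
    h = unicodedata.normalize("NFC", haystack).encode("utf-8")
    n = unicodedata.normalize("NFC", needle).encode("utf-8")
    return n in h

assert can_find("scho\u0308n", "ö")   # decomposed haystack, precomposed needle
assert not can_find("schon", "ö")     # plain 'o' is not 'ö'
```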
Re: The Case Against Autodecode
On Thursday, 2 June 2016 at 21:56:10 UTC, Walter Bright wrote: Yes, you have a good point. But we do allow things like: byte b; if (b == 1) ... Why allowing char/wchar/dchar comparisons is wrong: void main() { string s = "Привет"; foreach (c; s) assert(c != 'Ñ'); } From my post from 2014: http://forum.dlang.org/post/knrwiqxhlvqwxqshy...@forum.dlang.org
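That snippet fails for a concrete numeric reason that is easy to verify in any language: one of the UTF-8 code units of "Привет" happens to equal the code point value of 'Ñ', so a per-code-unit comparison "matches" a character that is not in the string. A Python check of the underlying bytes:

```python
s = "Привет"
units = s.encode("utf-8")

# 0xD1 is the UTF-8 lead byte of 'р' and 'т', and it is also exactly
# the code point value of 'Ñ' (U+00D1):
assert ord("Ñ") == 0xD1
assert 0xD1 in units
```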
Re: The Case Against Autodecode
On 03.06.2016 00:23, Andrei Alexandrescu wrote: On 06/02/2016 05:58 PM, Walter Bright wrote: On 6/2/2016 1:27 PM, Andrei Alexandrescu wrote: The lambda returns bool. -- Andrei Yes, I was wrong about that. But the point still stands with: > * s.balancedParens('〈', '〉') works only with autodecoding. > * s.canFind('ö') works only with autodecoding. It returns always false without. Can be made to work without autodecoding. By special casing? Perhaps. I seem to recall though that one major issue with autodecoding was that it special-cases certain algorithms. The major issue is that it special cases when there's different, more natural semantics available.
Re: The Case Against Autodecode
On 03.06.2016 00:26, Walter Bright wrote: On 6/2/2016 3:11 PM, Timon Gehr wrote: Well, this is a somewhat different case, because 1 is just not representable as a byte. Every value that fits in a byte fits in an int though. It's different for code units. They are incompatible both ways. Not exactly. (c == 'ö') is always false for the same reason that (b == 1000) is always false. ... Yes. And _additionally_, some other concerns apply that are not there for byte vs. int. I.e. if b == 1 is disallowed, then c == d should be disallowed too, but b == 1 can be allowed even if c == d is disallowed. I'm not sure what the right answer is here. char to dchar is a lossy conversion, so it shouldn't happen. byte to int is a lossless conversion, so there is no problem a priori.
Re: The Case Against Autodecode
On Thursday, 2 June 2016 at 22:20:49 UTC, Walter Bright wrote: On 6/2/2016 2:05 PM, tsbockman wrote: Presumably if someone marks their own PR as "do not merge", it means they're planning to either close it themselves after it has served its purpose, or they plan to fix/finish it and then remove the "do not merge" label. That doesn't seem to apply here, either. Either way, they shouldn't be closed just because they say "do not merge" (unless they're abandoned or something, obviously). Something like that could not be merged until 132 other PRs are done to fix Phobos. It doesn't belong as a PR. I was just responding to the general question you posed about "do not merge" PRs, not really arguing for that one, in particular, to be re-opened. I'm sure @wilzbach is willing to explain if anyone cares to ask him why he did it as a PR, though.
Re: The Case Against Autodecode
On 6/2/2016 3:10 PM, Marco Leise wrote: we haven't looked into borrowing/scoped enough That's my fault. As for scoped, the idea is to make scope work analogously to DIP25's 'return ref'. I don't believe we need borrowing, we've worked out another solution that will work for ref counting. Please do not reply to this in this thread - start a new one if you wish to continue with this topic.
Re: The Case Against Autodecode
On Thursday, 2 June 2016 at 20:27:27 UTC, Walter Bright wrote: On 6/2/2016 12:34 PM, deadalnix wrote: On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote: Pretty much everything. Consider s and s1 string variables with possibly different encodings (UTF8/UTF16). * s.all!(c => c == 'ö') works only with autodecoding. It returns always false without. False. Many characters can be represented by different sequences of codepoints. For instance, ê can be ê as one codepoint or ^ as a modifier followed by e. ö is one such character. There are 3 levels of Unicode support. What Andrei is talking about is Level 1. http://unicode.org/reports/tr18/tr18-5.1.html I wonder what rationale there is for Unicode to have two different sequences of codepoints be treated as the same. It's madness. There are languages that make heavy use of diacritics, often several on a single "character". Hebrew is a good example. Should there be only one valid ordering of any given set of diacritics on any given character? It's an interesting idea, but it's not how things are.
Re: The Case Against Autodecode
On 6/2/2016 3:11 PM, Timon Gehr wrote: Well, this is a somewhat different case, because 1 is just not representable as a byte. Every value that fits in a byte fits in an int though. It's different for code units. They are incompatible both ways. Not exactly. (c == 'ö') is always false for the same reason that (b == 1000) is always false. I'm not sure what the right answer is here.
Re: The Case Against Autodecode
On 06/02/2016 05:58 PM, Walter Bright wrote: On 6/2/2016 1:27 PM, Andrei Alexandrescu wrote: The lambda returns bool. -- Andrei Yes, I was wrong about that. But the point still stands with: > * s.balancedParens('〈', '〉') works only with autodecoding. > * s.canFind('ö') works only with autodecoding. It returns always false without. Can be made to work without autodecoding. By special casing? Perhaps. I seem to recall though that one major issue with autodecoding was that it special-cases certain algorithms. So you'd need to go through all of std.algorithm and make sure you can special-case your way out of situations that work today. Andrei
Re: The Case Against Autodecode
On Thursday, 2 June 2016 at 22:03:01 UTC, default0 wrote: *sigh* reading comprehension. ... Please do not take what I say out of context, thank you. Earlier you said: The level 2 support description noted that it should be opt-in because its slow. My main point is simply that you mischaracterized what the standard says. Making level 1 opt-in, rather than level 2, would be just as compliant as the reverse. The standard makes no suggestion as to which should be default.
Re: The Case Against Autodecode
On Thu, 2 Jun 2016 15:05:44 -0400, Andrei Alexandrescu wrote: > On 06/02/2016 01:54 PM, Marc Schütz wrote: > > Which practical tasks are made possible (and work _correctly_) if you > > decode to code points, that don't already work with code units? > > Pretty much everything. > > s.all!(c => c == 'ö') Andrei, your ignorance is really starting to grind on everyone's nerves. If after 350 posts you still don't see why this is incorrect: s.any!(c => c == 'o'), you must be actively skipping the informational content of this thread. You are in error, no one agrees with you, and you refuse to see it and in the end we have to assume you will make a decisive vote against any PR with the intent to remove auto-decoding from Phobos. Your so called vocal minority is actually D's panel of Unicode experts who understand that auto-decoding is a false ally and should be on the deprecation track. Remember final-by-default? You promised, that your objection about breaking code means that D2 will only continue to be fixed in a backwards compatible way, be it the implementation of shared or whatever else. Yet months later you opened a thread with the title "inout must go". So that must have been an appeasement back then. People don't forget these things easily and RCStr seems to be a similar distraction, considering we haven't looked into borrowing/scoped enough and you promise wonders from it. -- Marco
Re: The Case Against Autodecode
On 02.06.2016 23:56, Walter Bright wrote: On 6/2/2016 1:12 PM, Timon Gehr wrote: ... It is not meaningful to compare utf-8 and utf-16 code units directly. Yes, you have a good point. But we do allow things like: byte b; if (b == 1) ... Well, this is a somewhat different case, because 1 is just not representable as a byte. Every value that fits in a byte fits in an int though. It's different for code units. They are incompatible both ways. E.g. dchar obviously does not fit in a char, and while the lower half of char is compatible with dchar, the upper half is specific to the encoding. dchar cannot represent upper half char code units. You get the code points with the corresponding values instead. E.g.: void main(){ import std.stdio,std.utf; foreach(dchar d;"ö".byCodeUnit) writeln(d); // "Ã", "¶" }
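Timon's "Ã", "¶" output can be reproduced anywhere: promoting each UTF-8 code unit to a code point of the same value is precisely a Latin-1 reinterpretation of the bytes. A Python check (using the explicit escape `"\u00f6"` for 'ö' to avoid normalization surprises in the source text):

```python
units = "\u00f6".encode("utf-8")          # 'ö' encodes as 0xC3 0xB6
assert [hex(b) for b in units] == ["0xc3", "0xb6"]

# Treating each code unit as a code point of the same value yields
# U+00C3 'Ã' and U+00B6 '¶' -- the classic mojibake pair:
assert units.decode("latin-1") == "\u00c3\u00b6"
```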
Re: The Case Against Autodecode
On Thursday, 2 June 2016 at 21:51:51 UTC, tsbockman wrote: On Thursday, 2 June 2016 at 21:38:02 UTC, default0 wrote: On Thursday, 2 June 2016 at 21:30:51 UTC, tsbockman wrote: 1) It does not say that level 2 should be opt-in; it says that level 2 should be toggle-able. Nowhere does it say which of level 1 and 2 should be the default. 2) It says that working with graphemes is slower than UTF-16 code UNITS (level 1), but says nothing about streaming decoding of code POINTS (what we have). 3) That document is from 2000, and its claims about performance are surely extremely out-dated, anyway. Computers and the Unicode standard have both changed much since then. 1) Right because a special toggleable syntax is definitely not "opt-in". It is not "opt-in" unless it is toggled off by default. The only reason it doesn't talk about toggling in the level 1 section, is because that section is written with the assumption that many programs will *only* support level 1. *sigh* reading comprehension. Needing to write .byGrapheme or similar to enable the behaviour qualifies for what that description was arguing for. I hope you understand that now that I am repeating this for you. 2) Several people in this thread noted that working on graphemes is way slower (which makes sense, because it's yet another processing step you need to do after you've decoded - therefore more work - therefore slower) than working on code points. And working on code points is way slower than working on code units (the actual level 1). Never claimed the opposite. Do note however that it's specifically talking about UTF-16 code units. 3) Not an argument - doing more work makes code slower. What do you think I'm arguing for? It's not graphemes-by-default. Unrelated. I was refuting the point you made about the relevance of the performance claims of the unicode level 2 support description, not evaluating your hypothetical design. Please do not take what I say out of context, thank you.
Re: The Case Against Autodecode
On 02.06.2016 23:46, Andrei Alexandrescu wrote: On 6/2/16 5:43 PM, Timon Gehr wrote: .̂ ̪.̂ (Copy-paste it somewhere else, I think it might not be rendered correctly on the forum.) The point is that if I do: ".̂ ̪.̂".normalize!NFC.byGrapheme.findAmong([Grapheme("."),Grapheme(",")]) no match is returned. If I use your method with dchars, I will get spurious matches. I.e. the suggested method to look for punctuation symbols is incorrect: writeln(".̂ ̪.̂".findAmong(",.")); // ".̂ ̪.̂" Nice example. ... Thanks! :o) (Also, do you have a use case for this?) Count delimited words. Did you also look at balancedParens? Andrei On 02.06.2016 22:01, Timon Gehr wrote: * s.balancedParens('〈', '〉') works only with autodecoding. ... Doesn't work, e.g. s="⟨⃖". Shouldn't compile. assert("⟨⃖".normalize!NFC.byGrapheme.balancedParens(Grapheme("⟨"),Grapheme("⟩"))); writeln("⟨⃖".balancedParens('⟨','⟩')); // false
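The spurious-match behaviour in the findAmong example is reproducible with even a crude grapheme grouping. A Python sketch (the `graphemes` helper below is my own simplification — real UAX #29 cluster segmentation handles many more cases than base-plus-combining-marks):

```python
import unicodedata

def graphemes(s):
    # Crude clustering: attach each combining mark to the preceding base.
    out = []
    for ch in s:
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out

text = ".\u0302 \u032a.\u0302"   # the ".̂ ̪.̂" string from the post, spelled out

# Code point level: a findAmong-style scan "finds" a '.', spuriously:
assert any(c in ".," for c in text)
# Grapheme level: no cluster is a plain '.' or ',', so no match:
assert not any(g in (".", ",") for g in graphemes(text))
```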
Re: The Case Against Autodecode
On 6/2/2016 1:27 PM, Andrei Alexandrescu wrote: The lambda returns bool. -- Andrei Yes, I was wrong about that. But the point still stands with: > * s.balancedParens('〈', '〉') works only with autodecoding. > * s.canFind('ö') works only with autodecoding. It returns always false without. Can be made to work without autodecoding.
Re: The Case Against Autodecode
On 6/2/2016 1:12 PM, Timon Gehr wrote: On 02.06.2016 22:07, Walter Bright wrote: On 6/2/2016 12:05 PM, Andrei Alexandrescu wrote: * s.all!(c => c == 'ö') works only with autodecoding. It returns always false without. The o is inferred as a wchar. The lamda then is inferred to return a wchar. No, the lambda returns a bool. Thanks for the correction. The algorithm can check that the input is char[], and is being tested against a wchar. Therefore, the algorithm can specialize to do the decoding itself. No autodecoding necessary, and it does the right thing. It still would not be the right thing. The lambda shouldn't compile. It is not meaningful to compare utf-8 and utf-16 code units directly. Yes, you have a good point. But we do allow things like: byte b; if (b == 1) ...
Re: The Case Against Autodecode
On Thursday, 2 June 2016 at 21:38:02 UTC, default0 wrote: On Thursday, 2 June 2016 at 21:30:51 UTC, tsbockman wrote: 1) It does not say that level 2 should be opt-in; it says that level 2 should be toggle-able. Nowhere does it say which of level 1 and 2 should be the default. 2) It says that working with graphemes is slower than UTF-16 code UNITS (level 1), but says nothing about streaming decoding of code POINTS (what we have). 3) That document is from 2000, and its claims about performance are surely extremely out-dated, anyway. Computers and the Unicode standard have both changed much since then. 1) Right because a special toggleable syntax is definitely not "opt-in". It is not "opt-in" unless it is toggled off by default. The only reason it doesn't talk about toggling in the level 1 section, is because that section is written with the assumption that many programs will *only* support level 1. 2) Several people in this thread noted that working on graphemes is way slower (which makes sense, because it's yet another processing step you need to do after you've decoded - therefore more work - therefore slower) than working on code points. And working on code points is way slower than working on code units (the actual level 1). 3) Not an argument - doing more work makes code slower. What do you think I'm arguing for? It's not graphemes-by-default. What I actually want to see: permanently deprecate the auto-decoding range primitives. Force the user to explicitly specify whichever of `by!dchar`, `byCodePoint`, or `byGrapheme` their specific algorithm actually needs. Removing the implicit conversions between `char`, `wchar`, and `dchar` would also be nice, but isn't really necessary I think. That would be a standards-compliant solution (one of several possible). What we have now is non-standard, at least going by the old version Walter linked.
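The three levels being argued over give three different element counts for the very same string, which is why forcing an explicit choice (byCodeUnit / code points / byGrapheme, as proposed above) is meaningful rather than pedantic. A Python illustration for 'ö' spelled in NFD:

```python
s = "o\u0308"   # 'ö' as base 'o' + U+0308 COMBINING DIAERESIS (NFD)

assert len(s.encode("utf-8")) == 3   # code units: 0x6F 0xCC 0x88
assert len(s) == 2                   # code points: U+006F, U+0308
# Graphemes: 1 -- counting those needs a UAX #29 segmenter
# (e.g. the third-party 'grapheme' package; not shown here).
```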
Re: The Case Against Autodecode
On 6/2/16 5:43 PM, Timon Gehr wrote: .̂ ̪.̂ (Copy-paste it somewhere else, I think it might not be rendered correctly on the forum.) The point is that if I do: ".̂ ̪.̂".normalize!NFC.byGrapheme.findAmong([Grapheme("."),Grapheme(",")]) no match is returned. If I use your method with dchars, I will get spurious matches. I.e. the suggested method to look for punctuation symbols is incorrect: writeln(".̂ ̪.̂".findAmong(",.")); // ".̂ ̪.̂" Nice example. (Also, do you have an use case for this?) Count delimited words. Did you also look at balancedParens? Andrei
Re: The Case Against Autodecode
On 6/2/16 5:38 PM, cym13 wrote: Allow me to try another angle: - There are different levels of unicode support and you don't want to support them all transparently. That's understandable. Cool. - The level you choose to support is the code point level. There are many good arguments about why this isn't a good default but you won't change your mind. I don't like that at all and I'm not alone but let's forget the entirety of the vocal D community for a moment. You mean all 35 of them? It's not about changing my mind! A massive thing is that code point level handling is the incumbent, and changing it would need to bring an absolutely Earth-shattering improvement to be worth it! - A huge part of unicode chars can be normalized to fit your definition. That way not everything works (far from it) but a sufficiently big subset works. Cool. - On the other hand without normalization it just doesn't make any sense from a user perspective. The ö example has clearly shown that much (you even admitted it yourself by stating that many counter-arguments would have worked had the string been normalized). Yah, operating at code point level does not come free of caveats. It is vastly superior to operating on code units, and did I mention it's the incumbent. - The most prominent problem is with graphemes that can have different representations, as those that can't be normalized can't be searched as dchars as well. Yah, I'd say if the program needs graphemes the option is there. Phobos by default deals with code points which are not perfect but are independent of representation, produce meaningful and consistent results with std.algorithm etc. - If autodecoding to code points is to stay and in an effort to find a compromise then normalizing should be done by default. Sure it would take some more time but it wouldn't break any code (I think) and would actually make things more correct. 
They still wouldn't be correct but I feel that something as crazy as unicode cannot be tackled generically anyway. Some more work on normalization at strategic points in Phobos would be interesting! Andrei
Re: The Case Against Autodecode
On 02.06.2016 23:23, Andrei Alexandrescu wrote: On 6/2/16 5:19 PM, Timon Gehr wrote: On 02.06.2016 23:16, Timon Gehr wrote: On 02.06.2016 23:06, Andrei Alexandrescu wrote: As the examples show, the examples would be entirely meaningless at code unit level. So far, I needed to count the number of characters 'ö' inside some string exactly zero times, (Obviously this isn't even what the example would do. I predict I will never need to count the number of code points 'ö' by calling some function from std.algorithm directly.) You may look for a specific dchar, and it'll work. How about findAmong("...") with a bunch of ASCII and Unicode punctuation symbols? -- Andrei .̂ ̪.̂ (Copy-paste it somewhere else, I think it might not be rendered correctly on the forum.) The point is that if I do: ".̂ ̪.̂".normalize!NFC.byGrapheme.findAmong([Grapheme("."),Grapheme(",")]) no match is returned. If I use your method with dchars, I will get spurious matches. I.e. the suggested method to look for punctuation symbols is incorrect: writeln(".̂ ̪.̂".findAmong(",.")); // ".̂ ̪.̂" (Also, do you have a use case for this?)
Re: The Case Against Autodecode
On 6/2/16 5:38 PM, deadalnix wrote: On Thursday, 2 June 2016 at 21:37:11 UTC, Andrei Alexandrescu wrote: On 6/2/16 5:35 PM, deadalnix wrote: On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu wrote: On 6/2/16 5:20 PM, deadalnix wrote: The good thing when you define works by whatever it does right now No, it works as it was designed. -- Andrei Nobody says it doesn't. Everybody says the design is crap. I think I like it more after this thread. -- Andrei You start reminding me of the joke with that guy complaining that everybody is going backward on the highway. Touché. (Get it?) -- Andrei
Re: The Case Against Autodecode
On 6/2/16 5:37 PM, Andrei Alexandrescu wrote: On 6/2/16 5:35 PM, deadalnix wrote: On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu wrote: On 6/2/16 5:20 PM, deadalnix wrote: The good thing when you define works by whatever it does right now No, it works as it was designed. -- Andrei Nobody says it doesn't. Everybody says the design is crap. I think I like it more after this thread. -- Andrei Meh, thinking of it again: I don't like it more, I'd still do it differently given a clean slate (viz. RCStr). But let's say I didn't get many compelling reasons to remove autodecoding from this thread. -- Andrei
Re: The Case Against Autodecode
On Thursday, 2 June 2016 at 20:29:48 UTC, Andrei Alexandrescu wrote: On 06/02/2016 04:22 PM, cym13 wrote: A:“We should decode to code points” B:“No, decoding to code points is a stupid idea.” A:“No it's not!” B:“Can you show a concrete example where it does something useful?” A:“Sure, look at that!” B:“This isn't working at all, look at all those counter-examples!” A:“It may not work for your examples but look how easy it is to find code points!” With autodecoding all of std.algorithm operates correctly on code points. Without it all it does for strings is gibberish. -- Andrei Allow me to try another angle: - There are different levels of Unicode support and you don't want to support them all transparently. That's understandable. - The level you choose to support is the code point level. There are many good arguments about why this isn't a good default but you won't change your mind. I don't like that at all and I'm not alone, but let's forget the entirety of the vocal D community for a moment. - A huge part of Unicode chars can be normalized to fit your definition. That way not everything works (far from it) but a sufficiently big subset works. - On the other hand, without normalization it just doesn't make any sense from a user perspective. The ö example has clearly shown that much; you even admitted it yourself by stating that many counter-arguments would have worked had the string been normalized. - The most prominent problem is with graphemes that can have different representations, as those that can't be normalized can't be searched as dchars either. - If autodecoding to code points is to stay, then in an effort to find a compromise normalizing should be done by default. Sure it would take some more time but it wouldn't break any code (I think) and would actually make things more correct. They still wouldn't be correct but I feel that something as crazy as Unicode cannot be tackled generically anyway.
Re: The Case Against Autodecode
On Thursday, 2 June 2016 at 21:37:11 UTC, Andrei Alexandrescu wrote: On 6/2/16 5:35 PM, deadalnix wrote: On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu wrote: On 6/2/16 5:20 PM, deadalnix wrote: The good thing when you define works by whatever it does right now No, it works as it was designed. -- Andrei Nobody says it doesn't. Everybody says the design is crap. I think I like it more after this thread. -- Andrei You start reminding me of the joke with that guy complaining that everybody is going backward on the highway.
Re: The Case Against Autodecode
On Thursday, 2 June 2016 at 21:30:51 UTC, tsbockman wrote: On Thursday, 2 June 2016 at 21:07:19 UTC, default0 wrote: The level 2 support description noted that it should be opt-in because it's slow. 1) It does not say that level 2 should be opt-in; it says that level 2 should be toggle-able. Nowhere does it say which of level 1 and 2 should be the default. 2) It says that working with graphemes is slower than UTF-16 code UNITS (level 1), but says nothing about streaming decoding of code POINTS (what we have). 3) That document is from 2000, and its claims about performance are surely extremely out-dated, anyway. Computers and the Unicode standard have both changed much since then. 1) Right, because a special toggleable syntax is definitely not "opt-in". 2) Several people in this thread noted that working on graphemes is way slower (which makes sense, because it's yet another processing step you need to do after decoding - therefore more work - therefore slower) than working on code points. 3) Not an argument - doing more work makes code slower. The only thing that changes is which specific operations have what cost (for instance, memory access has a much higher cost now than it had then). Considering the way the process works, and judging from what others in this thread have said about it, I will stick with "always decoding to graphemes for all operations is very slow" and indulge in being too lazy to write benchmarks to show just how bad it is.
Re: The Case Against Autodecode
On 6/2/16 5:35 PM, deadalnix wrote: On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu wrote: On 6/2/16 5:20 PM, deadalnix wrote: The good thing when you define works by whatever it does right now No, it works as it was designed. -- Andrei Nobody says it doesn't. Everybody says the design is crap. I think I like it more after this thread. -- Andrei
Re: The Case Against Autodecode
On 6/2/16 5:35 PM, ag0aep6g wrote: On 06/02/2016 11:27 PM, Andrei Alexandrescu wrote: On 6/2/16 5:24 PM, ag0aep6g wrote: On 06/02/2016 11:06 PM, Andrei Alexandrescu wrote: Nope, that's a radically different matter. As the examples show, the examples would be entirely meaningless at code unit level. They're simply not possible. Won't compile. They do compile. Yes, you're right, of course they do. char implicitly converts to dchar. I didn't think of that anti-feature. As I said: this thread produces an unpleasant amount of arguments in favor of autodecoding. Even I don't like that :o). It's more of an argument against char : dchar, I'd say. I do think that's an interesting option in PL design space, but that would be super disruptive. -- Andrei
Re: The Case Against Autodecode
On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu wrote: On 6/2/16 5:20 PM, deadalnix wrote: The good thing when you define works by whatever it does right now No, it works as it was designed. -- Andrei Nobody says it doesn't. Everybody says the design is crap.
Re: The Case Against Autodecode
On 06/02/2016 11:27 PM, Andrei Alexandrescu wrote: On 6/2/16 5:24 PM, ag0aep6g wrote: On 06/02/2016 11:06 PM, Andrei Alexandrescu wrote: Nope, that's a radically different matter. As the examples show, the examples would be entirely meaningless at code unit level. They're simply not possible. Won't compile. They do compile. Yes, you're right, of course they do. char implicitly converts to dchar. I didn't think of that anti-feature. As I said: this thread produces an unpleasant amount of arguments in favor of autodecoding. Even I don't like that :o). It's more of an argument against char : dchar, I'd say.
Re: The Case Against Autodecode
On Thursday, 2 June 2016 at 21:07:19 UTC, default0 wrote: The level 2 support description noted that it should be opt-in because it's slow. 1) It does not say that level 2 should be opt-in; it says that level 2 should be toggle-able. Nowhere does it say which of level 1 and 2 should be the default. 2) It says that working with graphemes is slower than UTF-16 code UNITS (level 1), but says nothing about streaming decoding of code POINTS (what we have). 3) That document is from 2000, and its claims about performance are surely extremely out-dated, anyway. Computers and the Unicode standard have both changed much since then.
Re: The Case Against Autodecode
On 6/2/16 5:27 PM, Andrei Alexandrescu wrote: On 6/2/16 5:24 PM, ag0aep6g wrote: Just like there is no single code point for 'a⃗' so you can't search for it in a range of code points. Of course you can. Correx, indeed you can't. -- Andrei
Re: The Case Against Autodecode
On 02.06.2016 22:51, Andrei Alexandrescu wrote: On 06/02/2016 04:50 PM, Timon Gehr wrote: On 02.06.2016 22:28, Andrei Alexandrescu wrote: On 06/02/2016 04:12 PM, Timon Gehr wrote: It is not meaningful to compare utf-8 and utf-16 code units directly. But it is meaningful to compare Unicode code points. -- Andrei It is also meaningful to compare two utf-8 code units or two utf-16 code units. By decoding them of course. -- Andrei That makes no sense, I cannot decode single code units. BTW, I guess the reason why char converts to wchar converts to dchar is that the lower half of code units in char and the lower half of code units in wchar are code points. Maybe code units and code points with low numerical values should have distinct types.
Re: The Case Against Autodecode
On 6/2/16 5:20 PM, deadalnix wrote: The good thing when you define works by whatever it does right now No, it works as it was designed. -- Andrei
Re: The Case Against Autodecode
On 6/2/16 5:23 PM, Timon Gehr wrote: On 02.06.2016 22:51, Andrei Alexandrescu wrote: On 06/02/2016 04:50 PM, Timon Gehr wrote: On 02.06.2016 22:28, Andrei Alexandrescu wrote: On 06/02/2016 04:12 PM, Timon Gehr wrote: It is not meaningful to compare utf-8 and utf-16 code units directly. But it is meaningful to compare Unicode code points. -- Andrei It is also meaningful to compare two utf-8 code units or two utf-16 code units. By decoding them of course. -- Andrei That makes no sense, I cannot decode single code units. BTW, I guess the reason why char converts to wchar converts to dchar is that the lower half of code units in char and the lower half of code units in wchar are code points. Maybe code units and code points with low numerical values should have distinct types. Then you lost me. (I'm sure you're making a good point.) -- Andrei
Re: The Case Against Autodecode
On 02.06.2016 23:20, deadalnix wrote: The sample code won't count the instances of the grapheme 'ö' as some of its encodings won't be counted, which definitively counts as doesn't work. It also has false positives (you can combine 'ö' with some combining character in order to get some strange character that is not an 'ö', and not even NFC helps with that).
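[Editorial illustration, not from the thread.] Timon's false-positive case can be sketched in Python; the particular combining mark is my choice, picked because it has no precomposed form with 'ö', so NFC is powerless here.

```python
import unicodedata

# 'ö' (U+00F6) followed by COMBINING LOW LINE (U+0332): the resulting
# grapheme is an underlined o-umlaut, not a plain 'ö'.
s = "\u00F6\u0332"

# A code-point-level search nonetheless reports a (spurious) 'ö'.
assert "\u00F6" in s

# No precomposed "o with diaeresis and low line" exists, so NFC leaves
# the sequence unchanged -- normalization does not remove the false positive.
assert unicodedata.normalize("NFC", s) == s
```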
Re: The Case Against Autodecode
On 6/2/16 5:24 PM, ag0aep6g wrote: On 06/02/2016 11:06 PM, Andrei Alexandrescu wrote: Nope, that's a radically different matter. As the examples show, the examples would be entirely meaningless at code unit level. They're simply not possible. Won't compile. They do compile. There is no single UTF-8 code unit for 'ö', so you can't (easily) search for it in a range for code units. Of course you can. Can you search for an int in a short[]? Oh yes you can. Can you search for a dchar in a char[]? Of course you can. Autodecoding also gives it meaning. Just like there is no single code point for 'a⃗' so you can't search for it in a range of code points. Of course you can. You can still search for 'a', and 'o', and the rest of ASCII in a range of code units. You can search for a dchar in a char[] because you can compare an individual dchar with either another dchar (correct, autodecoding) or with a char (incorrect, no autodecoding). As I said: this thread produces an unpleasant amount of arguments in favor of autodecoding. Even I don't like that :o). Andrei
Re: The Case Against Autodecode
On 06/02/2016 11:24 PM, ag0aep6g wrote: They're simply not possible. Won't compile. There is no single UTF-8 code unit for 'ö', so you can't (easily) search for it in a range for code units. Just like there is no single code point for 'a⃗' so you can't search for it in a range of code points. You can still search for 'a', and 'o', and the rest of ASCII in a range of code units. I'm ignoring combining characters there. You can search for 'a' in code units in the same way that you can search for 'ä' in code points. I.e., more or less, depending on how serious you are about combining characters.
Re: The Case Against Autodecode
On Thursday, 2 June 2016 at 20:27:27 UTC, Walter Bright wrote: On 6/2/2016 12:34 PM, deadalnix wrote: On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote: Pretty much everything. Consider s and s1 string variables with possibly different encodings (UTF8/UTF16). * s.all!(c => c == 'ö') works only with autodecoding. It always returns false without. False. Many characters can be represented by different sequences of code points. For instance, ê can be a single code point, or e followed by a combining ^ modifier. ö is one such character. There are 3 levels of Unicode support. What Andrei is talking about is Level 1. http://unicode.org/reports/tr18/tr18-5.1.html I wonder what rationale there is for Unicode to have two different sequences of code points be treated as the same. It's madness. To be able to convert back and forth from/to Unicode in a lossless manner.
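[Editorial illustration, not from the thread.] deadalnix's composed/decomposed 'ê' example, sketched in Python with `unicodedata`: the two spellings are distinct code point sequences that canonical normalization maps onto each other.

```python
import unicodedata

composed = "\u00EA"     # 'ê' as the single code point U+00EA
decomposed = "e\u0302"  # 'e' followed by COMBINING CIRCUMFLEX ACCENT (U+0302)

# Different code point sequences, same visible character.
assert composed != decomposed

# NFC composes the pair into the single code point; NFD splits it back.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```

This is exactly why a code-point-level `s.all!(c => c == 'ö')` gives different answers for the two spellings unless the input is normalized first.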
Re: The Case Against Autodecode
On 06/02/2016 11:06 PM, Andrei Alexandrescu wrote: Nope, that's a radically different matter. As the examples show, the examples would be entirely meaningless at code unit level. They're simply not possible. Won't compile. There is no single UTF-8 code unit for 'ö', so you can't (easily) search for it in a range for code units. Just like there is no single code point for 'a⃗' so you can't search for it in a range of code points. You can still search for 'a', and 'o', and the rest of ASCII in a range of code units.
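[Editorial illustration, not from the thread.] ag0aep6g's claim that searching for ASCII characters in a range of UTF-8 code units is safe rests on a design property of UTF-8: every byte of a multi-byte sequence has its high bit set, so an ASCII byte can never occur inside one. A small Python sketch:

```python
s = "gr\u00F6\u00DFer"   # "größer": contains multi-byte characters ö and ß
b = s.encode("utf-8")

# Byte-level search for the ASCII letter 'r' finds exactly the same
# occurrences as a character-level search: the bytes of 'ö' (C3 B6) and
# 'ß' (C3 9F) all have the high bit set and cannot alias 'r' (0x72).
assert b.count(b"r") == s.count("r") == 2
```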
Re: The Case Against Autodecode
On 6/2/16 5:19 PM, Timon Gehr wrote: On 02.06.2016 23:16, Timon Gehr wrote: On 02.06.2016 23:06, Andrei Alexandrescu wrote: As the examples show, the examples would be entirely meaningless at code unit level. So far, I needed to count the number of characters 'ö' inside some string exactly zero times, (Obviously this isn't even what the example would do. I predict I will never need to count the number of code points 'ö' by calling some function from std.algorithm directly.) You may look for a specific dchar, and it'll work. How about findAmong("...") with a bunch of ASCII and Unicode punctuation symbols? -- Andrei
Re: The Case Against Autodecode
On Thursday, 2 June 2016 at 20:13:52 UTC, Andrei Alexandrescu wrote: On 06/02/2016 03:34 PM, deadalnix wrote: On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote: Pretty much everything. Consider s and s1 string variables with possibly different encodings (UTF8/UTF16). * s.all!(c => c == 'ö') works only with autodecoding. It always returns false without. False. True. "Are all code points equal to this one?" -- Andrei The good thing when you define "works" as whatever it does right now is that everything always works and there are literally never any bugs. The bad thing is that this is a completely useless definition of "works". The sample code won't count the instances of the grapheme 'ö' as some of its encodings won't be counted, which definitively counts as doesn't work. When your point needs to redefine words in ways that nobody agrees with, it is time to admit the point is bogus.
Re: The Case Against Autodecode
On 02.06.2016 23:06, Andrei Alexandrescu wrote: As the examples show, the examples would be entirely meaningless at code unit level. So far, I needed to count the number of characters 'ö' inside some string exactly zero times, but I wanted to chain or join strings relatively often.
Re: The Case Against Autodecode
On 6/2/16 5:05 PM, tsbockman wrote: On Thursday, 2 June 2016 at 20:56:26 UTC, Walter Bright wrote: What is supposed to be done with "do not merge" PRs other than close them? Occasionally people need to try something on the auto tester (not sure if that's relevant to that particular PR, though). Presumably if someone marks their own PR as "do not merge", it means they're planning to either close it themselves after it has served its purpose, or they plan to fix/finish it and then remove the "do not merge" label. Feel free to reopen if it helps, it wasn't closed in anger. -- Andrei
Re: The Case Against Autodecode
On Thursday, 2 June 2016 at 20:52:29 UTC, ag0aep6g wrote: On 06/02/2016 10:36 PM, Andrei Alexandrescu wrote: By whom? The "support level 1" folks yonder at the Unicode standard? :o) -- Andrei Do they say that level 1 should be the default, and do they give a rationale for that? Would you kindly link or quote that? The level 2 support description noted that it should be opt-in because it's slow. Arguably it should be easier to operate on code units if you know it's safe to do so, but either always working on code units or always working on graphemes as the default seems to be either too broken too often or too slow too often. Now one can argue either consistency for code units (because then we can treat char[] and friends as a slice) or correctness for graphemes, but really the more I think about it the more I think there is no good default and you need to learn Unicode anyway. The only sad parts here are that 1) we hijacked an array type for strings, which sucks, and 2) we don't have an API that is actually good at teaching the user what it does and doesn't do. The consequence of 1 is that generic code that also wants to deal with strings will want to special-case to get rid of auto-decoding; the consequence of 2 is that we will have tons of not-actually-correct string handling. 
I would assume that almost all string handling code out in the wild is broken anyway (in code I have encountered I have never seen attempts to normalize or do other things before or after comparisons, searching, etc.), unless of course YOU or one of your colleagues wrote it (consider that checking the length of a string in Java or C# to validate it is no longer than X characters is often done and wrong, because .Length is the number of UTF-16 code units in those languages) :o) So really, as bad and alarming as "incorrect string handling" by default seems, in practice in other languages that get used way more than D it has not prevented people from writing working (internationalized!) applications. One could say we should do it better than them, but I would be inclined to believe that RCStr provides our opportunity to do so. Having char[] be what it is is an annoying wart, and maybe at some point we can deprecate/remove that behaviour, but for now I'd rather see if RCStr is viable than attempt to change the semantics of all string handling code in D.
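[Editorial illustration, not from the thread.] The Java/C# .Length pitfall mentioned above can be sketched in Python, whose len() counts code points rather than UTF-16 code units (the G clef character is my choice of example; any character outside the Basic Multilingual Plane behaves the same way):

```python
s = "\U0001D11E"  # MUSICAL SYMBOL G CLEF (U+1D11E), outside the BMP

# Python's len() counts code points: one character, one code point.
assert len(s) == 1

# In UTF-16 the same character is a surrogate pair (two 16-bit units),
# which is what Java's String.length() and C#'s .Length would report.
utf16_units = len(s.encode("utf-16-le")) // 2
assert utf16_units == 2
```

So a "max X characters" check based on .Length silently rejects strings early (or accepts long ones) whenever astral-plane characters such as emoji are involved.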
Re: The Case Against Autodecode
On Thursday, 2 June 2016 at 20:56:26 UTC, Walter Bright wrote: What is supposed to be done with "do not merge" PRs other than close them? Occasionally people need to try something on the auto tester (not sure if that's relevant to that particular PR, though). Presumably if someone marks their own PR as "do not merge", it means they're planning to either close it themselves after it has served its purpose, or they plan to fix/finish it and then remove the "do not merge" label. Either way, they shouldn't be closed just because they say "do not merge" (unless they're abandoned or something, obviously).
Re: The Case Against Autodecode
On 6/2/16 5:01 PM, ag0aep6g wrote: On 06/02/2016 10:50 PM, Andrei Alexandrescu wrote: It does not fall apart for code points. Yes it does. You've been given plenty examples where it falls apart. There weren't any. Your answer to that was that it operates on code points, not graphemes. That is correct. Well, duh. Comparing UTF-8 code units against each other works, too. That's not an argument for doing that by default. Nope, that's a radically different matter. As the examples show, the examples would be entirely meaningless at code unit level. Andrei
Re: The Case Against Autodecode
On 06/02/2016 10:50 PM, Andrei Alexandrescu wrote: It does not fall apart for code points. Yes it does. You've been given plenty examples where it falls apart. Your answer to that was that it operates on code points, not graphemes. Well, duh. Comparing UTF-8 code units against each other works, too. That's not an argument for doing that by default.
Re: The Case Against Autodecode
On Thursday, 2 June 2016 at 20:49:52 UTC, Andrei Alexandrescu wrote: On 06/02/2016 04:47 PM, tsbockman wrote: That doesn't sound like much of an endorsement for defaulting to only level 1 support to me - "it does not handle more complex languages or extensions to the Unicode Standard very well". Code point/Level 1 support sounds like a sweet spot between efficiency/complexity and conviviality. Level 2 is opt-in with byGrapheme. -- Andrei Actually, according to the document Walter Bright linked, level 1 does NOT operate at the code point level: Level 1: Basic Unicode Support. At this level, the regular expression engine provides support for Unicode characters as basic 16-bit logical units. (This is independent of the actual serialization of Unicode as UTF-8, UTF-16BE, UTF-16LE, or UTF-32.) ... Level 1 support works well in many circumstances. However, it does not handle more complex languages or extensions to the Unicode Standard very well. Particularly important cases are **surrogates** ... So, level 1 appears to be UTF-16 code units, not code points. To do code points it would have to recognize surrogates, which are specifically mentioned as not supported. Level 2 skips straight to graphemes, and there is no code point level. However, this document is very old - from Unicode 3.0 and the year 2000: While there are no surrogate characters in Unicode 3.0 (outside of private use characters), future versions of Unicode will contain them... Perhaps level 1 has since been redefined?
Re: The Case Against Autodecode
On Thursday, 2 June 2016 at 20:56:26 UTC, Walter Bright wrote: What is supposed to be done with "do not merge" PRs other than close them? Experimentally iterate until something workable comes about. This way it's done publicly and people can collaborate.
Re: The Case Against Autodecode
On 6/2/2016 1:46 PM, Adam D. Ruppe wrote: The compiler can help you with that. That's the point of the do not merge PR: it got an actionable list out of the compiler and proved the way forward was viable. What is supposed to be done with "do not merge" PRs other than close them?
Re: The Case Against Autodecode
On 06/02/2016 04:52 PM, ag0aep6g wrote: On 06/02/2016 10:36 PM, Andrei Alexandrescu wrote: By whom? The "support level 1" folks yonder at the Unicode standard? :o) -- Andrei Do they say that level 1 should be the default, and do they give a rationale for that? Would you kindly link or quote that? No, but that sounds agreeable to me, especially since it breaks no code of ours. We really should document this better. Kudos to Walter for finding all that Level 1 support. Andrei