Re: The Case Against Autodecode

2016-06-05 Thread Walter Bright via Digitalmars-d

On 6/5/2016 1:05 AM, deadalnix wrote:

TIL: books are read by computers.


I should introduce you to a fabulous technology called OCR. :-)


Re: The Case Against Autodecode

2016-06-05 Thread Walter Bright via Digitalmars-d

On 6/5/2016 1:07 AM, deadalnix wrote:

On Saturday, 4 June 2016 at 03:03:16 UTC, Walter Bright wrote:

Oh rubbish. Let go of the idea that choosing bad fonts should drive Unicode
codepoint decisions.



Interestingly enough, I've mentioned earlier here that only people from the US
would believe that documents with mixed languages aren't commonplace. I wasn't
expecting to be proven right that fast.



You'd be in error. I've been casually working on my grandfather's thesis trying 
to make a web version of it, and it is mixed German, French, and English. I've 
also made a digital version of an old history book that is mixed English, Old 
English, German, French, Greek, Old Greek, and Egyptian hieroglyphs (available 
on Amazons in your neighborhood!).


I've also lived in Germany for 3 years, though that was before computers took 
over the world.


Re: The Case Against Autodecode

2016-06-05 Thread docandrew via Digitalmars-d

On Saturday, 4 June 2016 at 08:12:47 UTC, Walter Bright wrote:

On 6/3/2016 11:17 PM, H. S. Teoh via Digitalmars-d wrote:
On Fri, Jun 03, 2016 at 08:03:16PM -0700, Walter Bright via Digitalmars-d wrote:

It works for books.

Because books don't allow their readers to change the font.


Unicode is not the font.


This madness already exists *without* Unicode. If you have a page with a 
single glyph 'm' printed on it and show it to an English speaker, he will 
say it's lowercase M. Show it to a Russian speaker, and he will say it's 
lowercase Т.  So which letter is it, M or Т?


It's not a problem that Unicode can solve. As you said, the meaning is in 
the context. Unicode has no context, and tries to solve something it cannot.


('m' doesn't always mean m in English, either. It depends on the context.)


Ya know, if Unicode actually solved these problems, you'd have a case. But 
it doesn't, and so you don't :-)



If you're going to represent both languages, you cannot get away from 
needing to represent letters abstractly, rather than visually.


Books do visually just fine!


So should O and 0 share the same glyph or not? They're visually the same 
thing,


No, they're not. Not even on old typewriters where every key was expensive. 
Even without the slash, the O tends to be fatter than the 0.


The very fact that we distinguish between O and 0, independently of what 
Unicode did/does, is already proof enough that going by visual 
representation is inadequate.


Except that you right now are using a font where they are different enough 
that you have no trouble at all distinguishing them without bothering to 
look it up. And so am I.


In other words toUpper and toLower do not belong in the standard library. 
Great.


Unicode and the standard library are two different things.


Even if characters in different languages share a glyph or look identical, 
though, it makes sense to duplicate them with different code 
points/units/whatever.


Simple functions like isCyrillicLetter() can then do a simple less-than / 
greater-than comparison instead of consulting a lookup table of numeric 
representations scattered throughout the Unicode table. Functions like 
toUpper and toLower become easier to write as well (for SOME languages, 
anyhow); it's simply myletter +/- numlettersinalphabet. Redundancy here is 
very helpful.
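As a rough illustration of that idea, here is a minimal D sketch. The 
function names, the choice of the basic Cyrillic block (U+0400..U+04FF), 
and the ASCII-only case shift are assumptions made up for the example; real 
Unicode casing needs per-language tables:

    import std.stdio;

    // Contiguous code-point blocks make membership a simple range check.
    bool isCyrillicLetter(dchar c)
    {
        return c >= '\u0400' && c <= '\u04FF';
    }

    // Case conversion as a fixed offset only works for alphabets laid out
    // that way; this sketch handles ASCII only.
    dchar asciiToUpper(dchar c)
    {
        if (c >= 'a' && c <= 'z')
            return cast(dchar)(c - ('a' - 'A'));
        return c;
    }

    void main()
    {
        assert(isCyrillicLetter('д'));
        assert(!isCyrillicLetter('d'));
        assert(asciiToUpper('d') == 'D');
        writeln("ok");
    }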


Maybe instead of Unicode they should have called it Babel... :)

"The Lord said, “If as one people speaking the same language they 
have begun to do this, then nothing they plan to do will be 
impossible for them. Come, let us go down and confuse their 
language so they will not understand each other.”"


-Jon


Re: The Case Against Autodecode

2016-06-05 Thread Jonathan M Davis via Digitalmars-d
On Friday, June 03, 2016 15:38:38 Walter Bright via Digitalmars-d wrote:
> On 6/3/2016 2:10 PM, Jonathan M Davis via Digitalmars-d wrote:
> > Actually, I would argue that the moment that Unicode is concerned with
> > what
> > the character actually looks like rather than what character it logically
> > is that it's gone outside of its charter. The way that characters
> > actually look is far too dependent on fonts, and aside from display code,
> > code does not care one whit what the character looks like.
>
> What I meant was pretty clear. Font is an artistic style that does not
> change context nor semantic meaning. If a font choice changes the meaning
> then it is not a font.

Well, maybe I misunderstood what was being argued, but it seemed like you've
been arguing that two characters should be considered the same just because
they look similar, whereas H. S. Teoh is arguing that two characters can be
logically distinct while still looking similar and that they should be
treated as distinct in Unicode because they're logically distinct. And if
that's what's being argued, then I agree with H. S. Teoh.

I expect - at least ideally - for Unicode to contain identifiers for
characters that are distinct from whatever their visual representation might
be. Stuff like fonts then worries about how to display them, and hopefully
doesn't do stupid stuff like making a capital I look like a lowercase l (though
fonts often do, unfortunately). But if two characters in different scripts -
be they latin and cyrillic or whatever - happen to often look the same but
would be considered two different characters by humans, then I would expect
Unicode to consider them to be different, whereas if no one would reasonably
consider them to be anything but exactly the same character, then there
should only be one character in Unicode.

However, if we really have crazy stuff where subtly different visual
representations of the letter g are considered to be one character in
English and two in Russian, then maybe those should be three different
characters in Unicode so that the English text can clearly be operating on
g, whereas the Russian text is doing whatever it does with its two
characters that happen to look like g. I don't know. That sort of thing just
gets ugly. But I definitely think that Unicode characters should be made up
of what the logical characters are and leave the visual representation up to
the fonts and the like.

Now, how to deal with uppercase vs lowercase and all of that sort of stuff
is a completely separate issue IMHO, and that comes down to how the
characters are somehow logically associated with one another, and it's going
to be very locale-specific such that it's not really part of the core of
Unicode's charter IMHO (though I'm not sure that it's bad if there's a set
of locale rules that go along with Unicode for those looking to correctly
apply such rules - they just have nothing to do with code points and
graphemes and how they're represented in code).

- Jonathan M Davis


Re: The Case Against Autodecode

2016-06-05 Thread deadalnix via Digitalmars-d

On Saturday, 4 June 2016 at 03:03:16 UTC, Walter Bright wrote:
Oh rubbish. Let go of the idea that choosing bad fonts should 
drive Unicode codepoint decisions.




Interestingly enough, I've mentioned earlier here that only 
people from the US would believe that documents with mixed 
languages aren't commonplace. I wasn't expecting to be proven 
right that fast.




Re: The Case Against Autodecode

2016-06-05 Thread deadalnix via Digitalmars-d

On Friday, 3 June 2016 at 18:43:07 UTC, Walter Bright wrote:

On 6/3/2016 9:28 AM, H. S. Teoh via Digitalmars-d wrote:
Eventually you have no choice but to encode by logical meaning rather 
than by appearance, since there are many lookalikes between different 
languages that actually mean something completely different, and often 
behave completely differently.


It's almost as if printed documents and books have never 
existed!


TIL: books are read by computers.



Re: The Case Against Autodecode

2016-06-05 Thread deadalnix via Digitalmars-d

On Friday, 3 June 2016 at 12:04:39 UTC, Chris wrote:

I do exactly this. Validate and normalize.


And once you've done this, auto decoding is useless because the 
same character has the same representation anyway.
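A minimal sketch of that validate-then-normalize workflow in D, assuming 
Phobos's std.utf.validate and std.uni.normalize; the variable names and the 
ö example are made up for illustration:

    import std.algorithm.searching : canFind;
    import std.uni : normalize;
    import std.utf : validate;

    void main()
    {
        string precomposed = "\u00F6";   // ö as a single code point
        string combining   = "o\u0308";  // o followed by combining diaeresis

        validate(precomposed);           // throws UTFException on malformed UTF-8
        validate(combining);

        // Decoding alone still sees two different code point sequences...
        assert( precomposed.canFind('ö'));
        assert(!combining.canFind('ö'));

        // ...but after normalization the two spellings agree, so comparisons
        // no longer depend on which form the input happened to use.
        assert(normalize(precomposed) == normalize(combining));
    }

Once input is normalized at the boundary, comparing code points (or even 
code units) gives the same answer for both spellings, which is the point 
being made above.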




Re: The Case Against Autodecode

2016-06-04 Thread Alix Pexton via Digitalmars-d

On 03/06/2016 20:12, Dmitry Olshansky wrote:

On 02-Jun-2016 23:27, Walter Bright wrote:



I wonder what rationale there is for Unicode to have two different
sequences of codepoints be treated as the same. It's madness.


Yeah, Unicode was not meant to be easy it seems. Or this is whatever
happens with evolutionary design that started with "everything is a
16-bit character".



Typing as someone who has spent some time creating typefaces, having two 
representations makes sense, and it didn't start with Unicode; it 
started with movable type.


It is much easier for a font designer to create the two codepoint 
versions of characters for most instances, i.e. make the base letters 
and the diacritics once. Then what I often do is make single codepoint 
versions of the ones I'm likely to use, but only if they need more 
tweaking than the kerning options of the font format allow. I'll omit 
the history lesson on how this was similar in the case of movable type.


Keyboards for different languages mean that a character that is a single 
keystroke in one case is two keys together or in sequence in another. This 
means that Unicode not only represents completed strings, but also those 
that are mid-composition. The ordering that it uses to ensure that 
graphemes have a single canonical representation is based on the order 
in which those multi-key characters are entered. I wouldn't call it 
elegant, but it's not inelegant either.


Trying to represent all sufficiently similar glyphs with the same 
codepoint would lead to a layout problem. How would you order them so 
that strings of any language can be sorted by their local sorting rules, 
without having to special-case the algorithms?


Also consider ligatures, such as those for "ff", "fi", "ffi", "fl", 
"ffl" and many, many more. Typographers create these glyphs whenever 
available kerning tools do a poor job of combining them from the 
individual glyphs. From the point of view of meaning they should still 
be represented as individual codepoints, but for display (electronic or 
print) that sequence needs to be replaced with the single codepoint for 
the ligature.
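As a side note, Unicode's compatibility normalization is what maps such 
ligature code points back to their constituent letters. A small sketch, 
assuming Phobos's std.uni.normalize with its NFC/NFKC forms:

    import std.uni : normalize, NFC, NFKC;

    void main()
    {
        string lig = "\uFB01";                   // U+FB01, the 'fi' ligature
        assert(normalize!NFC(lig)  == "\uFB01"); // canonical form keeps the ligature
        assert(normalize!NFKC(lig) == "fi");     // compatibility form splits it
    }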


I think that in order to understand the decisions of the Unicode 
committee, one has to consider that they are trying to unify the 
concerns of representing written information from two sides. One side 
prioritises storage and manipulation, while the other considers 
aesthetics and design workflow more important. My experience of using 
Unicode from both sides gives me a different appreciation for the 
difficulties of reconciling the two.


A...

P.S.

Then they started adding emojis, and I lost all faith in humanity ;)


Re: The Case Against Autodecode

2016-06-04 Thread Walter Bright via Digitalmars-d

On 6/3/2016 11:17 PM, H. S. Teoh via Digitalmars-d wrote:

On Fri, Jun 03, 2016 at 08:03:16PM -0700, Walter Bright via Digitalmars-d wrote:

It works for books.

Because books don't allow their readers to change the font.


Unicode is not the font.



This madness already exists *without* Unicode. If you have a page with a
single glyph 'm' printed on it and show it to an English speaker, he
will say it's lowercase M. Show it to a Russian speaker, and he will say
it's lowercase Т.  So which letter is it, M or Т?


It's not a problem that Unicode can solve. As you said, the meaning is in the 
context. Unicode has no context, and tries to solve something it cannot.


('m' doesn't always mean m in English, either. It depends on the context.)

Ya know, if Unicode actually solved these problems, you'd have a case. But it 
doesn't, and so you don't :-)




If you're going to represent both languages, you cannot get away from
needing to represent letters abstractly, rather than visually.


Books do visually just fine!



So should O and 0 share the same glyph or not? They're visually the same
thing,


No, they're not. Not even on old typewriters where every key was expensive. Even 
without the slash, the O tends to be fatter than the 0.




The very fact that we distinguish between O and 0, independently of what
Unicode did/does, is already proof enough that going by visual
representation is inadequate.


Except that you right now are using a font where they are different enough that 
you have no trouble at all distinguishing them without bothering to look it up. 
And so am I.




In other words toUpper and toLower do not belong in the standard
library. Great.


Unicode and the standard library are two different things.



Re: The Case Against Autodecode

2016-06-04 Thread Patrick Schluter via Digitalmars-d
One also has to take into consideration that Unicode is the way 
it is because it was not invented in a vacuum. It had to take 
the existing systems into account and find compromises allowing 
its adoption. Even if they had invented the perfect encoding, NO 
ONE WOULD HAVE USED IT, as it would have broken everything that 
already existed. As it was designed, it allowed a (relatively) 
smooth transition. Here are some points that made it even 
possible for Unicode to be adopted at all:
- 16 bits: while that choice was a bit shortsighted, 16 bits is a 
good compromise between compactness and richness (the BMP 
suffices to express nearly all living languages).
- Using more or less the same arrangement of codepoints as in the 
different codepages. This made it possible to transform legacy 
documents with simple scripts (as a matter of fact, I wrote a 
script to repair misencoded Greek documents; it consisted mainly 
of unich = ch > 0x80 ? ch + 0x2D0 : ch; see the sketch after 
this post).
- UTF-8: this was the stroke of genius, the encoding that allowed 
mixing it all without requiring awful acrobatics (Joakim is 
completely out to lunch on that one; shifting encodings without 
self-synchronisation are hellish, which is why the Chinese and 
Japanese adopted Unicode without hesitation: they had enough 
experience with their legacy encodings).
- Allowing time for the transition.

So all the points that people here criticize were in fact the 
reasons why Unicode could even become the standard it is now.
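For concreteness, here is a minimal sketch of that kind of repair in D. It 
assumes the legacy bytes are ISO 8859-7, whose Greek letters sit at a fixed 
offset of 0x2D0 below their Unicode code points; the function and variable 
names are made up for the example:

    import std.stdio;

    // Map bytes from a legacy Greek codepage (assumed ISO 8859-7) into the
    // Unicode Greek block, which mirrors the codepage layout at offset 0x2D0.
    dstring repairGreek(const(ubyte)[] legacy)
    {
        dstring result;
        foreach (b; legacy)
            result ~= cast(dchar)(b > 0x80 ? b + 0x2D0 : b);
        return result;
    }

    void main()
    {
        ubyte[] sample = [0xC1, 0xE1];        // 'Α' and 'α' in ISO 8859-7
        assert(repairGreek(sample) == "Αα"d);
        writeln(repairGreek(sample));         // prints "Αα"
    }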


Re: The Case Against Autodecode

2016-06-04 Thread Patrick Schluter via Digitalmars-d

On Friday, 3 June 2016 at 20:53:32 UTC, H. S. Teoh wrote:


Even the Greek sigma has two forms depending on whether it's at 
the end of a word or not -- so should it be two code points or 
one? If you say two, then you'd have a problem with how to 
search for sigma in Greek text, and you'd have to search for 
either medial sigma or final sigma. But if you say one, then 
you'd have a problem with having two different letterforms for 
a single codepoint.


In Unicode there are 2 different codepoints for lowercase sigma, 
ς U+03C2 and σ U+03C3, but only one uppercase sigma, Σ U+03A3 
(codepoint U+03A2 is unassigned). So your objection is not 
hypothetical; it is actually an issue for uppercase() and 
lowercase() functions.
Another difficulty, besides the dotless and dotted i of the 
Turkic languages, is the double letters used in the Latin 
transcription of Cyrillic text in eastern and southern Europe: 
dž, lj, nj and dz, which have an uppercase form (DŽ, LJ, NJ, DZ) 
and a titlecase form (Dž, Lj, Nj, Dz).
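A quick check of how the sigma point plays out with simple case mapping, 
assuming Phobos's std.uni.toUpper/toLower for single code points (which use 
Unicode's simple, context-free mappings):

    import std.uni : toLower, toUpper;

    void main()
    {
        // Both lowercase sigmas map to the single uppercase sigma...
        assert(toUpper('σ') == 'Σ');
        assert(toUpper('ς') == 'Σ');

        // ...but lowercasing Σ cannot know whether the word-final form was
        // wanted, so the final/medial distinction is lost without context.
        assert(toLower('Σ') == 'σ');
    }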




Besides, that still doesn't solve the problem of what 
"i".uppercase() should return. In most languages, it should 
return "I", but in Turkish it should not.  And if we really 
went the route of encoding Cyrillic letters the same as their 
Latin lookalikes, we'd have a problem with what "m".uppercase() 
should return, because now it depends on which font is in 
effect (if it's a Cyrillic cursive font, the correct answer is 
"Т", if it's a Latin font, the correct answer is "M" -- the 
other combinations: who knows).  That sounds far worse than 
what we have today.


As an anecdote I can tell the story of the accession of Romania 
and Bulgaria to the European Union in 2007. The issue was that 
several letters used by Romanian and Bulgarian had been forgotten 
by the Unicode consortium (Ș U+0218, ș U+0219, Ț U+021A, ț U+021B 
and 2 Cyrillic letters that I do not remember). The Romanians 
used Ş, ş, Ţ and ţ (U+015E, U+015F, U+0162 and U+0163) as 
replacements, which look a little bit alike. When the Commission 
finally managed to force Microsoft to correct the fonts to 
include them, we could start to correct the data. The transition 
was finished in 2012 and was only possible because no other 
language we deal with uses the "wrong" codepoints (Turkish does, 
but fortunately we only have a handful of Turkish texts in our 
DBs). So 5 years of ad hoc processing for the substitution of 4 
codepoints.
BTW: using combining diacritics was out of the question at that 
time, simply because Microsoft Word didn't support them and many 
documents we encountered still only used codepages (one also has 
to remember that in a big institution like the EC, the IT is 
always several years behind the open market, which means that 
when a product is at release X, the Institution might still be 
using release X minus 5 years).





Re: The Case Against Autodecode

2016-06-04 Thread H. S. Teoh via Digitalmars-d
On Fri, Jun 03, 2016 at 08:03:16PM -0700, Walter Bright via Digitalmars-d wrote:
> On 6/3/2016 6:08 PM, H. S. Teoh via Digitalmars-d wrote:
> > It's not a hard concept, except that these different letters have
> > lookalike forms with completely unrelated letters. Again:
> > 
> > - Lowercase Latin m looks visually the same as lowercase Cyrillic Т
> > in cursive form. In some font renderings the two are IDENTICAL
> > glyphs, in spite of being completely different, unrelated letters.
> > However, in non-cursive form, Cyrillic lowercase т is visually
> > distinct.
> > 
> > - Similarly, lowercase Cyrillic П in cursive font looks like
> > lowercase Latin n, and in some fonts they are identical glyphs.
> > Again, completely unrelated letters, yet they have the SAME VISUAL
> > REPRESENTATION.  However, in non-cursive font, lowercase Cyrillic П
> > is п, which is visually distinct from Latin n.
> > 
> > - These aren't the only ones, either.  Other Cyrillic false friends
> > include cursive Д, which in some fonts looks like lowercase Latin g.
> > But in non-cursive font, it's д.
> > 
> > Just given the above, it should be clear that going by visual
> > representation is NOT enough to disambiguate between these different
> > letters.
> 
> It works for books.

Because books don't allow their readers to change the font.


> Unicode invented a problem, and came up with a thoroughly wretched
> "solution" that we'll be stuck with for generations. One of those bad
> solutions is have the reader not know what a glyph actually is without
> pulling back the cover to read the codepoint. It's madness.

This madness already exists *without* Unicode. If you have a page with a
single glyph 'm' printed on it and show it to an English speaker, he
will say it's lowercase M. Show it to a Russian speaker, and he will say
it's lowercase Т.  So which letter is it, M or Т?

The fundamental problem is that writing systems for different languages
interpret the same letter forms differently.  In English, lowercase g
has at least two different forms that we recognize as the same letter.
However, to a Cyrillic reader the two forms are distinct, because one of
them looks like a Cyrillic letter but the other one looks foreign. So
should g be encoded as a single point or two different points?

In a similar vein, to a Cyrillic reader the glyphs т and m represent the
same letter, but to an English reader they are clearly two different
things.

If you're going to represent both languages, you cannot get away from
needing to represent letters abstractly, rather than visually.


> > By your argument, since lowercase Cyrillic Т is, visually, just m,
> > it should be encoded the same way as lowercase Latin m. But this is
> > untenable, because the letterform changes with a different font. So
> > you end up with the unworkable idea of a font-dependent encoding.
> 
> Oh rubbish. Let go of the idea that choosing bad fonts should drive
> Unicode codepoint decisions.

It's not a bad font. It's standard practice to print Cyrillic cursive
letters with different glyphs. Russian readers can read both without any
problem.  The same letter is represented by different glyphs, and
therefore the abstract letter is a more fundamental unit of meaning than
the glyph itself.


> > Or, to use an example closer to home, uppercase Latin O and the
> > digit 0 are visually identical. Should they be encoded as a single
> > code point or two?  Worse, in some fonts, the digit 0 is rendered
> > like Ø (to differentiate it from uppercase O). Does that mean that
> > it should be encoded the same way as the Danish letter Ø?  Obviously
> > not, but according to your "visual representation" idea, the answer
> > should be yes.
> 
> Don't confuse fonts with code points. It'd be adequate if Unicode
> defined a canonical glyph for each code point, and let the font makers
> do what they wish.

So should O and 0 share the same glyph or not? They're visually the same
thing, even though some fonts render them differently. What should be
the canonical shape of O vs. 0? If they are the same shape, then by your
argument they must be the same code point, regardless of what font
makers do to disambiguate them.  Good luck writing a parser that can't
tell an identifier that begins with O from a number literal that begins
with 0.

The very fact that we distinguish between O and 0, independently of what
Unicode did/does, is already proof enough that going by visual
representation is inadequate.


> > > The notion of 'case' should not be part of Unicode, as that is
> > > semantic information that is beyond the scope of Unicode.
> > But what should "i".toUpper return?
> 
> Not relevant to my point that Unicode shouldn't decide what "upper
> case" for all languages means, any more than Unicode should specify a
> font. Now when you argue that Unicode should make such decisions, note
> what a spectacularly hopeless job of it they've done.

In other words toUpper and toLower do not belong in the standard library. Great.

Re: The Case Against Autodecode

2016-06-03 Thread Walter Bright via Digitalmars-d

On 6/3/2016 6:08 PM, H. S. Teoh via Digitalmars-d wrote:

It's not a hard concept, except that these different letters have
lookalike forms with completely unrelated letters. Again:

- Lowercase Latin m looks visually the same as lowercase Cyrillic Т in
  cursive form. In some font renderings the two are IDENTICAL glyphs, in
  spite of being completely different, unrelated letters.  However, in
  non-cursive form, Cyrillic lowercase т is visually distinct.

- Similarly, lowercase Cyrillic П in cursive font looks like lowercase
  Latin n, and in some fonts they are identical glyphs. Again,
  completely unrelated letters, yet they have the SAME VISUAL
  REPRESENTATION.  However, in non-cursive font, lowercase Cyrillic П is
  п, which is visually distinct from Latin n.

- These aren't the only ones, either.  Other Cyrillic false friends
  include cursive Д, which in some fonts looks like lowercase Latin g.
  But in non-cursive font, it's д.

Just given the above, it should be clear that going by visual
representation is NOT enough to disambiguate between these different
letters.


It works for books. Unicode invented a problem, and came up with a thoroughly 
wretched "solution" that we'll be stuck with for generations. One of those bad 
solutions is to have the reader not know what a glyph actually is without pulling 
back the cover to read the codepoint. It's madness.




By your argument, since lowercase Cyrillic Т is, visually,
just m, it should be encoded the same way as lowercase Latin m. But this
is untenable, because the letterform changes with a different font. So
you end up with the unworkable idea of a font-dependent encoding.


Oh rubbish. Let go of the idea that choosing bad fonts should drive Unicode 
codepoint decisions.




Or, to use an example closer to home, uppercase Latin O and the digit 0
are visually identical. Should they be encoded as a single code point or
two?  Worse, in some fonts, the digit 0 is rendered like Ø (to
differentiate it from uppercase O). Does that mean that it should be
encoded the same way as the Danish letter Ø?  Obviously not, but
according to your "visual representation" idea, the answer should be
yes.


Don't confuse fonts with code points. It'd be adequate if Unicode defined a 
canonical glyph for each code point, and let the font makers do what they wish.




The notion of 'case' should not be part of Unicode, as that is
semantic information that is beyond the scope of Unicode.

But what should "i".toUpper return?


Not relevant to my point that Unicode shouldn't decide what "upper case" for all 
languages means, any more than Unicode should specify a font. Now when you argue 
that Unicode should make such decisions, note what a spectacularly hopeless job 
of it they've done.




Re: The Case Against Autodecode

2016-06-03 Thread ketmar via Digitalmars-d

On Saturday, 4 June 2016 at 02:46:31 UTC, Walter Bright wrote:

On 6/3/2016 5:42 PM, ketmar wrote:

sometimes used Cyrillic font to represent English.


Nobody here suggested using the wrong font, it's completely 
irrelevant.


you suggested that unicode designers should make similar-looking 
glyphs share the same code, and it reminds me of this little story. 
maybe i misunderstood you, though.


Re: The Case Against Autodecode

2016-06-03 Thread Walter Bright via Digitalmars-d

On 6/3/2016 5:42 PM, ketmar wrote:

sometimes used Cyrillic font to represent English.


Nobody here suggested using the wrong font, it's completely irrelevant.



Re: The Case Against Autodecode

2016-06-03 Thread H. S. Teoh via Digitalmars-d
On Fri, Jun 03, 2016 at 03:35:18PM -0700, Walter Bright via Digitalmars-d wrote:
> On 6/3/2016 1:53 PM, H. S. Teoh via Digitalmars-d wrote:
[...]
> > 'Cos by that argument, serif and sans serif letters should have
> > different encodings, because in languages like Hebrew, a tiny little
> > serif could mean the difference between two completely different
> > letters.
> 
> If they are different letters, then they should have a different code
> point.  I don't see why this is such a hard concept.
[...]

It's not a hard concept, except that these different letters have
lookalike forms with completely unrelated letters. Again:

- Lowercase Latin m looks visually the same as lowercase Cyrillic Т in
  cursive form. In some font renderings the two are IDENTICAL glyphs, in
  spite of being completely different, unrelated letters.  However, in
  non-cursive form, Cyrillic lowercase т is visually distinct.

- Similarly, lowercase Cyrillic П in cursive font looks like lowercase
  Latin n, and in some fonts they are identical glyphs. Again,
  completely unrelated letters, yet they have the SAME VISUAL
  REPRESENTATION.  However, in non-cursive font, lowercase Cyrillic П is
  п, which is visually distinct from Latin n.

- These aren't the only ones, either.  Other Cyrillic false friends
  include cursive Д, which in some fonts looks like lowercase Latin g.
  But in non-cursive font, it's д.

Just given the above, it should be clear that going by visual
representation is NOT enough to disambiguate between these different
letters.  By your argument, since lowercase Cyrillic Т is, visually,
just m, it should be encoded the same way as lowercase Latin m. But this
is untenable, because the letterform changes with a different font. So
you end up with the unworkable idea of a font-dependent encoding.

Similarly, since lowercase Cyrillic П is n (in cursive font), we should
encode it the same way as Latin lowercase n. But again, the letterform
changes based on font.  Your criteria of "same visual representation"
does not work outside of English.  What you imagine to be a simple,
straightforward concept is far from being simple once you're dealing
with the diverse languages and writing systems of the world.

Or, to use an example closer to home, uppercase Latin O and the digit 0
are visually identical. Should they be encoded as a single code point or
two?  Worse, in some fonts, the digit 0 is rendered like Ø (to
differentiate it from uppercase O). Does that mean that it should be
encoded the same way as the Danish letter Ø?  Obviously not, but
according to your "visual representation" idea, the answer should be
yes.

The bottomline is that uppercase O and the digit 0 represent different
LOGICAL entities, in spite of their sharing the same visual
representation.  Eventually you have to resort to representing *logical*
entities ("characters") rather than visual appearance, which is a
property of the font, and has no place in a digital text encoding.


> > Besides, that still doesn't solve the problem of what
> > "i".uppercase() should return. In most languages, it should return
> > "I", but in Turkish it should not.
> > And if we really went the route of encoding Cyrillic letters the
> > same as their Latin lookalikes, we'd have a problem with what
> > "m".uppercase() should return, because now it depends on which font
> > is in effect (if it's a Cyrillic cursive font, the correct answer is
> > "Т", if it's a Latin font, the correct answer is "M" -- the other
> > combinations: who knows).  That sounds far worse than what we have
> > today.
> 
> The notion of 'case' should not be part of Unicode, as that is
> semantic information that is beyond the scope of Unicode.

But what should "i".toUpper return?  Or are you saying the standard
library should not include such a basic function as a case-changing
function?


T

-- 
Customer support: the art of getting your clients to pay for your own
incompetence.


Re: The Case Against Autodecode

2016-06-03 Thread ketmar via Digitalmars-d

On Friday, 3 June 2016 at 18:43:07 UTC, Walter Bright wrote:
It's almost as if printed documents and books have never 
existed!
some old xUSSR books which had some English text sometimes used a 
Cyrillic font to represent the English. it was awful, and barely 
readable. this was done to ease the work of compositors, and the 
result was unacceptable. do you feel a recognizable pattern here? 
;-)


Re: The Case Against Autodecode

2016-06-03 Thread Adam D. Ruppe via Digitalmars-d

On Friday, 3 June 2016 at 22:38:38 UTC, Walter Bright wrote:

If a font choice changes the meaning then it is not a font.


Nah, then it is an Awesome Font that is totally Web Scale!

i wish i was making that up: http://fontawesome.io/ (i hate that 
thing)


But, it is kinda legal: gotta love the Unicode private use area!


Re: The Case Against Autodecode

2016-06-03 Thread Walter Bright via Digitalmars-d

On 6/3/2016 2:10 PM, Jonathan M Davis via Digitalmars-d wrote:

Actually, I would argue that the moment that Unicode is concerned with what
the character actually looks like rather than what character it logically is
that it's gone outside of its charter. The way that characters actually look
is far too dependent on fonts, and aside from display code, code does not
care one whit what the character looks like.


What I meant was pretty clear. Font is an artistic style that does not change 
context nor semantic meaning. If a font choice changes the meaning then it is 
not a font.




Re: The Case Against Autodecode

2016-06-03 Thread Walter Bright via Digitalmars-d

On 6/3/2016 1:53 PM, H. S. Teoh via Digitalmars-d wrote:

But if we were to encode appearance instead of logical meaning, that
would mean the *same* lowercase Cyrillic ь would have multiple,
different encodings depending on which font was in use.


I don't see that consequence at all.



That doesn't
seem like the right solution either.  Do we really want Unicode strings
to encode font information too??


No.


 'Cos by that argument, serif and sans
serif letters should have different encodings, because in languages like
Hebrew, a tiny little serif could mean the difference between two
completely different letters.


If they are different letters, then they should have a different code point. I 
don't see why this is such a hard concept.




And what of the Arabic and Indic scripts? They would need to encode the
same letter multiple times, each being a variation of the physical form
that changes depending on the surrounding context. Even the Greek sigma
has two forms depending on whether it's at the end of a word or not --
so should it be two code points or one?


Two. Again, why is this hard to grasp? If there is meaning in having two 
different visual representations, then they are two codepoints. If the visual 
representation is the same, then it is one codepoint. If the difference is only 
due to font selection, then it is the same codepoint.




Besides, that still doesn't solve the problem of what "i".uppercase()
should return. In most languages, it should return "I", but in Turkish
it should not.
And if we really went the route of encoding Cyrillic
letters the same as their Latin lookalikes, we'd have a problem with
what "m".uppercase() should return, because now it depends on which font
is in effect (if it's a Cyrillic cursive font, the correct answer is
"Т", if it's a Latin font, the correct answer is "M" -- the other
combinations: who knows).  That sounds far worse than what we have
today.


The notion of 'case' should not be part of Unicode, as that is semantic 
information that is beyond the scope of Unicode.


Re: The Case Against Autodecode

2016-06-03 Thread Jonathan M Davis via Digitalmars-d
On Friday, June 03, 2016 03:08:43 Walter Bright via Digitalmars-d wrote:
> On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
> > At the time
> > Unicode also had to grapple with tricky issues like what to do with
> > lookalike characters that served different purposes or had different
> > meanings, e.g., the mu sign in the math block vs. the real letter mu in
> > the Greek block, or the Cyrillic A which looks and behaves exactly like
> > the Latin A, yet the Cyrillic Р, which looks like the Latin P, does
> > *not* mean the same thing (it's the equivalent of R), or the Cyrillic В
> > whose lowercase is в not b, and also had a different sound, but
> > lowercase Latin b looks very similar to Cyrillic ь, which serves a
> > completely different purpose (the uppercase is Ь, not B, you see).
>
> I don't see that this is tricky at all. Adding additional semantic meaning
> that does not exist in printed form was outside of the charter of Unicode.
> Hence there is no justification for having two distinct characters with
> identical glyphs.
>
> They should have put me in charge of Unicode. I'd have put a stop to much of
> the madness :-)

Actually, I would argue that the moment that Unicode is concerned with what
the character actually looks like rather than what character it logically is
that it's gone outside of its charter. The way that characters actually look
is far too dependent on fonts, and aside from display code, code does not
care one whit what the character looks like.

For instance, take the capital letter I, the lowercase letter l, and the
number one. In some fonts that are feeling cruel towards folks who actually
want to read them, two of those characters - or even all three of them -
look identical. But I think that you'll agree that those characters should
be represented as distinct characters in Unicode regardless of what they
happen to look like in a particular font.

Now, take a cyrillic letter that looks similar to a latin letter. If they're
logically equivalent such that no code would ever want to distinguish
between the two and such that no font would ever even consider representing
them differently, then they're truly the same letter, and they should only
have one Unicode representation. But if anyone would ever consider them to
be logically distinct, then it makes no sense for them to be considered to
be the same character by Unicode, because they don't have the same identity.
And that distinction is quite clear if any font would ever consider
representing the two characters differently, no matter how slight that
difference might be.

Really, what a character looks like has nothing to do with Unicode. The
exact same Unicode is used regardless of how the text is displayed. Rather,
what Unicode is doing is providing logical identifiers for characters so
that code can operate on them, and display code can then do whatever it does
to display those characters, whether they happen to look similar or not. I
would think that the fact that non-display code does not care one whit about
what a character looks like and that display code can have drastically
different visual representations for the same character would make it clear
that Unicode is concerned with having identifiers for logical characters and
that that is distinct from any visual representation.

- Jonathan M Davis




Re: The Case Against Autodecode

2016-06-03 Thread H. S. Teoh via Digitalmars-d
On Fri, Jun 03, 2016 at 11:43:07AM -0700, Walter Bright via Digitalmars-d wrote:
> On 6/3/2016 9:28 AM, H. S. Teoh via Digitalmars-d wrote:
> > Eventually you have no choice but to encode by logical meaning
> > rather than by appearance, since there are many lookalikes between
> > different languages that actually mean something completely
> > different, and often behave completely differently.
> 
> It's almost as if printed documents and books have never existed!

But if we were to encode appearance instead of logical meaning, that
would mean the *same* lowercase Cyrillic ь would have multiple,
different encodings depending on which font was in use. That doesn't
seem like the right solution either.  Do we really want Unicode strings
to encode font information too??  'Cos by that argument, serif and sans
serif letters should have different encodings, because in languages like
Hebrew, a tiny little serif could mean the difference between two
completely different letters.

And what of the Arabic and Indic scripts? They would need to encode the
same letter multiple times, each being a variation of the physical form
that changes depending on the surrounding context. Even the Greek sigma
has two forms depending on whether it's at the end of a word or not --
so should it be two code points or one? If you say two, then you'd have
a problem with how to search for sigma in Greek text, and you'd have to
search for either medial sigma or final sigma. But if you say one, then
you'd have a problem with having two different letterforms for a single
codepoint.

Besides, that still doesn't solve the problem of what "i".uppercase()
should return. In most languages, it should return "I", but in Turkish
it should not.  And if we really went the route of encoding Cyrillic
letters the same as their Latin lookalikes, we'd have a problem with
what "m".uppercase() should return, because now it depends on which font
is in effect (if it's a Cyrillic cursive font, the correct answer is
"Т", if it's a Latin font, the correct answer is "M" -- the other
combinations: who knows).  That sounds far worse than what we have
today.


T

-- 
Let's eat some disquits while we format the biskettes.


Re: The Case Against Autodecode

2016-06-03 Thread Walter Bright via Digitalmars-d

On 6/3/2016 11:54 AM, Timon Gehr wrote:

On 03.06.2016 20:41, Walter Bright wrote:

How did people ever get by with printed books and documents?

They can disambiguate the letters based on context well enough.


Characters do not have semantic meaning. Their meaning is always inferred from 
the context. Unicode's troubles started the moment they stepped beyond their 
charter.


Re: The Case Against Autodecode

2016-06-03 Thread Dmitry Olshansky via Digitalmars-d

On 02-Jun-2016 23:27, Walter Bright wrote:

On 6/2/2016 12:34 PM, deadalnix wrote:

On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:

Pretty much everything. Consider s and s1 string variables with possibly
different encodings (UTF8/UTF16).

* s.all!(c => c == 'ö') works only with autodecoding. It always returns
false without.


False. Many characters can be represented by different sequences of
codepoints. For instance, ê can be ê as one codepoint or ^ as a modifier
followed by e. ö is one such character.


There are 3 levels of Unicode support. What Andrei is talking about is
Level 1.

http://unicode.org/reports/tr18/tr18-5.1.html

I wonder what rationale there is for Unicode to have two different
sequences of codepoints be treated as the same. It's madness.


Yeah, Unicode was not meant to be easy it seems. Or this is whatever 
happens with evolutionary design that started with "everything is a 
16-bit character".


--
Dmitry Olshansky


Re: The Case Against Autodecode

2016-06-03 Thread Adam D. Ruppe via Digitalmars-d

On Friday, 3 June 2016 at 18:41:36 UTC, Walter Bright wrote:

How did people ever get by with printed books and documents?


Printed books pick one font and one layout, and are then read by 
people. They don't have to be represented in some format where 
end users can change the font and size etc.


Re: The Case Against Autodecode

2016-06-03 Thread Timon Gehr via Digitalmars-d

On 03.06.2016 20:41, Walter Bright wrote:

On 6/3/2016 3:14 AM, Vladimir Panteleev wrote:

That's not right either. Cyrillic letters can look slightly different
from their
latin lookalikes in some circumstances.

I'm sure there are extremely good reasons for not using the latin
lookalikes in
the Cyrillic alphabets, because most (all?) 8-bit Cyrillic encodings use
separate codes for the lookalikes. It's not restricted to Unicode.



How did people ever get by with printed books and documents?


They can disambiguate the letters based on context well enough.


Re: The Case Against Autodecode

2016-06-03 Thread Walter Bright via Digitalmars-d

On 6/3/2016 9:28 AM, H. S. Teoh via Digitalmars-d wrote:

Eventually you have no choice but to encode by logical meaning rather
than by appearance, since there are many lookalikes between different
languages that actually mean something completely different, and often
behave completely differently.


It's almost as if printed documents and books have never existed!



Re: The Case Against Autodecode

2016-06-03 Thread Walter Bright via Digitalmars-d

On 6/3/2016 3:14 AM, Vladimir Panteleev wrote:

That's not right either. Cyrillic letters can look slightly different from their
latin lookalikes in some circumstances.

I'm sure there are extremely good reasons for not using the latin lookalikes in
the Cyrillic alphabets, because most (all?) 8-bit Cyrillic encodings use
separate codes for the lookalikes. It's not restricted to Unicode.



How did people ever get by with printed books and documents?


Re: The Case Against Autodecode

2016-06-03 Thread Walter Bright via Digitalmars-d

On 6/3/2016 3:10 AM, Vladimir Panteleev wrote:

I don't think it would work (or at least, the analogy doesn't hold). It would
mean that you can't add new precomposited characters, because that means that
previously valid sequences are now invalid.


So don't add new precomposited characters when a recognized existing sequence 
exists.


Re: The Case Against Autodecode

2016-06-03 Thread H. S. Teoh via Digitalmars-d
On Fri, Jun 03, 2016 at 10:14:15AM +0000, Vladimir Panteleev via Digitalmars-d wrote:
> On Friday, 3 June 2016 at 10:08:43 UTC, Walter Bright wrote:
> > On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
> > > At the time Unicode also had to grapple with tricky issues like
> > > what to do with lookalike characters that served different
> > > purposes or had different meanings, e.g., the mu sign in the math
> > > block vs. the real letter mu in the Greek block, or the Cyrillic A
> > > which looks and behaves exactly like the Latin A, yet the Cyrillic
> > > Р, which looks like the Latin P, does *not* mean the same thing
> > > (it's the equivalent of R), or the Cyrillic В whose lowercase is в
> > > not b, and also had a different sound, but lowercase Latin b looks
> > > very similar to Cyrillic ь, which serves a completely different
> > > purpose (the uppercase is Ь, not B, you see).
> > 
> > I don't see that this is tricky at all. Adding additional semantic
> > meaning that does not exist in printed form was outside of the
> > charter of Unicode. Hence there is no justification for having two
> > distinct characters with identical glyphs.
> 
> That's not right either. Cyrillic letters can look slightly different
> from their latin lookalikes in some circumstances.
> 
> I'm sure there are extremely good reasons for not using the latin
> lookalikes in the Cyrillic alphabets, because most (all?) 8-bit
> Cyrillic encodings use separate codes for the lookalikes. It's not
> restricted to Unicode.

Yeah, lowercase Cyrillic П is п, which looks like lowercase Greek π in
some fonts, but in cursive form it looks more like Latin lowercase n.
It wouldn't make sense to encode Cyrillic п the same as Greek π or Latin
lowercase n just by appearance, since logically it stands as its own
character despite its various appearances.  But it wouldn't make sense
to encode it differently just because you're using a different font!
Similarly, lowercase Cyrillic т in some cursive fonts looks like
lowercase Latin m.  I don't think it would make sense to encode
lowercase Т as Latin m just because of that.

Eventually you have no choice but to encode by logical meaning rather
than by appearance, since there are many lookalikes between different
languages that actually mean something completely different, and often
behave completely differently.


T

-- 
People say I'm indecisive, but I'm not sure about that. -- YHL, CONLANG


Re: The Case Against Autodecode

2016-06-03 Thread Nick Sabalausky via Digitalmars-d

On 06/02/2016 05:37 PM, Andrei Alexandrescu wrote:

On 6/2/16 5:35 PM, deadalnix wrote:

On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu wrote:

On 6/2/16 5:20 PM, deadalnix wrote:

The good thing when you define "works" by whatever it does right now


No, it works as it was designed. -- Andrei


Nobody says it doesn't. Everybody says the design is crap.


I think I like it more after this thread. -- Andrei


Well there's a fantastic argument.



Re: The Case Against Autodecode

2016-06-03 Thread Chris via Digitalmars-d

On Friday, 3 June 2016 at 11:46:50 UTC, Jonathan M Davis wrote:
On Friday, June 03, 2016 10:10:18 Vladimir Panteleev via Digitalmars-d wrote:

On Friday, 3 June 2016 at 10:05:11 UTC, Walter Bright wrote:
> On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
>> However, this meant that some precomposed characters were 
>> "redundant": they represented character + diacritic 
>> combinations that could equally well be expressed separately. 
>> Normalization was the inevitable consequence.
>
> It is not inevitable. Simply disallow the 2 codepoint sequences 
> - the single one has to be used instead.
>
> There is precedent. Some characters can be encoded with more 
> than one UTF-8 sequence, and the longer sequences were declared 
> invalid. Simple.
>
> I.e. have the normalization up front when the text is created 
> rather than everywhere else.


I don't think it would work (or at least, the analogy doesn't 
hold). It would mean that you can't add new precomposited 
characters, because that means that previously valid sequences 
are now invalid.


I would have argued that no composited characters should have 
ever existed regardless of what was done in previous encodings, 
since they're redundant, and you need the non-composited 
characters to avoid a combinatorial explosion of characters, so 
you can't have characters that just have a composited version 
and be consistent. However, the Unicode folks obviously didn't 
go that route. But given where we sit now, even though we're 
stuck with some composited characters, I'd argue that we should 
at least never add any new ones. But who knows what the Unicode 
folks are actually going to do.


As it is, you probably should normalize strings in many cases 
where they enter the program, just like ideally, you'd validate 
them when they enter the program. But regardless, you have to 
deal with the fact that multiple normalization schemes exist 
and that there's no guarantee that you're even going to get 
valid Unicode, let alone Unicode that's normalized the way you 
want.


- Jonathan M Davis


I do exactly this. Validate and normalize.


Re: The Case Against Autodecode

2016-06-03 Thread Jonathan M Davis via Digitalmars-d
On Friday, June 03, 2016 10:10:18 Vladimir Panteleev via Digitalmars-d wrote:
> On Friday, 3 June 2016 at 10:05:11 UTC, Walter Bright wrote:
> > On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
> >> However, this
> >> meant that some precomposed characters were "redundant": they
> >> represented character + diacritic combinations that could
> >> equally well
> >> be expressed separately. Normalization was the inevitable
> >> consequence.
> >
> > It is not inevitable. Simply disallow the 2 codepoint sequences
> > - the single one has to be used instead.
> >
> > There is precedent. Some characters can be encoded with more
> > than one UTF-8 sequence, and the longer sequences were declared
> > invalid. Simple.
> >
> > I.e. have the normalization up front when the text is created
> > rather than everywhere else.
>
> I don't think it would work (or at least, the analogy doesn't
> hold). It would mean that you can't add new precomposited
> characters, because that means that previously valid sequences
> are now invalid.

I would have argued that no composited characters should have ever existed
regardless of what was done in previous encodings, since they're redundant,
and you need the non-composited characters to avoid a combinatorial
explosion of characters, so you can't have characters that just have a
composited version and be consistent. However, the Unicode folks obviously
didn't go that route. But given where we sit now, even though we're stuck
with some composited characters, I'd argue that we should at least never add
any new ones. But who knows what the Unicode folks are actually going to do.

As it is, you probably should normalize strings in many cases where they
enter the program, just like ideally, you'd validate them when they enter
the program. But regardless, you have to deal with the fact that multiple
normalization schemes exist and that there's no guarantee that you're even
going to get valid Unicode, let alone Unicode that's normalized the way you
want.

- Jonathan M Davis



Re: The Case Against Autodecode

2016-06-03 Thread Vladimir Panteleev via Digitalmars-d

On Friday, 3 June 2016 at 10:08:43 UTC, Walter Bright wrote:

On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:

At the time Unicode also had to grapple with tricky issues like 
what to do with lookalike characters that served different 
purposes or had different meanings, e.g., the mu sign in the math 
block vs. the real letter mu in the Greek block, or the Cyrillic 
A which looks and behaves exactly like the Latin A, yet the 
Cyrillic Р, which looks like the Latin P, does *not* mean the 
same thing (it's the equivalent of R), or the Cyrillic В whose 
lowercase is в not b, and also had a different sound, but 
lowercase Latin b looks very similar to Cyrillic ь, which serves 
a completely different purpose (the uppercase is Ь, not B, you 
see).


I don't see that this is tricky at all. Adding additional 
semantic meaning that does not exist in printed form was 
outside of the charter of Unicode. Hence there is no 
justification for having two distinct characters with identical 
glyphs.


That's not right either. Cyrillic letters can look slightly 
different from their latin lookalikes in some circumstances.


I'm sure there are extremely good reasons for not using the latin 
lookalikes in the Cyrillic alphabets, because most (all?) 8-bit 
Cyrillic encodings use separate codes for the lookalikes. It's 
not restricted to Unicode.




Re: The Case Against Autodecode

2016-06-03 Thread Vladimir Panteleev via Digitalmars-d

On Friday, 3 June 2016 at 10:05:11 UTC, Walter Bright wrote:

On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:

However, this
meant that some precomposed characters were "redundant": they
represented character + diacritic combinations that could 
equally well
be expressed separately. Normalization was the inevitable 
consequence.


It is not inevitable. Simply disallow the 2 codepoint sequences 
- the single one has to be used instead.


There is precedent. Some characters can be encoded with more 
than one UTF-8 sequence, and the longer sequences were declared 
invalid. Simple.


I.e. have the normalization up front when the text is created 
rather than everywhere else.


I don't think it would work (or at least, the analogy doesn't 
hold). It would mean that you can't add new precomposited 
characters, because that means that previously valid sequences 
are now invalid.


Re: The Case Against Autodecode

2016-06-03 Thread Walter Bright via Digitalmars-d

On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:

At the time
Unicode also had to grapple with tricky issues like what to do with
lookalike characters that served different purposes or had different
meanings, e.g., the mu sign in the math block vs. the real letter mu in
the Greek block, or the Cyrillic A which looks and behaves exactly like
the Latin A, yet the Cyrillic Р, which looks like the Latin P, does
*not* mean the same thing (it's the equivalent of R), or the Cyrillic В
whose lowercase is в not b, and also had a different sound, but
lowercase Latin b looks very similar to Cyrillic ь, which serves a
completely different purpose (the uppercase is Ь, not B, you see).


I don't see that this is tricky at all. Adding additional semantic meaning that 
does not exist in printed form was outside of the charter of Unicode. Hence 
there is no justification for having two distinct characters with identical glyphs.


They should have put me in charge of Unicode. I'd have put a stop to much of the 
madness :-)


Re: The Case Against Autodecode

2016-06-03 Thread Walter Bright via Digitalmars-d

On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:

However, this
meant that some precomposed characters were "redundant": they
represented character + diacritic combinations that could equally well
be expressed separately. Normalization was the inevitable consequence.


It is not inevitable. Simply disallow the 2 codepoint sequences - the single one 
has to be used instead.


There is precedent. Some characters can be encoded with more than one UTF-8 
sequence, and the longer sequences were declared invalid. Simple.


I.e. have the normalization up front when the text is created rather than 
everywhere else.
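The UTF-8 precedent can be checked directly. A minimal sketch, assuming 
Phobos's std.utf.validate rejects overlong sequences as the standard 
requires; the byte values are the classic overlong encoding of '/':

    import std.exception : assertThrown;
    import std.utf : UTFException, validate;

    void main()
    {
        // 0xC0 0xAF is an overlong two-byte encoding of '/' (U+002F);
        // well-formed UTF-8 forbids it, which is the precedent above.
        immutable ubyte[] bytes = [0xC0, 0xAF];
        auto overlong = cast(string) bytes;

        assertThrown!UTFException(validate(overlong));
    }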


Re: The Case Against Autodecode

2016-06-03 Thread Jonathan M Davis via Digitalmars-d
On Thursday, June 02, 2016 15:05:44 Andrei Alexandrescu via Digitalmars-d 
wrote:
> The intent of autodecoding was to make std.algorithm work meaningfully
> with strings. As it's easy to see I just went through
> std.algorithm.searching alphabetically and found issues literally with
> every primitive in there. It's an easy exercise to go forth with the others.

It comes down to the question of whether it's better to fail quickly when
Unicode is handled incorrectly so that it's obvious that you're doing it
wrong, or whether it's better for it to work in a large number of cases so
that for a lot of code it "just works" but is still wrong in the general
case, and it's a lot less obvious that that's the case, so many folks won't
realize that they need to do more in order to have their string handling be
Unicode-correct.

With code units - especially UTF-8 - it becomes obvious very quickly that
treating each element of the string/range as a character is wrong. With code
points, you have to work far harder to find examples that are incorrect. So,
it's not at all obvious (especially to the lay programmer) that the Unicode
handling is incorrect and that their code is wrong - but their code will end
up working a large percentage of the time in spite of it being wrong in the
general case.

So, yes, it's trivial to show how operating on ranges of code units as if
they were characters gives incorrect results far more easily than operating
on ranges of code points does. But operating on code points as if they were
characters is still going to give incorrect results in the general case.
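To make the three levels concrete, here is a small D sketch using Phobos 
ranges; the combining-diaeresis example string is chosen for illustration:

    import std.range : walkLength;
    import std.uni : byGrapheme;
    import std.utf : byCodeUnit;

    void main()
    {
        string s = "o\u0308";                   // 'o' + combining diaeresis, rendered as ö

        assert(s.byCodeUnit.walkLength == 3);   // UTF-8 code units
        assert(s.walkLength == 2);              // auto-decoded code points
        assert(s.byGrapheme.walkLength == 1);   // user-perceived characters
    }

With auto-decoding, the middle line is what most range-based code sees by 
default, which is exactly the trade-off being debated in this thread.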

Regardless of auto-decoding, the answer is that the programmer needs to
understand the Unicode issues and use ranges of code units or code points
where appropriate and use ranges of graphemes where appropriate. It's just
that if we default to handling code points, then a lot of code will be
written which treats those as characters, and it will provide the correct
result more often than it would if it treated code units as characters.

In any case, I've probably posted too much in this thread already. It's
clear that the first step to solving this problem is to improve Phobos so
that it handles ranges of code units, code points, and graphemes correctly
whether auto-decoding is involved or not, and only then can we consider the
possibility of removing auto-decoding (and even then, the answer may still
be that we're stuck, because we consider the resulting code breakage to be
too great). But whether Phobos retains auto-decoding or not, the Unicode
handling stuff in general is the same, and what we need to do to improve the
situation is the same. So, clearly, I need to do a much better job of
finding time to work on D so that I can create some PRs to help the
situation.  Unfortunately, it's far easier to find a few minutes here and
there while waiting on other stuff to shoot off a post or two in the
newsgroup than it is to find time to substantively work on code. :|

- Jonathan M Davis



Re: The Case Against Autodecode

2016-06-03 Thread H. S. Teoh via Digitalmars-d
On Thu, Jun 02, 2016 at 05:19:48PM -0700, Walter Bright via Digitalmars-d wrote:
> On 6/2/2016 3:27 PM, John Colvin wrote:
> > > I wonder what rationale there is for Unicode to have two different
> > > sequences of codepoints be treated as the same. It's madness.
> > 
> > There are languages that make heavy use of diacritics, often several
> > on a single "character". Hebrew is a good example. Should there be
> > only one valid ordering of any given set of diacritics on any given
> > character?
> 
> I didn't say ordering, I said there should be no such thing as
> "normalization" in Unicode, where two codepoints are considered to be
> identical to some other codepoint.

I think it was a combination of historical baggage and trying to
accommodate unusual but still valid use cases.

The historical baggage was that Unicode was trying to unify all of the
various already-existing codepages out there, and many of those
codepages already come with various precomposed characters. To maximize
compatibility with existing codepages, Unicode tried to preserve as much
of the original mappings as possible within each 256-point block, so
these precomposed characters became part of the standard.

However, there weren't enough of them -- some people demanded less
common character + diacritic combinations, and some languages had
writing so complex their characters had to be composed from more basic
parts. The original Unicode range was 16-bit, so there wasn't enough
room to fit all of the precomposed characters people demanded, plus
there were other things people wanted, like multiple diacritics (e.g.,
in IPA). So the concept of combining diacritics was invented, in part to
prevent combinatorial explosion from soaking up the available code point
space, in part to allow for novel combinations of diacritics that
somebody out there somewhere might want to represent.  However, this
meant that some precomposed characters were "redundant": they
represented character + diacritic combinations that could equally well
be expressed separately. Normalization was the inevitable consequence.
(Normalization, of course, also subsumes a few other things, such as
collation, but this is one of the factors behind it.)
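
A minimal sketch of the resulting equivalence, assuming std.uni.normalize implements the usual NFC/NFD forms:

import std.uni : normalize, NFC, NFD;

void main()
{
    string precomposed = "\u00E9";   // 'é' as a single code point
    string decomposed  = "e\u0301";  // 'e' followed by a combining acute accent

    assert(precomposed != decomposed);                              // different code units
    assert(precomposed.normalize!NFC == decomposed.normalize!NFC);  // same NFC form
    assert(precomposed.normalize!NFD == decomposed.normalize!NFD);  // same NFD form
}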

(This is a greatly over-simplified description, of course. At the time
Unicode also had to grapple with tricky issues like what to do with
lookalike characters that served different purposes or had different
meanings, e.g., the mu sign in the math block vs. the real letter mu in
the Greek block, or the Cyrillic A which looks and behaves exactly like
the Latin A, yet the Cyrillic Р, which looks like the Latin P, does
*not* mean the same thing (it's the equivalent of R), or the Cyrillic В
whose lowercase is в not b, and also had a different sound, but
lowercase Latin b looks very similar to Cyrillic ь, which serves a
completely different purpose (the uppercase is Ь, not B, you see). Then
you have the wonderful Indic and Arabic cursive writings, where
letterforms mutate depending on the surrounding context, which, if you
were to include all variants as distinct code points, would occupy many
more pages than they currently do.  And also sticky issues like the
oft-mentioned Turkish i, which is encoded as a Latin i but behaves
differently w.r.t. upper/lowercasing when in Turkish locale -- some
cases of this, IIRC, are unfixable bugs in Phobos because we currently
do not handle locales. So you see, imagining that code points == the
solution to Unicode string handling is a joke. Writing correct Unicode
handling is *hard*.)
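
(A minimal sketch of the Turkish-i point, assuming std.uni.toUpper stays locale-independent as described above; in a Turkish locale the uppercase of 'i' should be 'İ', U+0130, the dotted capital I:)

import std.uni : toUpper;

void main()
{
    // Locale-independent casing: fine for English, wrong for Turkish text,
    // where "i".toUpper should yield "İ" (U+0130) rather than "I".
    assert("i".toUpper == "I");
}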

As with all sufficiently complex software projects, Unicode represents a
compromise between many contradictory factors -- writing systems in the
world being the complex, not-very-consistent beasts they are -- so such
"dirty" details are somewhat inevitable.


T

-- 
Debugging is twice as hard as writing the code in the first place. Therefore, 
if you write the code as cleverly as possible, you are, by definition, not 
smart enough to debug it. -- Brian W. Kernighan


Re: The Case Against Autodecode

2016-06-03 Thread Marco Leise via Digitalmars-d
Am Thu, 2 Jun 2016 18:54:21 -0400
schrieb Andrei Alexandrescu :

> On 06/02/2016 06:10 PM, Marco Leise wrote:
> > Am Thu, 2 Jun 2016 15:05:44 -0400
> > schrieb Andrei Alexandrescu :
> >  
> >> On 06/02/2016 01:54 PM, Marc Schütz wrote:  
> >>> Which practical tasks are made possible (and work _correctly_) if you
> >>> decode to code points, that don't already work with code units?  
> >>
> >> Pretty much everything.
> >>
> >> s.all!(c => c == 'ö')  
> >
> > Andrei, your ignorance is really starting to grind on
> > everyone's nerves.  
> 
> Indeed there seem to be serious questions about my competence, basic 
> comprehension, and now knowledge.

That's not my general impression, but something is different
with this thread.

> I understand it is tempting to assume that a disagreement is caused by 
> the other simply not understanding the matter. Even if that were true 
> it's not worth sacrificing civility over it.

Civility has had us caught in a 36-page-long, tiresome
debate with us mostly talking past each other. I was being
impolite and can't say I regret it, because I prefer this
answer over the rest of the thread. It's more informed,
elaborate and conclusive.

> > If after 350 posts you still don't see
> > why this is incorrect: s.any!(c => c == 'o'), you must be
> > actively skipping the informational content of this thread.  
> 
> Is it 'o' with an umlaut or without?
>
> At any rate, consider s of type string and x of type dchar.
> The dchar type is defined as "a Unicode code point", or at
> least my understanding that has been a reasonable definition
> to operate with in the D language ever since its first
> release. Also in the D language, the various string types
> char[], wchar[] etc. with their respective qualified
> versions are meant to hold Unicode strings with one of the
> UTF8, UTF16, and UTF32 encodings.
>
> Following these definitions, it stands to reason to infer that the call 
> s.find(c => c == x) means "find the code point x in string s and return 
> the balance of s positioned there". It's prima facie application of the 
> definitions of the entities involved.
> 
> Is this the only possible or recommended meaning? Most likely not, viz. 
> the subtle cases in which a given grapheme is represented via either one 
> or multiple code points by means of combining characters. Is it the best 
> possible meaning? It's even difficult to define what "best" means 
> (fastest, covering most languages, etc).
> 
> I'm not claiming that meaning is the only possible, the only 
> recommended, or the best possible. All I'm arguing is that it's not 
> retarded, and within a certain universe confined to operating at code 
> point level (which is reasonable per the definitions of the types 
> involved) it can be considered correct.
> 
> If at any point in the reasoning above some rampant ignorance comes 
> about, please point it out.

No, it's pretty close now. We can all agree that there is no
"best" way, only different use cases. Just defining Phobos to
work on code points gives the false illusion that it does the
correct thing in all use cases - after all, D claims to support
Unicode. But when you want to iterate over visual letters it is
incorrect, and it is needlessly slow when you work on
ASCII-structured formats (JSON, XML, paths, Warp, ...). Then
there is explaining the different default iteration schemes when
using foreach vs. the range API (no big deal, just not easily
justified) and the cost of implementation when dealing with
char[]/wchar[].

From this observation we concluded that decoding should be
opt-in and that when we need it, it should be a conscious
decision. Unicode is quite complex and learning about the
difference between code points and grapheme clusters when
segmenting strings will benefit code quality.
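
A minimal sketch of what that conscious decision looks like (assuming byCodeUnit, byGrapheme and walkLength behave as documented):

import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    string s = "noe\u0308l";   // "noël" with a decomposed 'ë'

    assert(s.byCodeUnit.walkLength == 6);   // UTF-8 code units
    assert(s.walkLength == 5);              // code points (what autodecoding yields)
    assert(s.byGrapheme.walkLength == 4);   // user-perceived characters
}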

As for the question, do multi-code-point graphemes ever appear
in the wild? OS X is known to use NFD on its native file
system and there is a hint on Wikipedia that some symbols from
Thai or Hindi's Devanagari need them:
https://en.wikipedia.org/wiki/UTF-8#Disadvantages
Some form of Lithuanian seems to have a use for them, too:
http://www.unicode.org/L2/L2012/12026r-n4191-lithuanian.pdf
Aside from those there is nothing generally wrong about
decomposed letters appearing in strings, even though the
use of NFC is encouraged.

> > […harsh tone removed…] in the end we have to assume you
> > will make a decisive vote against any PR with the intent
> > to remove auto-decoding from Phobos.  
> 
> This seems to assume I have some vesting in the position
> that makes it independent of facts. That is not the case. I
> do what I think is right to do, and you do what you think is
> right to do.

Your vote outweighs that of many others for better or worse.
When a decision needs to be made and the community is divided,
we need you or Walter or anyone who is invested in the matter
to cast a ruling vote. However when several dozen people

Re: The Case Against Autodecode

2016-06-03 Thread H. S. Teoh via Digitalmars-d
On Thu, Jun 02, 2016 at 04:29:48PM -0400, Andrei Alexandrescu via Digitalmars-d 
wrote:
> On 06/02/2016 04:22 PM, cym13 wrote:
> > 
> > A:“We should decode to code points”
> > B:“No, decoding to code points is a stupid idea.”
> > A:“No it's not!”
> > B:“Can you show a concrete example where it does something useful?”
> > A:“Sure, look at that!”
> > B:“This isn't working at all, look at all those counter-examples!”
> > A:“It may not work for your examples but look how easy it is to
> > find code points!”
> 
> With autodecoding all of std.algorithm operates correctly on code points.
> Without it all it does for strings is gibberish. -- Andrei

With ASCII strings, all of std.algorithm operates correctly on ASCII
bytes. So let's standardize on ASCII strings.

What a vacuous argument! Basically you're saying "I define code points
to be correct. Therefore, I conclude that decoding to code points is
correct."  Well, duh.  Unfortunately such vacuous conclusions have no
bearing in the real world of Unicode handling.


T

-- 
I am Ohm of Borg. Resistance is voltage over current.


Re: The Case Against Autodecode

2016-06-03 Thread H. S. Teoh via Digitalmars-d
On Thu, Jun 02, 2016 at 04:28:45PM -0400, Andrei Alexandrescu via Digitalmars-d 
wrote:
> On 06/02/2016 04:17 PM, Timon Gehr wrote:
> > I.e. you are saying that 'works' means 'operates on code points'.
> 
> Affirmative. -- Andrei

Again, a ridiculous position.  I can use exactly the same line of
argument for why we should just standardize on ASCII. All I have to do
is to define "work" to mean "operates on an ASCII character", and then
every ASCII algorithm "works" by definition, so nobody can argue with
me.

Unfortunately, everybody else's definition of "work" is different from
mine, so the argument doesn't hold water.

Similarly, you are the only one whose definition of "work" means
"operates on code points". Basically nobody else here uses that
definition, so while you may be right according to your own made-up
tautological arguments, none of your conclusions actually have any
bearing in the real world of Unicode handling.

Give it up. It is beyond reasonable doubt that autodecoding is a
liability. D should be moving away from autodecoding instead of clinging
to historical mistakes in the face of overwhelming evidence. (And note,
I said *auto*-decoding; decoding by itself obviously is very relevant.
But it needs to be opt-in because of its performance and correctness
implications. The user needs to be able to choose whether to decode, and
how to decode.)


T


-- 
Freedom: (n.) Man's self-given right to be enslaved by his own depravity.


Re: The Case Against Autodecode

2016-06-03 Thread H. S. Teoh via Digitalmars-d
On Thu, Jun 02, 2016 at 04:38:28PM -0400, Andrei Alexandrescu via Digitalmars-d 
wrote:
> On 06/02/2016 04:36 PM, tsbockman wrote:
> > Your examples will pass or fail depending on how (and whether) the
> > 'ö' grapheme is normalized.
> 
> And that's fine. Want graphemes, .byGrapheme wags its tail in that
> corner.  Otherwise, you work on code points which is a completely
> meaningful way to go about things. What's not meaningful is the random
> results you get from operating on code units.
> 
> > They only ever succeeds because 'ö' happens to be one of the
> > privileged graphemes that *can* be (but often isn't!) represented as
> > a single code point. Many other graphemes have no such
> > representation.
> 
> Then there's no dchar for them so no problem to start with.
> 
> s.find(c) > "Find code unit c in string s"
[...]

This is a ridiculous argument.  We might as well say, "there's no single
byte UTF-8 that can represent Ш, so that's no problem to start with" --
since we can just define it away by saying s.find(c) == "find byte c in
string s", and thereby justify using ASCII as our standard string
representation.

The point is that dchar is NOT ENOUGH TO REPRESENT A SINGLE CHARACTER in
the general case.  It is adequate for a subset of characters -- just
like ASCII is also adequate for a subset of characters.  If you only
need to work with ASCII, it suffices to work with ubyte[]. Similarly, if
your work is restricted to only languages without combining diacritics,
then a range of dchar suffices. But a range of dchar is NOT good enough
in the general case, and arguing that it does only makes you look like a
fool.

Appealing to normalization doesn't change anything either, since only a
subset of base character + diacritic combinations will normalize to a
single code point. If the string has a base character + diacritic
combination that doesn't have a precomposed code point, it will NOT fit in a
dchar. (And keep in mind that the notion of diacritic is still very
Euro-centric. In Korean, for example, a single character is composed of
multiple parts, each of which occupies 1 code point. While some
precomposed combinations do exist, they don't cover all of the
possibilities, so normalization won't help you there.)
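
A minimal sketch of a grapheme with no precomposed form (the 'a⃗' mentioned elsewhere in this thread, i.e. 'a' + U+20D7), assuming normalize and byGrapheme behave as documented:

import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

void main()
{
    // 'a' + combining right arrow above: no precomposed code point exists,
    // so even NFC leaves two code points that form a single grapheme.
    string s = "a\u20D7";

    assert(s.normalize!NFC.walkLength == 2);   // still two code points
    assert(s.byGrapheme.walkLength == 1);      // one user-perceived character
}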


T

-- 
Frank disagreement binds closer than feigned agreement.


Re: The Case Against Autodecode

2016-06-03 Thread tsbockman via Digitalmars-d

On Thursday, 2 June 2016 at 21:00:17 UTC, tsbockman wrote:
However, this document is very old - from Unicode 3.0 and the 
year 2000:


While there are no surrogate characters in Unicode 3.0 
(outside of private use characters), future versions of 
Unicode will contain them...


Perhaps level 1 has since been redefined?


I found the latest (unofficial) draft version:
http://www.unicode.org/reports/tr18/tr18-18.html

Relevant changes:

* Level 1 is to be redefined as working on code points, not code 
units:


A fundamental requirement is that Unicode text be interpreted 
semantically by code point, not code units.


* Level 2 (graphemes) is explicitly described as a "default 
level":


This is still a default level—independent of country or 
language—but provides much better support for end-user 
expectations than the raw level 1...


* All mention of level 2 being slow has been removed. The only 
reason given for making it toggle-able is for compatibility with 
level 1 algorithms:


Level 2 support matches much more what user expectations are 
for sequences of Unicode characters. It is still 
locale-independent and easily implementable. However, for 
compatibility with Level 1, it is useful to have some sort of 
syntax that will turn Level 2 support on and off.




Re: The Case Against Autodecode

2016-06-02 Thread Walter Bright via Digitalmars-d

On 6/2/2016 3:27 PM, John Colvin wrote:

I wonder what rationale there is for Unicode to have two different sequences
of codepoints be treated as the same. It's madness.


There are languages that make heavy use of diacritics, often several on a single
"character". Hebrew is a good example. Should there be only one valid ordering
of any given set of diacritics on any given character?


I didn't say ordering, I said there should be no such thing as "normalization" 
in Unicode, where two codepoints are considered to be identical to some other 
codepoint.




Re: The Case Against Autodecode

2016-06-02 Thread Walter Bright via Digitalmars-d

On 6/2/2016 2:25 PM, deadalnix wrote:

On Thursday, 2 June 2016 at 20:27:27 UTC, Walter Bright wrote:

I wonder what rationale there is for Unicode to have two different sequences
of codepoints be treated as the same. It's madness.

To be able to convert back and forth from/to unicode in a lossless manner.



Sorry, that makes no sense, as it is saying "they're the same, only different."


Re: The Case Against Autodecode

2016-06-02 Thread Walter Bright via Digitalmars-d

On 6/2/2016 4:29 PM, Jonathan M Davis via Digitalmars-d wrote:

How do you suggest that we handle the normalization issue?


Started a new thread for that one.



Re: The Case Against Autodecode

2016-06-02 Thread Jonathan M Davis via Digitalmars-d
On Thursday, June 02, 2016 15:48:03 Walter Bright via Digitalmars-d wrote:
> On 6/2/2016 3:23 PM, Andrei Alexandrescu wrote:
> > On 06/02/2016 05:58 PM, Walter Bright wrote:
> >>  > * s.balancedParens('〈', '〉') works only with autodecoding.
> >>  > * s.canFind('ö') works only with autodecoding. It returns always
> >>
> >> false without.
> >>
> >> Can be made to work without autodecoding.
> >
> > By special casing? Perhaps.
>
> The argument to canFind() can be detected as not being a char, then decoded
> into a sequence of char's, then forwarded to a substring search.

How do you suggest that we handle the normalization issue? Should we just
assume NFC like std.uni.normalize does and provide an optional template
argument to indicate a different normalization (like normalize does)? Since
without providing a way to deal with the normalization, we're not actually
making the code fully correct, just faster.

- Jonathan M Davis




Re: The Case Against Autodecode

2016-06-02 Thread Jonathan M Davis via Digitalmars-d
On Thursday, June 02, 2016 22:27:16 John Colvin via Digitalmars-d wrote:
> On Thursday, 2 June 2016 at 20:27:27 UTC, Walter Bright wrote:
> > I wonder what rationale there is for Unicode to have two
> > different sequences of codepoints be treated as the same. It's
> > madness.
>
> There are languages that make heavy use of diacritics, often
> several on a single "character". Hebrew is a good example. Should
> there be only one valid ordering of any given set of diacritics
> on any given character? It's an interesting idea, but it's not
> how things are.

Yeah. I'm inclined to think that the fact that there are multiple
normalizations was a huge mistake on the part of the Unicode folks, but
we're stuck dealing with it. And as horrible as it is for most cases, maybe
it _does_ ultimately make sense because of certain use cases; I don't know.
But bad idea or not, we're stuck. :(

- Jonathan M Davis



Re: The Case Against Autodecode

2016-06-02 Thread Jonathan M Davis via Digitalmars-d
On Thursday, June 02, 2016 18:23:19 Andrei Alexandrescu via Digitalmars-d 
wrote:
> On 06/02/2016 05:58 PM, Walter Bright wrote:
> > On 6/2/2016 1:27 PM, Andrei Alexandrescu wrote:
> >> The lambda returns bool. -- Andrei
> >
> > Yes, I was wrong about that. But the point still stands with:
> >  > * s.balancedParens('〈', '〉') works only with autodecoding.
> >  > * s.canFind('ö') works only with autodecoding. It returns always
> >
> > false without.
> >
> > Can be made to work without autodecoding.
>
> By special casing? Perhaps. I seem to recall though that one major issue
> with autodecoding was that it special-cases certain algorithms. So you'd
> need to go through all of std.algorithm and make sure you can
> special-case your way out of situations that work today.

Yeah, I believe that you do have to do some special casing, though it would
be special casing on ranges of code units in general and not strings
specifically, and a lot of those functions are already special cased on
string in an attempt be efficient. In particular, with a function like find
or canFind, you'd take the needle and encode it to match the haystack it was
passed so that you can do the comparisons via code units. So, you incur the
encoding cost once when encoding the needle rather than incurring the
decoding cost of each code point or grapheme as you iterate over the
haystack. So, you end up with something that's correct and efficient. It's
also much friendlier to code that only operates on ASCII.
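
A minimal sketch of that needle-encoding approach; canFindNoDecode is a hypothetical helper, not a Phobos function:

import std.algorithm.searching : canFind;
import std.utf : byCodeUnit, encode;

bool canFindNoDecode(string haystack, dchar needle)
{
    // Encode the needle once, then search code unit by code unit;
    // the haystack is never decoded.
    char[4] buf;
    immutable len = encode(buf, needle);
    return haystack.byCodeUnit.canFind(buf[0 .. len].byCodeUnit);
}

void main()
{
    assert(canFindNoDecode("straße", 'ß'));
    assert(!canFindNoDecode("strasse", 'ß'));
}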

The one issue that I'm not quite sure how we'd handle in that case is
normalization (which auto-decoding doesn't handle either), since you'd need
to normalize the needle to match the haystack (which also assumes that the
haystack was already normalized). Certainly, it's the sort of thing that
makes it so that you kind of wish you were dealing with a string type that
had the normalization built into it rather than either an array of code
units or an arbitrary range of code units. But maybe we could assume the NFC
normalization like std.uni.normalize does and provide an optional template
argument for the normalization scheme.

In any case, while it's not entirely straightforward, it is quite possible
to write some algorithms in a way which works on arbitrary ranges of code
units and deals with Unicode correctly without auto-decoding or requiring
that the user convert it to a range of code points or graphemes in order to
properly handle the full range of Unicode. And even if we keep
auto-decoding, we pretty much need to fix it so that std.algorithm and
friends are Unicode-aware in this manner so that ranges of code units work
in general without requiring that you use byGrapheme. So, this sort of thing
could have a large impact on RCStr, even if we keep auto-decoding for narrow
strings.

Other algorithms, however, can't be made to work automatically with Unicode
- at least not with the current range paradigm. filter, for instance, really
needs to operate on graphemes to filter on characters, but with a range of
code units, that would mean operating on groups of code units as a single
element, which you can't do with something like a range of char, since that
essentially becomes a range of ranges. It has to be wrapped in a range
that's going to provide graphemes - and of course, if you know that you're
operating only on ASCII, then you wouldn't want to deal with graphemes
anyway, so automatically converting to graphemes would be undesirable. So,
for a function like filter, it really does have to be up to the programmer
to indicate what level of Unicode they want to operate at.
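
A minimal sketch of why a code-point filter goes wrong (hypothetical input; the point is the stranded combining mark):

import std.algorithm.iteration : filter;
import std.conv : to;

void main()
{
    string s = "a\u0301.";   // decomposed 'á' followed by a full stop

    // Filtering at code-point level: dropping the base letter 'a' strands
    // its combining acute accent, which then attaches to whatever follows.
    auto noA = s.filter!(c => c != 'a').to!string;
    assert(noA == "\u0301.");   // an orphaned combining mark
}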

But if we don't make functions Unicode-aware where possible, then we're
going to take a performance hit by essentially forcing everyone to use
explicit ranges of code points or graphemes even when they should be
unnecessary. So, I think that we're stuck with some level of special casing,
but it would then be for ranges of code units and code points and not
strings. So, it would work efficiently for stuff like RCStr, which the
current scheme does not.

I think that the reality of the matter is that regardless of whether we keep
auto-decoding for narrow strings in place, we need to make Phobos operate on
arbitrary ranges of code units and code points, since even stuff like RCStr
won't work efficiently otherwise, and stuff like byCodeUnit won't be usable
in as many cases otherwise, because if a generic function isn't
Unicode-aware, then in many cases, byCodeUnit will be very wrong, just like
byCodePoint would be wrong. So, as far as Phobos goes, I'm not sure that the
question of auto-decoding matters much for what we need to do at this point.
If we do what we need to do, then Phobos will work whether we have
auto-decoding or not (working in a Unicode-aware manner where possible and
forcing the user to decide the correct level of Unicode to work at where
not), and then it just becomes a question of whether we can or should
deprecate auto-decoding once all that's done.

- Jonathan M Davis

Re: The Case Against Autodecode

2016-06-02 Thread Vladimir Panteleev via Digitalmars-d

On Thursday, 2 June 2016 at 21:56:10 UTC, Walter Bright wrote:

Yes, you have a good point. But we do allow things like:

   byte b;
   if (b == 1000) ...


Why allowing char/wchar/dchar comparisons is wrong:

void main()
{
    string s = "Привет";
    foreach (c; s)            // c is a char here: this iterates UTF-8 code units
        assert(c != 'Ñ');     // fails: the code unit 0xD1 (part of 'р') equals dchar 'Ñ' (U+00D1)
}

From my post from 2014:

http://forum.dlang.org/post/knrwiqxhlvqwxqshy...@forum.dlang.org



Re: The Case Against Autodecode

2016-06-02 Thread Timon Gehr via Digitalmars-d

On 03.06.2016 00:23, Andrei Alexandrescu wrote:

On 06/02/2016 05:58 PM, Walter Bright wrote:

On 6/2/2016 1:27 PM, Andrei Alexandrescu wrote:

The lambda returns bool. -- Andrei


Yes, I was wrong about that. But the point still stands with:

 > * s.balancedParens('〈', '〉') works only with autodecoding.
 > * s.canFind('ö') works only with autodecoding. It returns always
false without.

Can be made to work without autodecoding.


By special casing? Perhaps. I seem to recall though that one major issue
with autodecoding was that it special-cases certain algorithms.


The major issue is that it special cases when there's different, more 
natural semantics available.


Re: The Case Against Autodecode

2016-06-02 Thread Timon Gehr via Digitalmars-d

On 03.06.2016 00:26, Walter Bright wrote:

On 6/2/2016 3:11 PM, Timon Gehr wrote:

Well, this is a somewhat different case, because 1000 is just not
representable as a byte. Every value that fits in a byte fits in an int though.

It's different for code units. They are incompatible both ways.


Not exactly. (c == 'ö') is always false for the same reason that (b ==
1000) is always false.
...


Yes. And _additionally_, some other concerns apply that are not there 
for byte vs. int. I.e. if b == 1000 is disallowed, then c == d should 
be disallowed too, but b == 1000 can be allowed even if c == d is 
disallowed.



I'm not sure what the right answer is here.


char to dchar is a lossy conversion, so it shouldn't happen.
byte to int is a lossless conversion, so there is no problem a priori.


Re: The Case Against Autodecode

2016-06-02 Thread tsbockman via Digitalmars-d

On Thursday, 2 June 2016 at 22:20:49 UTC, Walter Bright wrote:

On 6/2/2016 2:05 PM, tsbockman wrote:

Presumably if someone marks their own
PR as "do not merge", it means they're planning to either 
close it themselves
after it has served its purpose, or they plan to fix/finish it 
and then remove

the "do not merge" label.


That doesn't seem to apply here, either.


Either way, they shouldn't be closed just because they say "do 
not merge"

(unless they're abandoned or something, obviously).


Something like that could not be merged until 132 other PRs are 
done to fix Phobos. It doesn't belong as a PR.


I was just responding to the general question you posed about "do 
not merge" PRs, not really arguing for that one, in particular, 
to be re-opened. I'm sure @wilzbach is willing to explain if 
anyone cares to ask him why he did it as a PR, though.


Re: The Case Against Autodecode

2016-06-02 Thread Walter Bright via Digitalmars-d

On 6/2/2016 3:10 PM, Marco Leise wrote:

we haven't looked into borrowing/scoped enough


That's my fault.

As for scoped, the idea is to make scope work analogously to DIP25's 'return 
ref'. I don't believe we need borrowing, we've worked out another solution that 
will work for ref counting.


Please do not reply to this in this thread - start a new one if you wish to 
continue with this topic.




Re: The Case Against Autodecode

2016-06-02 Thread John Colvin via Digitalmars-d

On Thursday, 2 June 2016 at 20:27:27 UTC, Walter Bright wrote:

On 6/2/2016 12:34 PM, deadalnix wrote:
On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu 
wrote:
Pretty much everything. Consider s and s1 string variables 
with possibly

different encodings (UTF8/UTF16).

* s.all!(c => c == 'ö') works only with autodecoding. It 
returns always false

without.



False. Many characters can be represented by different 
sequences of codepoints.
For instance, ê can be ê as one codepoint or ^ as a modifier 
followed by e. ö is

one such character.


There are 3 levels of Unicode support. What Andrei is talking 
about is Level 1.


http://unicode.org/reports/tr18/tr18-5.1.html

I wonder what rationale there is for Unicode to have two 
different sequences of codepoints be treated as the same. It's 
madness.


There are languages that make heavy use of diacritics, often 
several on a single "character". Hebrew is a good example. Should 
there be only one valid ordering of any given set of diacritics 
on any given character? It's an interesting idea, but it's not 
how things are.
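
A minimal sketch of how Unicode answers that ordering question today: canonical reordering during normalization makes the two orders equivalent (assuming std.uni.normalize follows UAX #15):

import std.uni : normalize, NFD;

void main()
{
    // 'e' with an acute accent and a dot below, combining marks in either order.
    string a = "e\u0301\u0323";
    string b = "e\u0323\u0301";

    assert(a != b);                              // different code point sequences
    assert(a.normalize!NFD == b.normalize!NFD);  // same canonical ordering
}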


Re: The Case Against Autodecode

2016-06-02 Thread Walter Bright via Digitalmars-d

On 6/2/2016 3:11 PM, Timon Gehr wrote:

Well, this is a somewhat different case, because 1000 is just not representable
as a byte. Every value that fits in a byte fits in an int though.

It's different for code units. They are incompatible both ways.


Not exactly. (c == 'ö') is always false for the same reason that (b == 1000) is 
always false.


I'm not sure what the right answer is here.


Re: The Case Against Autodecode

2016-06-02 Thread Andrei Alexandrescu via Digitalmars-d

On 06/02/2016 05:58 PM, Walter Bright wrote:

On 6/2/2016 1:27 PM, Andrei Alexandrescu wrote:

The lambda returns bool. -- Andrei


Yes, I was wrong about that. But the point still stands with:

 > * s.balancedParens('〈', '〉') works only with autodecoding.
 > * s.canFind('ö') works only with autodecoding. It returns always
false without.

Can be made to work without autodecoding.


By special casing? Perhaps. I seem to recall though that one major issue 
with autodecoding was that it special-cases certain algorithms. So you'd 
need to go through all of std.algorithm and make sure you can 
special-case your way out of situations that work today.



Andrei



Re: The Case Against Autodecode

2016-06-02 Thread tsbockman via Digitalmars-d

On Thursday, 2 June 2016 at 22:03:01 UTC, default0 wrote:

*sigh* reading comprehension.
...
Please do not take what I say out of context, thank you.


Earlier you said:

The level 2 support description noted that it should be opt-in 
because it's slow.


My main point is simply that you mischaracterized what the 
standard says. Making level 1 opt-in, rather than level 2, would 
be just as compliant as the reverse. The standard makes no 
suggestion as to which should be default.


Re: The Case Against Autodecode

2016-06-02 Thread Marco Leise via Digitalmars-d
Am Thu, 2 Jun 2016 15:05:44 -0400
schrieb Andrei Alexandrescu :

> On 06/02/2016 01:54 PM, Marc Schütz wrote:
> > Which practical tasks are made possible (and work _correctly_) if you
> > decode to code points, that don't already work with code units?  
> 
> Pretty much everything.
>
> s.all!(c => c == 'ö')

Andrei, your ignorance is really starting to grind on
everyone's nerves. If after 350 posts you still don't see
why this is incorrect: s.any!(c => c == 'o'), you must be
actively skipping the informational content of this thread.

You are in error, no one agrees with you, and you refuse to see
it and in the end we have to assume you will make a decisive
vote against any PR with the intent to remove auto-decoding
from Phobos.

Your so-called vocal minority is actually D's panel of Unicode
experts who understand that auto-decoding is a false ally and
should be on the deprecation track.

Remember final-by-default? You promised, that your objection
about breaking code means that D2 will only continue to be
fixed in a backwards compatible way, be it the implementation
of shared or whatever else. Yet months later you opened a
thread with the title "inout must go". So that must have been
an appeasement back then. People don't forget these things
easily and RCStr seems to be a similar distraction,
considering we haven't looked into borrowing/scoped enough and
you promise wonders from it.

-- 
Marco



Re: The Case Against Autodecode

2016-06-02 Thread Timon Gehr via Digitalmars-d

On 02.06.2016 23:56, Walter Bright wrote:

On 6/2/2016 1:12 PM, Timon Gehr wrote:

...
It is not
meaningful to compare utf-8 and utf-16 code units directly.


Yes, you have a good point. But we do allow things like:

byte b;
if (b == 1000) ...



Well, this is a somewhat different case, because 1000 is just not 
representable as a byte. Every value that fits in a byte fits in an int 
though.


It's different for code units. They are incompatible both ways. E.g. 
dchar obviously does not fit in a char, and while the lower half of char 
is compatible with dchar, the upper half is specific to the encoding. 
dchar cannot represent upper half char code units. You get the code 
points with the corresponding values instead.


E.g.:

void main(){
    import std.stdio, std.utf;
    // byCodeUnit yields the raw UTF-8 code units 0xC3 and 0xB6; declaring the
    // loop variable as dchar converts each unit to the code point of that value.
    foreach(dchar d; "ö".byCodeUnit)
        writeln(d); // "Ã", "¶"
}



Re: The Case Against Autodecode

2016-06-02 Thread default0 via Digitalmars-d

On Thursday, 2 June 2016 at 21:51:51 UTC, tsbockman wrote:

On Thursday, 2 June 2016 at 21:38:02 UTC, default0 wrote:

On Thursday, 2 June 2016 at 21:30:51 UTC, tsbockman wrote:
1) It does not say that level 2 should be opt-in; it says 
that level 2 should be toggle-able. Nowhere does it say which 
of level 1 and 2 should be the default.


2) It says that working with graphemes is slower than UTF-16 
code UNITS (level 1), but says nothing about streaming 
decoding of code POINTS (what we have).


3) That document is from 2000, and its claims about 
performance are surely extremely out-dated, anyway. Computers 
and the Unicode standard have both changed much since then.


1) Right because a special toggleable syntax is definitely not 
"opt-in".


It is not "opt-in" unless it is toggled off by default. The 
only reason it doesn't talk about toggling in the level 1 
section, is because that section is written with the assumption 
that many programs will *only* support level 1.




*sigh* reading comprehension. Needing to write .byGrapheme or 
similar to enable the behaviour qualifies as what that 
description was arguing for. I hope you understand that now that 
I am repeating this for you.


2) Several people in this thread noted that working on 
graphemes is way slower (which makes sense, because it's yet 
another processing you need to do after you decoded - 
therefore more work - therefore slower) than working on code 
points.


And working on code points is way slower than working on code 
units (the actual level 1).




Never claimed the opposite. Do note however that its specifically 
talking about UTF-16 code units.



3) Not an argument - doing more work makes code slower.


What do you think I'm arguing for? It's not 
graphemes-by-default.


Unrelated. I was refuting the point you made about the relevance 
of the performance claims of the unicode level 2 support 
description, not evaluating your hypothetical design. Please do 
not take what I say out of context, thank you.




Re: The Case Against Autodecode

2016-06-02 Thread Timon Gehr via Digitalmars-d

On 02.06.2016 23:46, Andrei Alexandrescu wrote:

On 6/2/16 5:43 PM, Timon Gehr wrote:


.̂ ̪.̂

(Copy-paste it somewhere else, I think it might not be rendered
correctly on the forum.)

The point is that if I do:

".̂ ̪.̂".normalize!NFC.byGrapheme.findAmong([Grapheme("."),Grapheme(",")])

no match is returned.

If I use your method with dchars, I will get spurious matches. I.e. the
suggested method to look for punctuation symbols is incorrect:

writeln(".̂ ̪.̂".findAmong(",.")); // ".̂ ̪.̂"


Nice example.
...


Thanks! :o)


(Also, do you have an use case for this?)


Count delimited words. Did you also look at balancedParens?


Andrei



On 02.06.2016 22:01, Timon Gehr wrote:



* s.balancedParens('〈', '〉') works only with autodecoding.
...


Doesn't work, e.g. s="⟨⃖". Shouldn't compile.


assert("⟨⃖".normalize!NFC.byGrapheme.balancedParens(Grapheme("⟨"),Grapheme("⟩")));

writeln("⟨⃖".balancedParens('⟨','⟩')); // false




Re: The Case Against Autodecode

2016-06-02 Thread Walter Bright via Digitalmars-d

On 6/2/2016 1:27 PM, Andrei Alexandrescu wrote:

The lambda returns bool. -- Andrei


Yes, I was wrong about that. But the point still stands with:

> * s.balancedParens('〈', '〉') works only with autodecoding.
> * s.canFind('ö') works only with autodecoding. It returns always false 
without.

Can be made to work without autodecoding.



Re: The Case Against Autodecode

2016-06-02 Thread Walter Bright via Digitalmars-d

On 6/2/2016 1:12 PM, Timon Gehr wrote:

On 02.06.2016 22:07, Walter Bright wrote:

On 6/2/2016 12:05 PM, Andrei Alexandrescu wrote:

* s.all!(c => c == 'ö') works only with autodecoding. It returns
always false
without.


The 'ö' is inferred as a wchar. The lambda then is inferred to return a
wchar.


No, the lambda returns a bool.


Thanks for the correction.



The algorithm can check that the input is char[], and is being
tested against a wchar. Therefore, the algorithm can specialize to do
the decoding itself.

No autodecoding necessary, and it does the right thing.


It still would not be the right thing. The lambda shouldn't compile. It is not
meaningful to compare utf-8 and utf-16 code units directly.


Yes, you have a good point. But we do allow things like:

   byte b;
   if (b == 1000) ...



Re: The Case Against Autodecode

2016-06-02 Thread tsbockman via Digitalmars-d

On Thursday, 2 June 2016 at 21:38:02 UTC, default0 wrote:

On Thursday, 2 June 2016 at 21:30:51 UTC, tsbockman wrote:
1) It does not say that level 2 should be opt-in; it says that 
level 2 should be toggle-able. Nowhere does it say which of 
level 1 and 2 should be the default.


2) It says that working with graphemes is slower than UTF-16 
code UNITS (level 1), but says nothing about streaming 
decoding of code POINTS (what we have).


3) That document is from 2000, and its claims about 
performance are surely extremely out-dated, anyway. Computers 
and the Unicode standard have both changed much since then.


1) Right because a special toggleable syntax is definitely not 
"opt-in".


It is not "opt-in" unless it is toggled off by default. The only 
reason it doesn't talk about toggling in the level 1 section, is 
because that section is written with the assumption that many 
programs will *only* support level 1.


2) Several people in this thread noted that working on 
graphemes is way slower (which makes sense, because it's yet 
another processing you need to do after you decoded - therefore 
more work - therefore slower) than working on code points.


And working on code points is way slower than working on code 
units (the actual level 1).



3) Not an argument - doing more work makes code slower.


What do you think I'm arguing for? It's not graphemes-by-default.

What I actually want to see: permanently deprecate the 
auto-decoding range primitives. Force the user to explicitly 
specify whichever of `by!dchar`, `byCodePoint`, or `byGrapheme` 
their specific algorithm actually needs. Removing the implicit 
conversions between `char`, `wchar`, and `dchar` would also be 
nice, but isn't really necessary I think.


That would be a standards-compliant solution (one of several 
possible). What we have now is non-standard, at least going by 
the old version Walter linked.


Re: The Case Against Autodecode

2016-06-02 Thread Andrei Alexandrescu via Digitalmars-d

On 6/2/16 5:43 PM, Timon Gehr wrote:


.̂ ̪.̂

(Copy-paste it somewhere else, I think it might not be rendered
correctly on the forum.)

The point is that if I do:

".̂ ̪.̂".normalize!NFC.byGrapheme.findAmong([Grapheme("."),Grapheme(",")])

no match is returned.

If I use your method with dchars, I will get spurious matches. I.e. the
suggested method to look for punctuation symbols is incorrect:

writeln(".̂ ̪.̂".findAmong(",.")); // ".̂ ̪.̂"


Nice example.


(Also, do you have an use case for this?)


Count delimited words. Did you also look at balancedParens?


Andrei


Re: The Case Against Autodecode

2016-06-02 Thread Andrei Alexandrescu via Digitalmars-d

On 6/2/16 5:38 PM, cym13 wrote:

Allow me to try another angle:

- There are different levels of unicode support and you don't want to
support them all transparently. That's understandable.


Cool.


- The level you choose to support is the code point level. There are
many good arguments about why this isn't a good default but you won't
change your mind. I don't like that at all and I'm not alone but let's
forget the entirety of the vocal D community for a moment.


You mean all 35 of them?

It's not about changing my mind! A massive factor is that code point 
level handling is the incumbent, and that changing it would need to mark 
an absolutely Earth-shattering improvement to be worth it!



- A huge part of unicode chars can be normalized to fit your
definition. That way not everything works (far from it) but a
sufficiently big subset works.


Cool.


- On the other hand, without normalization it just doesn't make any
sense from a user perspective. The ö example has clearly shown that much
(you even admitted it yourself by stating that many counter-arguments
would have worked had the string been normalized).


Yah, operating at code point level does not come free of caveats. It is 
vastly superior to operating on code units, and did I mention it's the 
incumbent.



- The most prominent problem is with graphemes that can have different
representations, as those that can't be normalized can't be searched as
dchars either.


Yah, I'd say if the program needs graphemes the option is there. Phobos 
by default deals with code points which are not perfect but are 
independent of representation, produce meaningful and consistent results 
with std.algorithm etc.



- If autodecoding to code points is to stay and in an effort to find a
compromise then normalizing should be done by default. Sure it would
take some more time but it wouldn't break any code (I think) and would
actually make things more correct. They still wouldn't be correct but
I feel that something as crazy as unicode cannot be tackled
generically anyway.


Some more work on normalization at strategic points in Phobos would be 
interesting!



Andrei




Re: The Case Against Autodecode

2016-06-02 Thread Timon Gehr via Digitalmars-d

On 02.06.2016 23:23, Andrei Alexandrescu wrote:

On 6/2/16 5:19 PM, Timon Gehr wrote:

On 02.06.2016 23:16, Timon Gehr wrote:

On 02.06.2016 23:06, Andrei Alexandrescu wrote:

As the examples show, the examples would be entirely meaningless at code
unit level.


So far, I needed to count the number of characters 'ö' inside some
string exactly zero times,


(Obviously this isn't even what the example would do. I predict I will
never need to count the number of code points 'ö' by calling some
function from std.algorithm directly.)


You may look for a specific dchar, and it'll work. How about
findAmong("...") with a bunch of ASCII and Unicode punctuation symbols?
-- Andrei




.̂ ̪.̂

(Copy-paste it somewhere else, I think it might not be rendered 
correctly on the forum.)


The point is that if I do:

".̂ ̪.̂".normalize!NFC.byGrapheme.findAmong([Grapheme("."),Grapheme(",")])

no match is returned.

If I use your method with dchars, I will get spurious matches. I.e. the 
suggested method to look for punctuation symbols is incorrect:


writeln(".̂ ̪.̂".findAmong(",.")); // ".̂ ̪.̂"


(Also, do you have an use case for this?)


Re: The Case Against Autodecode

2016-06-02 Thread Andrei Alexandrescu via Digitalmars-d

On 6/2/16 5:38 PM, deadalnix wrote:

On Thursday, 2 June 2016 at 21:37:11 UTC, Andrei Alexandrescu wrote:

On 6/2/16 5:35 PM, deadalnix wrote:

On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu wrote:

On 6/2/16 5:20 PM, deadalnix wrote:

The good thing when you define works by whatever it does right now


No, it works as it was designed. -- Andrei


Nobody says it doesn't. Everybody says the design is crap.


I think I like it more after this thread. -- Andrei


You start reminding me of the joke with that guy complaining that
everybody is going backward on the highway.


Touché. (Get it?) -- Andrei



Re: The Case Against Autodecode

2016-06-02 Thread Andrei Alexandrescu via Digitalmars-d

On 6/2/16 5:37 PM, Andrei Alexandrescu wrote:

On 6/2/16 5:35 PM, deadalnix wrote:

On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu wrote:

On 6/2/16 5:20 PM, deadalnix wrote:

The good thing when you define works by whatever it does right now


No, it works as it was designed. -- Andrei


Nobody says it doesn't. Everybody says the design is crap.


I think I like it more after this thread. -- Andrei


Meh, thinking of it again: I don't like it more, I'd still do it 
differently given a clean slate (viz. RCStr). But let's say I didn't get 
many compelling reasons to remove autodecoding from this thread. -- Andrei




Re: The Case Against Autodecode

2016-06-02 Thread cym13 via Digitalmars-d
On Thursday, 2 June 2016 at 20:29:48 UTC, Andrei Alexandrescu 
wrote:

On 06/02/2016 04:22 PM, cym13 wrote:


A:“We should decode to code points”
B:“No, decoding to code points is a stupid idea.”
A:“No it's not!”
B:“Can you show a concrete example where it does something 
useful?”

A:“Sure, look at that!”
B:“This isn't working at all, look at all those 
counter-examples!”

A:“It may not work for your examples but look how easy it is to
find code points!”


With autodecoding all of std.algorithm operates correctly on 
code points. Without it all it does for strings is gibberish. 
-- Andrei


Allow me to try another angle:

- There are different levels of unicode support and you don't want to
support them all transparently. That's understandable.

- The level you choose to support is the code point level. There are
many good arguments about why this isn't a good default but you won't
change your mind. I don't like that at all and I'm not alone but let's
forget the entirety of the vocal D community for a moment.

- A huge part of unicode chars can be normalized to fit your
definition. That way not everything works (far from it) but a
sufficiently big subset works.

- On the other hand, without normalization it just doesn't make any
sense from a user perspective. The ö example has clearly shown that much
(you even admitted it yourself by stating that many counter-arguments
would have worked had the string been normalized).

- The most prominent problem is with graphemes that can have different
representations, as those that can't be normalized can't be searched as
dchars either.

- If autodecoding to code points is to stay, then in an effort to find a
compromise normalizing should be done by default. Sure it would take
some more time but it wouldn't break any code (I think) and would
actually make things more correct. They still wouldn't be correct but
I feel that something as crazy as unicode cannot be tackled
generically anyway.



Re: The Case Against Autodecode

2016-06-02 Thread deadalnix via Digitalmars-d
On Thursday, 2 June 2016 at 21:37:11 UTC, Andrei Alexandrescu 
wrote:

On 6/2/16 5:35 PM, deadalnix wrote:
On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu 
wrote:

On 6/2/16 5:20 PM, deadalnix wrote:
The good thing when you define works by whatever it does 
right now


No, it works as it was designed. -- Andrei


Nobody says it doesn't. Everybody says the design is crap.


I think I like it more after this thread. -- Andrei


You start reminding me of the joke with that guy complaining that 
everybody is going backward on the highway.




Re: The Case Against Autodecode

2016-06-02 Thread default0 via Digitalmars-d

On Thursday, 2 June 2016 at 21:30:51 UTC, tsbockman wrote:

On Thursday, 2 June 2016 at 21:07:19 UTC, default0 wrote:
The level 2 support description noted that it should be opt-in 
because it's slow.


1) It does not say that level 2 should be opt-in; it says that 
level 2 should be toggle-able. Nowhere does it say which of 
level 1 and 2 should be the default.


2) It says that working with graphemes is slower than UTF-16 
code UNITS (level 1), but says nothing about streaming decoding 
of code POINTS (what we have).


3) That document is from 2000, and its claims about performance 
are surely extremely out-dated, anyway. Computers and the 
Unicode standard have both changed much since then.


1) Right because a special toggleable syntax is definitely not 
"opt-in".
2) Several people in this thread noted that working on graphemes 
is way slower (which makes sense, because it's yet another 
processing you need to do after you decoded - therefore more work 
- therefore slower) than working on code points.
3) Not an argument - doing more work makes code slower. The only 
thing that changes is what specific operations have what cost 
(for instance, memory access has a much higher cost now than it 
had then). Considering the way the process works and judging from 
what others in this thread have said about it, I will stick with 
"always decoding to graphemes for all operations is very slow" 
and indulge in being too lazy to write benchmarks for it to show 
just how bad it is.


Re: The Case Against Autodecode

2016-06-02 Thread Andrei Alexandrescu via Digitalmars-d

On 6/2/16 5:35 PM, deadalnix wrote:

On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu wrote:

On 6/2/16 5:20 PM, deadalnix wrote:

The good thing when you define works by whatever it does right now


No, it works as it was designed. -- Andrei


Nobody says it doesn't. Everybody says the design is crap.


I think I like it more after this thread. -- Andrei


Re: The Case Against Autodecode

2016-06-02 Thread Andrei Alexandrescu via Digitalmars-d

On 6/2/16 5:35 PM, ag0aep6g wrote:

On 06/02/2016 11:27 PM, Andrei Alexandrescu wrote:

On 6/2/16 5:24 PM, ag0aep6g wrote:

On 06/02/2016 11:06 PM, Andrei Alexandrescu wrote:

Nope, that's a radically different matter. As the examples show, the
examples would be entirely meaningless at code unit level.


They're simply not possible. Won't compile.


They do compile.


Yes, you're right, of course they do. char implicitly converts to dchar.
I didn't think of that anti-feature.


As I said: this thread produces an unpleasant amount of arguments in
favor of autodecoding. Even I don't like that :o).


It's more of an argument against char : dchar, I'd say.


I do think that's an interesting option in PL design space, but that 
would be super disruptive. -- Andrei


Re: The Case Against Autodecode

2016-06-02 Thread deadalnix via Digitalmars-d
On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu 
wrote:

On 6/2/16 5:20 PM, deadalnix wrote:
The good thing when you define works by whatever it does right 
now


No, it works as it was designed. -- Andrei


Nobody says it doesn't. Everybody says the design is crap.


Re: The Case Against Autodecode

2016-06-02 Thread ag0aep6g via Digitalmars-d

On 06/02/2016 11:27 PM, Andrei Alexandrescu wrote:

On 6/2/16 5:24 PM, ag0aep6g wrote:

On 06/02/2016 11:06 PM, Andrei Alexandrescu wrote:

Nope, that's a radically different matter. As the examples show, the
examples would be entirely meaningless at code unit level.


They're simply not possible. Won't compile.


They do compile.


Yes, you're right, of course they do. char implicitly converts to dchar. 
I didn't think of that anti-feature.



As I said: this thread produces an unpleasant amount of arguments in
favor of autodecoding. Even I don't like that :o).


It's more of an argument against char : dchar, I'd say.


Re: The Case Against Autodecode

2016-06-02 Thread tsbockman via Digitalmars-d

On Thursday, 2 June 2016 at 21:07:19 UTC, default0 wrote:
The level 2 support description noted that it should be opt-in 
because it's slow.


1) It does not say that level 2 should be opt-in; it says that 
level 2 should be toggle-able. Nowhere does it say which of level 
1 and 2 should be the default.


2) It says that working with graphemes is slower than UTF-16 code 
UNITS (level 1), but says nothing about streaming decoding of 
code POINTS (what we have).


3) That document is from 2000, and its claims about performance 
are surely extremely out-dated, anyway. Computers and the Unicode 
standard have both changed much since then.




Re: The Case Against Autodecode

2016-06-02 Thread Andrei Alexandrescu via Digitalmars-d

On 6/2/16 5:27 PM, Andrei Alexandrescu wrote:

On 6/2/16 5:24 PM, ag0aep6g wrote:

Just like there is no single code point for 'a⃗' so you can't
search for it in a range of code points.


Of course you can.


Correx, indeed you can't. -- Andrei


Re: The Case Against Autodecode

2016-06-02 Thread Timon Gehr via Digitalmars-d

On 02.06.2016 22:51, Andrei Alexandrescu wrote:

On 06/02/2016 04:50 PM, Timon Gehr wrote:

On 02.06.2016 22:28, Andrei Alexandrescu wrote:

On 06/02/2016 04:12 PM, Timon Gehr wrote:

It is not meaningful to compare utf-8 and utf-16 code units directly.


But it is meaningful to compare Unicode code points. -- Andrei



It is also meaningful to compare two utf-8 code units or two utf-16 code
units.


By decoding them of course. -- Andrei



That makes no sense, I cannot decode single code units.

BTW, I guess the reason why char converts to wchar converts to dchar is 
that the lower half of code units in char and the lower half of code 
units in wchar are code points. Maybe code units and code points with 
low numerical values should have distinct types.


Re: The Case Against Autodecode

2016-06-02 Thread Andrei Alexandrescu via Digitalmars-d

On 6/2/16 5:20 PM, deadalnix wrote:

The good thing when you define works by whatever it does right now


No, it works as it was designed. -- Andrei


Re: The Case Against Autodecode

2016-06-02 Thread Andrei Alexandrescu via Digitalmars-d

On 6/2/16 5:23 PM, Timon Gehr wrote:

On 02.06.2016 22:51, Andrei Alexandrescu wrote:

On 06/02/2016 04:50 PM, Timon Gehr wrote:

On 02.06.2016 22:28, Andrei Alexandrescu wrote:

On 06/02/2016 04:12 PM, Timon Gehr wrote:

It is not meaningful to compare utf-8 and utf-16 code units directly.


But it is meaningful to compare Unicode code points. -- Andrei



It is also meaningful to compare two utf-8 code units or two utf-16 code
units.


By decoding them of course. -- Andrei



That makes no sense, I cannot decode single code units.

BTW, I guess the reason why char converts to wchar converts to dchar is
that the lower half of code units in char and the lower half of code
units in wchar are code points. Maybe code units and code points with
low numerical values should have distinct types.


Then you lost me. (I'm sure you're making a good point.) -- Andrei


Re: The Case Against Autodecode

2016-06-02 Thread Timon Gehr via Digitalmars-d

On 02.06.2016 23:20, deadalnix wrote:


The sample code won't count the instance of the grapheme 'ö' as some of
its encoding won't be counted, which definitively count as doesn't work.


It also has false positives (you can combine 'ö' with some combining 
character in order to get some strange character that is not an 'ö', and 
not even NFC helps with that).


Re: The Case Against Autodecode

2016-06-02 Thread Andrei Alexandrescu via Digitalmars-d

On 6/2/16 5:24 PM, ag0aep6g wrote:

On 06/02/2016 11:06 PM, Andrei Alexandrescu wrote:

Nope, that's a radically different matter. As the examples show, the
examples would be entirely meaningless at code unit level.


They're simply not possible. Won't compile.


They do compile.


There is no single UTF-8
code unit for 'ö', so you can't (easily) search for it in a range for
code units.


Of course you can. Can you search for an int in a short[]? Oh yes you 
can. Can you search for a dchar in a char[]? Of course you can. 
Autodecoding also gives it meaning.



Just like there is no single code point for 'a⃗' so you can't
search for it in a range of code points.


Of course you can.


You can still search for 'a', and 'o', and the rest of ASCII in a range
of code units.


You can search for a dchar in a char[] because you can compare an 
individual dchar with either another dchar (correct, autodecoding) or 
with a char (incorrect, no autodecoding).


As I said: this thread produces an unpleasant amount of arguments in 
favor of autodecoding. Even I don't like that :o).



Andrei



Re: The Case Against Autodecode

2016-06-02 Thread ag0aep6g via Digitalmars-d

On 06/02/2016 11:24 PM, ag0aep6g wrote:

They're simply not possible. Won't compile. There is no single UTF-8
code unit for 'ö', so you can't (easily) search for it in a range of
code units. Just like there is no single code point for 'a⃗' so you can't
search for it in a range of code points.

You can still search for 'a', and 'o', and the rest of ASCII in a range
of code units.


I'm ignoring combining characters there. You can search for 'a' in code 
units in the same way that you can search for 'ä' in code points. I.e., 
more or less, depending on how serious you are about combining characters.
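
The difference is easy to see with std.utf.byCodeUnit; a small sketch 
with an arbitrary sample word:

import std.algorithm.searching : canFind;
import std.utf : byCodeUnit;

void main()
{
    string s = "B\u00E4ume";  // "Bäume", with 'ä' precomposed

    // ASCII can be found at the code unit level without any decoding...
    assert(s.byCodeUnit.canFind('e'));

    // ...whereas 'ä' needs at least code point awareness (and graphemes
    // if the text may use the decomposed spelling).
    assert(s.canFind('\u00E4'));
}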


Re: The Case Against Autodecode

2016-06-02 Thread deadalnix via Digitalmars-d

On Thursday, 2 June 2016 at 20:27:27 UTC, Walter Bright wrote:

On 6/2/2016 12:34 PM, deadalnix wrote:
On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:
Pretty much everything. Consider s and s1 string variables with possibly
different encodings (UTF8/UTF16).

* s.all!(c => c == 'ö') works only with autodecoding. It always returns
false without.


False. Many characters can be represented by different sequences of
codepoints. For instance, ê can be ê as one codepoint or ^ as a modifier
followed by e. ö is one such character.


There are 3 levels of Unicode support. What Andrei is talking 
about is Level 1.


http://unicode.org/reports/tr18/tr18-5.1.html

I wonder what rationale there is for Unicode to have two 
different sequences of codepoints be treated as the same. It's 
madness.


To be able to convert back and forth from/to unicode in a 
lossless manner.
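
Both spellings are easy to produce; a minimal D sketch (std.uni.normalize 
defaults to NFC):

import std.algorithm.searching : canFind;
import std.uni : normalize;

void main()
{
    string precomposed = "sch\u00F6n";   // 'ö' as the single code point U+00F6
    string decomposed  = "scho\u0308n";  // 'o' followed by a combining diaeresis

    assert( precomposed.canFind('\u00F6'));  // found
    assert(!decomposed.canFind('\u00F6'));   // not found: different code points

    // Only after normalization do the two spellings compare equal.
    assert(normalize(decomposed) == precomposed);
}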




Re: The Case Against Autodecode

2016-06-02 Thread ag0aep6g via Digitalmars-d

On 06/02/2016 11:06 PM, Andrei Alexandrescu wrote:

Nope, that's a radically different matter. As the examples show, the
examples would be entirely meaningless at code unit level.


They're simply not possible. Won't compile. There is no single UTF-8 
code unit for 'ö', so you can't (easily) search for it in a range of 
code units. Just like there is no single code point for 'a⃗' so you can't 
search for it in a range of code points.


You can still search for 'a', and 'o', and the rest of ASCII in a range 
of code units.


Re: The Case Against Autodecode

2016-06-02 Thread Andrei Alexandrescu via Digitalmars-d

On 6/2/16 5:19 PM, Timon Gehr wrote:

On 02.06.2016 23:16, Timon Gehr wrote:

On 02.06.2016 23:06, Andrei Alexandrescu wrote:

As the examples show, the examples would be entirely meaningless at code
unit level.


So far, I needed to count the number of characters 'ö' inside some
string exactly zero times,


(Obviously this isn't even what the example would do. I predict I will
never need to count the number of code points 'ö' by calling some
function from std.algorithm directly.)


You may look for a specific dchar, and it'll work. How about 
findAmong("...") with a bunch of ASCII and Unicode punctuation symbols? 
-- Andrei
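
For illustration, a sketch with an arbitrary punctuation set and sample 
text:

import std.algorithm.searching : findAmong;

void main()
{
    // Mixed ASCII and non-ASCII punctuation in the needle set; with
    // autodecoding the element comparisons happen per code point.
    auto rest = "Hello \u2013 world!".findAmong("\u2013!?.");
    assert(rest == "\u2013 world!");
}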





Re: The Case Against Autodecode

2016-06-02 Thread deadalnix via Digitalmars-d
On Thursday, 2 June 2016 at 20:13:52 UTC, Andrei Alexandrescu 
wrote:

On 06/02/2016 03:34 PM, deadalnix wrote:
On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:
Pretty much everything. Consider s and s1 string variables with
possibly different encodings (UTF8/UTF16).

* s.all!(c => c == 'ö') works only with autodecoding. It always
returns false without.



False.


True. "Are all code points equal to this one?" -- Andrei


The good thing about defining "works" as whatever it does right now 
is that everything always works and there are literally never any 
bugs. The bad thing is that this is a completely useless definition 
of "works".


The sample code won't count every instance of the grapheme 'ö', as 
some of its encodings won't be counted, which definitely counts as 
"doesn't work".


When your point needs to redefine words in ways that nobody agrees 
with, it is time to admit the point is bogus.




Re: The Case Against Autodecode

2016-06-02 Thread Timon Gehr via Digitalmars-d

On 02.06.2016 23:06, Andrei Alexandrescu wrote:

As the examples show, the examples would be entirely meaningless at code
unit level.


So far, I needed to count the number of characters 'ö' inside some 
string exactly zero times, but I wanted to chain or join strings 
relatively often.


Re: The Case Against Autodecode

2016-06-02 Thread Andrei Alexandrescu via Digitalmars-d

On 6/2/16 5:05 PM, tsbockman wrote:

On Thursday, 2 June 2016 at 20:56:26 UTC, Walter Bright wrote:

What is supposed to be done with "do not merge" PRs other than close
them?


Occasionally people need to try something on the auto tester (not sure
if that's relevant to that particular PR, though). Presumably if someone
marks their own PR as "do not merge", it means they're planning to
either close it themselves after it has served its purpose, or they plan
to fix/finish it and then remove the "do not merge" label.


Feel free to reopen if it helps, it wasn't closed in anger. -- Andrei



Re: The Case Against Autodecode

2016-06-02 Thread default0 via Digitalmars-d

On Thursday, 2 June 2016 at 20:52:29 UTC, ag0aep6g wrote:

On 06/02/2016 10:36 PM, Andrei Alexandrescu wrote:
By whom? The "support level 1" folks yonder at the Unicode 
standard? :o)

-- Andrei


Do they say that level 1 should be the default, and do they 
give a rationale for that? Would you kindly link or quote that?


The Level 2 support description noted that it should be opt-in 
because it's slow.
Arguably it should be easier to operate on code units if you know 
it's safe to do so, but either always working on code units or 
always working on graphemes as the default seems to end up either 
broken too often or slow too often.


Now one can argue either for consistency with code units (because then 
we can treat char[] and friends as a slice) or for correctness with 
graphemes, but really, the more I think about it, the more I think 
there is no good default and you need to learn Unicode anyway. 
The only sad parts here are that 1) we hijacked an array type for 
strings, which sucks, and 2) we don't have an API that is 
actually good at teaching the user what it does and doesn't do.


The consequence of 1 is that generic code that also wants to deal 
with strings will want to special-case to get rid of 
auto-decoding; the consequence of 2 is that we will have tons of 
not-actually-correct string handling.
I would assume that almost all string handling code out in the 
wild is broken anyway (in code I have encountered, I have never 
seen attempts to normalize or do other things before or after 
comparisons, searching, etc.), unless of course YOU or one of 
your colleagues wrote it (consider that checking the length of a 
string in Java or C# to validate that it is no longer than X 
characters is commonly done and wrong, because .Length is the 
number of UTF-16 code units in those languages) :o)
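
The same pitfall exists in D, where .length counts code units of the 
particular encoding rather than characters; a minimal sketch:

void main()
{
    string  u8  = "\u00E9";   // 'é' in UTF-8
    wstring u16 = "\u00E9"w;  // 'é' in UTF-16
    dstring u32 = "\u00E9"d;  // 'é' in UTF-32

    assert(u8.length  == 2);  // two UTF-8 code units
    assert(u16.length == 1);  // one UTF-16 code unit
    assert(u32.length == 1);  // one code point
}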


So really, as bad and alarming as "incorrect string handling" by 
default seems, in practice it has not prevented people from 
writing working (internationalized!) applications in other 
languages that are used far more than D.
One could say we should do better than them, but I would be 
inclined to believe that RCStr provides our opportunity to do so. 
Having char[] be what it is is an annoying wart, and maybe at 
some point we can deprecate/remove that behaviour, but for now I'd 
rather see if RCStr is viable than attempt to change the semantics 
of all string handling code in D.


Re: The Case Against Autodecode

2016-06-02 Thread tsbockman via Digitalmars-d

On Thursday, 2 June 2016 at 20:56:26 UTC, Walter Bright wrote:
What is supposed to be done with "do not merge" PRs other than 
close them?


Occasionally people need to try something on the auto tester (not 
sure if that's relevant to that particular PR, though). 
Presumably if someone marks their own PR as "do not merge", it 
means they're planning to either close it themselves after it has 
served its purpose, or they plan to fix/finish it and then remove 
the "do not merge" label.


Either way, they shouldn't be closed just because they say "do 
not merge" (unless they're abandoned or something, obviously).


Re: The Case Against Autodecode

2016-06-02 Thread Andrei Alexandrescu via Digitalmars-d

On 6/2/16 5:01 PM, ag0aep6g wrote:

On 06/02/2016 10:50 PM, Andrei Alexandrescu wrote:

It does not fall apart for code points.


Yes it does. You've been given plenty examples where it falls apart.


There weren't any.


Your answer to that was that it operates on code points, not graphemes.


That is correct.


Well, duh. Comparing UTF-8 code units against each other works, too.
That's not an argument for doing that by default.


Nope, that's a radically different matter. As the examples show, the 
examples would be entirely meaningless at code unit level.



Andrei



Re: The Case Against Autodecode

2016-06-02 Thread ag0aep6g via Digitalmars-d

On 06/02/2016 10:50 PM, Andrei Alexandrescu wrote:

It does not fall apart for code points.


Yes it does. You've been given plenty examples where it falls apart. 
Your answer to that was that it operates on code points, not graphemes. 
Well, duh. Comparing UTF-8 code units against each other works, too. 
That's not an argument for doing that by default.


Re: The Case Against Autodecode

2016-06-02 Thread tsbockman via Digitalmars-d
On Thursday, 2 June 2016 at 20:49:52 UTC, Andrei Alexandrescu 
wrote:

On 06/02/2016 04:47 PM, tsbockman wrote:
That doesn't sound like much of an endorsement for defaulting 
to only
level 1 support to me - "it does not handle more complex 
languages or

extensions to the Unicode Standard very well".


Code point/Level 1 support sounds like a sweet spot between 
efficiency/complexity and conviviality. Level 2 is opt-in with 
byGrapheme. -- Andrei
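
A small sketch of that opt-in, reusing the combining-arrow example from 
earlier in the thread:

import std.range.primitives : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "a\u20D7";                  // 'a' followed by a combining arrow

    assert(s.walkLength == 2);             // decoded by default: two code points
    assert(s.byGrapheme.walkLength == 1);  // opt-in byGrapheme: one grapheme
}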


Actually, according to the document Walter Bright linked, level 1 
does NOT operate at the code point level:


Level 1: Basic Unicode Support. At this level, the regular 
expression engine provides support for Unicode characters as 
basic 16-bit logical units. (This is independent of the actual 
serialization of Unicode as UTF-8, UTF-16BE, UTF-16LE, or 
UTF-32.)

...
Level 1 support works well in many circumstances. However, it 
does not handle more complex languages or extensions to the 
Unicode Standard very well. Particularly important cases are 
**surrogates** ...


So, level 1 appears to be UTF-16 code units, not code points. To 
do code points it would have to recognize surrogates, which are 
specifically mentioned as not supported.
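
The distinction shows up exactly in the surrogate case; a minimal sketch 
with a code point outside the BMP:

import std.range.primitives : walkLength;

void main()
{
    wstring s = "\U0001F34E"w;  // one code point outside the BMP

    assert(s.length == 2);      // two UTF-16 code units (a surrogate pair)
    assert(s.walkLength == 1);  // one code point once decoded
}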


Level 2 skips straight to graphemes, and there is no code point 
level.


However, this document is very old - from Unicode 3.0 and the 
year 2000:


While there are no surrogate characters in Unicode 3.0 (outside 
of private use characters), future versions of Unicode will 
contain them...


Perhaps level 1 has since been redefined?


Re: The Case Against Autodecode

2016-06-02 Thread Jack Stouffer via Digitalmars-d

On Thursday, 2 June 2016 at 20:56:26 UTC, Walter Bright wrote:
What is supposed to be done with "do not merge" PRs other than 
close them?


Experimentally iterate until something workable comes about. This 
way it's done publicly and people can collaborate.


Re: The Case Against Autodecode

2016-06-02 Thread Walter Bright via Digitalmars-d

On 6/2/2016 1:46 PM, Adam D. Ruppe wrote:

The compiler can help you with that. That's the point of the do not merge PR: it
got an actionable list out of the compiler and proved the way forward was 
viable.


What is supposed to be done with "do not merge" PRs other than close them?


Re: The Case Against Autodecode

2016-06-02 Thread Andrei Alexandrescu via Digitalmars-d

On 06/02/2016 04:52 PM, ag0aep6g wrote:

On 06/02/2016 10:36 PM, Andrei Alexandrescu wrote:

By whom? The "support level 1" folks yonder at the Unicode standard? :o)
-- Andrei


Do they say that level 1 should be the default, and do they give a
rationale for that? Would you kindly link or quote that?


No, but that sounds agreeable to me, especially since it breaks no code 
of ours.


We really should document this better. Kudos to Walter for finding all 
that Level 1 support.



Andrei

