I think we’re a long way from handling Hiragana, Katakana or Kanji.  Probably 
the same for Tamil or Telugu, although I know nothing about them.

So why do we scan strings?  We do it when tokenizing, but all our tokens are 
Roman (if not outright ASCII), so that should be OK.
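
For what it’s worth, the property that makes byte-wise tokenizing safe is that 
every byte of a UTF-8 multi-byte sequence has the high bit set, so it can never 
collide with an ASCII token byte.  A minimal sketch (hypothetical helper name, 
not our actual tokenizer):

    #include <cassert>
    #include <string>

    // Hypothetical helper: true iff 'b' is a plain ASCII byte.  Every
    // byte of a UTF-8 multi-byte sequence is >= 0x80, so an ASCII
    // token byte can never occur inside a multi-byte character.
    static bool IsAsciiByte( unsigned char b ) { return b < 0x80; }

    int main()
    {
        // "(module é)" -- the é is the UTF-8 bytes 0xC3 0xA9.
        std::string line = "(module \xC3\xA9)";

        // A byte-wise search for an ASCII token can't false-match
        // inside the multi-byte character.
        assert( line.find( "module" ) == 1 );
        assert( !IsAsciiByte( 0xC3 ) && !IsAsciiByte( 0xA9 ) );
    }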

We also do it when looking for numbers to increment.  We’d like this to work 
for other languages, but as long as their continuation bytes don’t look like 
Roman digits I think we’re OK.  (In UTF-8, continuation bytes are all above 
ASCII, right?)
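
Concretely, I mean something like this (just a sketch with made-up names, not 
the actual annotation code): isdigit() only matches the bytes 0x30–0x39, and 
UTF-8 continuation bytes are all 0x80–0xBF, so they can never be mistaken for 
digits.

    #include <cctype>
    #include <iostream>
    #include <string>

    // Increment a trailing ASCII number, e.g. "R9" -> "R10".  Safe on
    // UTF-8 input because no byte of a multi-byte sequence can pass
    // the isdigit() test.
    static std::string IncrementSuffix( std::string aText )
    {
        size_t end = aText.size();
        size_t start = end;

        while( start > 0 && isdigit( (unsigned char) aText[start - 1] ) )
            --start;

        if( start == end )      // no trailing number to increment
            return aText;

        long num = std::stol( aText.substr( start ) );
        return aText.substr( 0, start ) + std::to_string( num + 1 );
    }

    int main()
    {
        std::cout << IncrementSuffix( "R9" ) << "\n";             // R10
        std::cout << IncrementSuffix( "C\xC3\xA9" "42" ) << "\n"; // Cé43
    }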

We do some case conversions when doing compares.  But again, as long as 
continuation bytes don’t look like ASCII we should be OK.  I assume 
capitalization algorithms don’t try to operate on Romaji or other 
non-ASCII-coded Roman characters?
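
What I have in mind is ASCII-only case folding, which is a no-op on every byte 
above 0x7F and therefore can’t corrupt UTF-8.  A sketch (hypothetical helper 
names):

    #include <cassert>
    #include <string>

    // ASCII-only tolower: leaves every byte >= 0x80 alone, so UTF-8
    // multi-byte sequences pass through untouched.
    static unsigned char AsciiLower( unsigned char b )
    {
        return ( b >= 'A' && b <= 'Z' ) ? b + 0x20 : b;
    }

    static bool EqualIgnoreAsciiCase( const std::string& a, const std::string& b )
    {
        if( a.size() != b.size() )
            return false;

        for( size_t i = 0; i < a.size(); ++i )
            if( AsciiLower( a[i] ) != AsciiLower( b[i] ) )
                return false;

        return true;
    }

    int main()
    {
        assert( EqualIgnoreAsciiCase( "GND", "gnd" ) );
        // The bytes of "é" (0xC3 0xA9) are untouched by the fold.
        assert( EqualIgnoreAsciiCase( "NET\xC3\xA9", "net\xC3\xA9" ) );
    }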

When else do we scan strings?

> On 30 Apr 2019, at 21:35, John Beard <john.j.be...@gmail.com> wrote:
> 
> On 30/04/2019 18:19, Jeff Young wrote:
>> I was referring to UCS-2 or UCS-4.  I’m evidently behind the times, though, 
>> because I now see that UTF-32 and UCS-4 are equivalent.
>> (Which means that both some of John’s original premises and my quote in teal 
>> below were wrong: UTF-32 is indeed a one-to-one map between code points and 
>> chars.)
> 
> Kind of, depending on the definition of character. As long as you never get 
> any multi-code point "characters".
> 
>> So my proposal (in 2019) should be std::u32string (using UTF-32 encoding, for 
>> which myString[3] still works).
> 
> By "works", what do you mean? Sure you can index into a UTF-32 string and 
> come up with a valid (whole) code point (and a valid code unit). But that 
> doesn't mean a lot: it could be the "ᄀ" (\u1100) from 가, which in decomposed 
> (NFD) form is actually 2 code points.
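> 
> For example (a minimal sketch, using the decomposed NFD form):
> 
>     #include <cassert>
>     #include <string>
> 
>     int main()
>     {
>         // 가 decomposed: U+1100 (ᄀ) + U+1161 (ᅡ).
>         std::u32string ga = U"\u1100\u1161";
> 
>         // Indexing "works" -- you always get back a whole code
>         // point -- but ga[0] is only half the user-perceived
>         // character.
>         assert( ga.size() == 2 );
>         assert( ga[0] == U'\u1100' );
>     }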
> 
> How often do we actually index into a string buffer by code point anyway, 
> without iterating the string to find something first? What does that even 
> mean in the context of a Unicode string?
> 
> Graphemes are not a strange and ignorable edge case: emojis may sound silly, 
> but lots of actual languages use grapheme clusters perfectly casually (Tamil, 
> Telugu[1], Hangul as above, etc.). You either support Unicode or you don't, 
> you cannot pick and choose what is "reasonable" to support.
> 
> BTW, UTF-8 does allow you to index into it by byte and see if you're on a 
> code point boundary (if the byte starts 0b10xxxxxx, you are not). You can't 
> index to the n'th code point (but for what purpose?) and you still can't 
> index to the n'th grapheme, but you can't do that in *any* encoding.
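> 
> i.e. something like (a quick sketch):
> 
>     #include <string>
> 
>     // A byte is a code point boundary unless it is a continuation
>     // byte (0b10xxxxxx).
>     static bool IsBoundary( unsigned char b )
>     {
>         return ( b & 0xC0 ) != 0x80;
>     }
> 
>     // Advance from 'i' to the start of the next code point.
>     static size_t NextCodePoint( const std::string& s, size_t i )
>     {
>         ++i;
>         while( i < s.size() && !IsBoundary( (unsigned char) s[i] ) )
>             ++i;
>         return i;
>     }
> 
>     int main()
>     {
>         std::string s = "a\xC3\xA9" "b";    // "aéb"
>         size_t i = NextCodePoint( s, 0 );   // 1: start of é
>         i = NextCodePoint( s, i );          // 3: start of b
>         return i == 3 ? 0 : 1;
>     }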
> 
>> Better?
> 
> As long as we save our files as UTF-8, I don't really mind what we use 
> internally. But if you actually plan to manipulate strings that could be 
> Unicode and it comes from a user, you cannot do it only by code point, 
> regardless of representation.
> 
> Cheers,
> 
> John
> 
> [1]: Mishandling of Telugu produced the iPhone SMS of Death bug.

