Yep, agree.

1. Do we all buy the story that Strings are conceptually used for text?

BK> Yes, and it's important that the standard lib uses a single string
type throughout.

2. Does the following set of rules for strings make sense? If not, why not?

   - Strings are normalized via NFC
   - String operations preserve NFC encoding

BK> Not sure you can take this for granted. Consider UTF-8 web sites and
XML messages: do you really want to parse and rearrange all that data to
ensure NFC compliance?

I would say it's desirable but not guaranteed.


   - Strings are encoded in UTF-8
   - Strings are indexed *by the byte*


BK> I have always been of the opinion that strings should be UTF-8 and NOT
user-indexable, where by indexable I mean string[n]. Indexing sends the
wrong message: that the string is a mutable array.

Strings may return an arbitrary offset index (probably better as a custom
stringOffset type, so a user can't manipulate it). This offset is likely to
be a byte offset, but that is not the user's concern; it should just be fed
back into string functions as an offset. If you want to index yourself, be
explicit: str.GetCharArray(). See the sketch below.
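
A minimal Go sketch of the opaque-offset idea (StringOffset, IndexOf, and
SliceFrom are made-up names for illustration; the point is only that the
byte index is unexported, so callers can feed offsets back in but never
fabricate or do arithmetic on them):

    package text

    import "strings"

    // StringOffset is an opaque position within a string. The byte index
    // is deliberately unexported so users cannot manipulate it.
    type StringOffset struct {
        byteIndex int
    }

    // IndexOf returns an opaque offset to the first occurrence of sub,
    // plus whether it was found.
    func IndexOf(s, sub string) (StringOffset, bool) {
        i := strings.Index(s, sub)
        if i < 0 {
            return StringOffset{}, false
        }
        return StringOffset{byteIndex: i}, true
    }

    // SliceFrom returns the suffix of s starting at off; only the library
    // ever touches the raw byte index.
    func SliceFrom(s string, off StringOffset) string {
        return s[off.byteIndex:]
    }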

Note that when you want to work with indexing, e.g. high-performance
stdio, you often don't want immutable strings anyway but mutable char
arrays. In that case, unless you have a separate Unicode lib for working on
mutable arrays, you're either in a world of hurt (manual i18n on code
points, possibly having to handle non-Unicode encodings like Big5),
ignoring i18n and assuming English character behavior, or still using
ASCII. So the i18n lib which strings use should have its features exposed
for custom mutable-array operations; see the sketch below.
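
For a feel of what that looks like, Go's golang.org/x/text/unicode/norm
package already exposes normalization over raw byte slices (a usage
sketch, not a proposal for the exact API):

    package main

    import (
        "fmt"

        "golang.org/x/text/unicode/norm"
    )

    func main() {
        // "e" + combining acute accent (the NFD form) in a mutable buffer.
        buf := []byte("e\u0301")
        // Normalize the raw bytes to NFC without constructing a string type.
        nfc := norm.NFC.Bytes(buf)
        fmt.Printf("%d bytes -> %d bytes: %q\n", len(buf), len(nfc), nfc) // 3 -> 2
    }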

In terms of performance and compatibility with old algorithms and
benchmarks, ASCII is still important. For this reason it's important to
know that UTF-8 is a superset of ASCII. This allows str.GetCharArray()
simply to copy the underlying byte array, or better yet hand out a
read-only slice, without running it through an encoding converter. So keep
a bit somewhere that marks a string as ASCII, and have the compiler set it
for constants etc. (a UTF-8 string is pure ASCII iff no byte has the high
bit, 0x80, set, which can be determined cheaply via SIMD); maybe
normalization status should be a second bit. In fact, even if the bit is
not set, I bet scanning the string to check whether it is all ASCII, and
taking a from-ASCII fast path instead of the from-UTF-8 re-encoder, will
be more efficient in the vast majority of cases. A scalar sketch of the
check follows.
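
A scalar sketch of that all-ASCII check in Go (a real implementation
would use SIMD to test 16 or more bytes per instruction, but the
predicate is the same):

    // isASCII reports whether b is pure ASCII. Since UTF-8 is a superset
    // of ASCII, this holds iff no byte has the high bit (0x80) set.
    func isASCII(b []byte) bool {
        for _, c := range b {
            if c >= 0x80 { // high bit set => part of a multi-byte sequence
                return false
            }
        }
        return true
    }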

Ben




On Thu, Oct 16, 2014 at 4:21 AM, Jonathan S. Shapiro <[email protected]>
wrote:

> In the middle of a billion other things, I've been digging into Unicode a
> bit. In the process, I happened across this blog entry by Armin Ronacher:
>
> Everything You Did Not Want to Know about Unicode in Python 3
> <http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/>
>
>
> It's a nice gathering of a bunch of annoying issues. I've sort of been
> trundling along on the strength of Mark Miller's recommendation to just use
> the NFC normalization without really digging in to that. Yesterday, I had a
> chat session with Tim Čas, who pointed me at something I hadn't known. I
> had assumed that all of the combining marks in Unicode had corresponding
> combined forms. That turns out not to be true, and Tim pointed me at
> examples.
>
> This is a very nasty thing to learn, because it means one cannot assume that
> UCS4 code points are self-contained. It also means that there is absolutely
> no hope of correctly processing *text* by viewing it as a sequence of
> singleton code points.
>
> And just to hammer the point home, neither NFC nor NFD are closed under
> concatenation; there are cases where you need to rearrange marks to
> re-establish normalization. Note this has some nasty implications. If s =
> s1 + s2 (by normalizing concatenation) then in NFC:
>
> |s| <= |s1| + |s2|  // as opposed to the expected simple sum
> s1[|s1|-1] and/or s2[0] will not necessarily appear anywhere within s.
>
>
> and in NFD:
>
> |s| == |s1| + |s2|
> All code points are retained
> Some code points around the point of concatenation may get reordered to
> preserve normalization
>
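> A quick Go sketch of the NFC case, using the golang.org/x/text/unicode/norm
> package (each piece is individually NFC, yet normalized concatenation
> fuses them):
>
>     package main
>
>     import (
>         "fmt"
>
>         "golang.org/x/text/unicode/norm"
>     )
>
>     func main() {
>         s1, s2 := "a", "\u0301" // "a" and a bare combining acute accent
>         s := norm.NFC.String(s1 + s2)
>         // s is "\u00e1" ("á"): a single code point, so |s| < |s1| + |s2|,
>         // and neither s1's last code point nor s2's first appears in s.
>         fmt.Printf("%q + %q -> %q\n", s1, s2, s)
>     }
>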
>
> OK. So back to my conversation with Tim, who pointed out that the absence
> of fully combined forms means that the entire concept of "a codepoint is
> kind of like a character" is broken. If that's true, *there may be no
> point to viewing a string as an indexable sequence of code points*. In
> fact, that may actually encourage programmers to be doing the wrong thing.
> Tim is working on an interpreter where he's contemplating defining strings
> as indexed over bytes, which seems to be what Go is doing. Basically he's
> saying that code points are already a bad abstraction, and layering another
> bad abstraction on top of things doesn't improve matters.
>
> Meanwhile, Armin's blog post (above) notes that there are a whole lot of
> circumstances where you are handed a bit of text without knowing its
> encoding. If you're just passing it through, the right way to think about
> things is that it's blind byte data. If you actually need to treat it as
> text in some way, then you need to know, or to be told, or to guess
> heuristically what the encoding is so that you can transform that text into
> an encoding that is common with everything else.
>
> And actually, that makes for an important side point: whatever encoding
> you choose, you really want all of your strings to have a common encoding.
>
> All of which brings me to the question: what, conceptually, is a string?
> I'm coming to the view that a string is more than a type wrapper on a
> codepoint vector. Putting something into a string means you think it's
> text, which means you should know what its encoding and normalization
> properties are.
>
> And if it turns out that indexing by codepoint (which is initially O(n))
> isn't really convenient, then the question has to be asked: why bother?
> Hiding the indexing cursor doesn't seem very helpful. If the encoding used
> in strings is UTF-8 or UTF-16 you'll have to deal with variable length
> encodings in any case; hard to see a real good reason to accept O(n)
> indexing *in addition* to that overhead.
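>
> For concreteness, byte indexing is what Go already gives you (a tiny
> sketch; s[i] yields a raw byte, and range walks runes at byte offsets):
>
>     package main
>
>     import "fmt"
>
>     func main() {
>         s := "naïve"
>         fmt.Println(s[2]) // 195 (0xc3): first byte of "ï", not a character
>         for i, r := range s {
>             fmt.Printf("%d:%c ", i, r) // 0:n 1:a 2:ï 4:v 5:e
>         }
>     }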
>
> So let me pause here and solicit input.
>
> 1. Do we all buy the story that Strings are conceptually used for text?
> 2. Does the following set of rules for strings make sense? If not, why not?
>
>    - Strings are normalized via NFC
>    - String operations preserve NFC encoding
>    - Strings are encoded in UTF-8
>    - Strings are indexed *by the byte*
>
>
> Look forward to hearing reactions and thoughts here.
>
>
> Jonathan
>
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
