Re: Unicode Normalization (and graphemes and locales)

Steven Schveighoffer via Digitalmars-d Fri, 03 Jun 2016 04:41:59 -0700

On 6/3/16 2:24 AM, Jonathan M Davis via Digitalmars-d wrote:

On Thursday, June 02, 2016 17:14:13 Walter Bright via Digitalmars-d wrote:

On 6/2/2016 4:29 PM, Jonathan M Davis via Digitalmars-d wrote:
 > How do you suggest that we handle the normalization issue? Should we just
 > assume NFC like std.uni.normalize does and provide an optional template
 > argument to indicate a different normalization (like normalize does)? Since
 > without providing a way to deal with the normalization, we're not actually
 > making the code fully correct, just faster.


The short answer is, we don't.


I generally agree. The main problem that I was concerned about was the
cases like find where we're talking about encoding the needle to match the
haystack so that we can compare with code units, and I was thinking that
we'd be forced to pick a normalization scheme with that, and if that didn't
match the normalization of the haystack, we'd be in trouble (hence the
concern about being able to specify a normalization scheme). However,
thinking about it further, that's not actually a problem. If the needle is a
dchar, then code point normalization isn't an issue, because it's only ever
one code point, and if the needle uses a different encoding (e.g. UTF-16
instead of UTF-8), and we re-encode it with the encoding of the haystack,
that doesn't change the normalization of the needle. Even if the code units
have changed, the code points that they represent are the same. So, it
doesn't even potentially make sense to try to do anything with the
normalization when re-encoding the needle.
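
To illustrate the re-encoding point, here is a minimal sketch in Python rather than D (chosen only because its standard unicodedata module makes the check short; the variable names are made up for this example). It shows that changing the encoding changes the code units but not the code points, and that only an explicit normalization call alters the code points:

```python
import unicodedata

# Decomposed needle: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
needle = "e\u0301"

# Re-encode the same code points as UTF-8 and as UTF-16 (little endian).
utf8_units = needle.encode("utf-8")
utf16_units = needle.encode("utf-16-le")

# The code units differ between the two encodings...
assert utf8_units != utf16_units

# ...but both decode back to the identical sequence of code points.
from_utf8 = utf8_units.decode("utf-8")
from_utf16 = utf16_units.decode("utf-16-le")
assert from_utf8 == from_utf16 == needle
assert [ord(c) for c in from_utf16] == [0x65, 0x301]

# Only an explicit normalization step changes the code points.
composed = unicodedata.normalize("NFC", needle)
assert [ord(c) for c in composed] == [0xE9]
```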


But consider the case where you are searching the string: "cassé"

for the letter 'e'. If é is encoded as 'e' + U+0301, then you will succeed when you should fail! However, it may be that you actually want to find specifically any code points with 'e', including ones with combining characters. This is why we really need more discretion from Phobos, and less hand-holding.
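
A minimal sketch of the "cassé" case, again in Python rather than D. The helper contains_code_point is a hypothetical name for this example, standing in for any search that compares code points without looking at what follows the match:

```python
import unicodedata

composed = "cass\u00e9"      # é stored as the single code point U+00E9
decomposed = "casse\u0301"   # é stored as 'e' followed by U+0301

def contains_code_point(haystack: str, needle: str) -> bool:
    """Naive search: compares code points only, ignores grapheme boundaries."""
    return needle in haystack

# Same text to a human reader, different answers from the naive search.
found_in_composed = contains_code_point(composed, "e")      # False
found_in_decomposed = contains_code_point(decomposed, "e")  # True

# Normalizing the haystack to NFC first makes the two agree, but only
# because NFC happens to have a precomposed é; it is not a general fix.
found_after_nfc = contains_code_point(
    unicodedata.normalize("NFC", decomposed), "e"
)  # False
```

Whether True or False is the "right" answer for the decomposed string depends on what the caller wants, which is the argument for leaving the choice to the caller.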

There are certainly searches that will be correct. For example, searching for newline should always work in code-point space. Actually, what happens when you use a combining character on newline? Is it an invalid Unicode sequence? Does it matter? :)
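
On the newline question, a quick check in Python (not D) suggests that it does not matter for a code-point search. As far as I know, a combining mark after a newline is still a well-formed sequence: it encodes and decodes cleanly, and normalization leaves it alone because there is no precomposed form to fold it into:

```python
import unicodedata

# Newline followed by U+0301 COMBINING ACUTE ACCENT.
text = "\n\u0301"

# The sequence round-trips through UTF-8 without error.
round_tripped = text.encode("utf-8").decode("utf-8")

# NFC has nothing to compose here, so the code points are unchanged.
normalized = unicodedata.normalize("NFC", text)

# A code-point search for the newline still finds it.
newline_found = "\n" in normalized
```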

A nice function to determine whether code points or graphemes are required for comparison given a needle may be useful.
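
One possible shape for such a function, sketched in Python rather than D. The name needs_grapheme_comparison and the rules it applies are assumptions made for this example, not anything in Phobos, and it is only a rough heuristic: it looks at general categories and ignores the finer grapheme rules (the CR LF pair, ZWJ sequences, Hangul jamo and so on).

```python
import unicodedata

def needs_grapheme_comparison(needle: str) -> bool:
    """Hypothetical heuristic: could a code-point match of `needle`
    land inside a larger grapheme in some haystack?"""
    if not needle:
        return False
    first, last = needle[0], needle[-1]
    # A needle that starts with a combining mark attaches to whatever
    # precedes it in the haystack.
    if unicodedata.category(first).startswith("M"):
        return True
    # Control characters and line/paragraph separators end a grapheme,
    # so a following combining mark does not attach to them.
    if unicodedata.category(last) in ("Cc", "Zl", "Zp"):
        return False
    # Anything else might be followed by a combining mark in the haystack.
    return True

# Newline: a plain code-point search is enough under this heuristic.
newline_needs_graphemes = needs_grapheme_comparison("\n")   # False
# The letter 'e': it may be the base of a decomposed é, so it is not.
letter_needs_graphemes = needs_grapheme_comparison("e")     # True
```

A search routine could call something like this once per needle and use the fast code-unit path only when it returns False.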


-Steve
