Re: Why the hell doesn't foreach decode strings

Steven Schveighoffer Mon, 24 Oct 2011 13:55:29 -0700

On Mon, 24 Oct 2011 16:18:57 -0400, Dmitry Olshansky<[email protected]> wrote:

On 24.10.2011 23:41, Steven Schveighoffer wrote:

On Mon, 24 Oct 2011 11:58:15 -0400, Simen Kjaeraas
<[email protected]> wrote:

On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer
<[email protected]> wrote:

On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright
<[email protected]> wrote:

On 10/22/2011 2:21 AM, Peter Alexander wrote:

Which operations do you believe would be less efficient?


All of the ones that don't require decoding, such as searching,
would be less efficient if decoding was done.


Searching that does not do decoding is fundamentally incorrect. That
is, if you want to find a substring in a string, you cannot just
compare chars.


Assuming both strings are valid UTF-8, you can. Continuation bytes can
never be confused with the first byte of a code point, and the first
byte always identifies how many continuation bytes there should be.
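
A minimal sketch of that property, written in Python rather than D purely for illustration. The helper names `find_code_units` and `is_continuation` are made up for this example, and both inputs are assumed to be valid UTF-8:

```python
def find_code_units(haystack: str, needle: str) -> int:
    """Return the byte offset of needle in haystack, comparing raw UTF-8 bytes only."""
    return haystack.encode("utf-8").find(needle.encode("utf-8"))


def is_continuation(byte: int) -> bool:
    """Continuation bytes always have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80


text = "na\u00efve caf\u00e9"          # both accented letters are precomposed
raw = text.encode("utf-8")
pos = find_code_units(text, "caf\u00e9")

# The match starts on a code point boundary, so the tail still decodes cleanly.
starts_on_boundary = not is_continuation(raw[pos])
tail = raw[pos:].decode("utf-8")
```

This only shows that a byte-level match cannot begin or end in the middle of a code point. It says nothing about whether the two strings use the same sequence of code points for the same visible text.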


As others have pointed out in the past to me (and I thought as you did
once), the same characters can be encoded in *different ways*. They must
be normalized to accurately compare.
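
A small illustration of the "different ways" problem, sketched in Python with the standard `unicodedata` module (the function name `same_text` is hypothetical):

```python
import unicodedata

precomposed = "fianc\u00e9"    # é as the single code point U+00E9
combining = "fiance\u0301"     # plain e followed by U+0301 COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    """Compare two strings after bringing both to the same normalization form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


raw_equal = precomposed == combining              # code point comparison
normalized_equal = same_text(precomposed, combining)
```

The two strings render identically, yet a plain comparison of code points (or code units) treats them as different until both are normalized to the same form.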

Assuming language support stays at the stage of "a code point is a character", it's totally expected to ignore modifiers and compare identically normalized UTF without decoding. Yes, it risks hitting certain issues.


Again, the "risk" is that it fails to achieve the goal you ask of it!

D-language: Here, use this search algorithm, it works most of the time, but may not work correctly in some cases. If you run into one of those cases, you'll have to run a specialized search algorithm for strings.

User: How do I know I hit one of those cases?
D-language: You'll have to run the specialized version to find out.
User: Why wouldn't I just run the specialized version first?
D-language: Well, because it's slower!
User: But don't I have to use both algorithms to make sure I find the data?
D-language: Only if you "care" about accuracy!

Call me ludicrous, but is this really what we want to push on someone as a "unicode-aware" language?

Plus, a combining character (such as an umlaut or accent) is part of a
character, but may be a separate code point. If that's on the last
character in the word such as fiancé, then searching for fiance will
result in a match without proper decoding!
Now if you are going to do real characters... If source/needle are normalized you can still avoid lots of work by searching without decoding. All you need to decode is one code point on each successful match, to see if there is a modifier at the end of the matched portion. But it depends on how you want to match; if it's a case-insensitive search it will be a lot more complicated, but anyway it boils down to this:
1) do an inexact search to get a likely match (false positives are OK, false negatives are not), no decoding here
2) once found, check it (or parts of it) with proper decoding
There are cultural subtleties that complicate these steps if you take them into account, but it's doable.
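
The two steps above could be sketched roughly as follows. This is Python rather than D, the function names are invented for the example, both inputs are assumed to be valid and identically normalized UTF-8, and "modifier" is approximated as any code point in a Unicode Mark category:

```python
import unicodedata


def decode_one(data: bytes, i: int) -> str:
    """Decode the single code point starting at byte offset i (valid UTF-8 assumed)."""
    lead = data[i]
    if lead < 0x80:
        n = 1
    elif lead >> 5 == 0b110:
        n = 2
    elif lead >> 4 == 0b1110:
        n = 3
    else:
        n = 4
    return data[i:i + n].decode("utf-8")


def find_whole_match(haystack: bytes, needle: bytes) -> int:
    """Step 1: byte search without decoding. Step 2: decode one code point after the hit."""
    start = 0
    while True:
        pos = haystack.find(needle, start)
        if pos < 0:
            return -1
        end = pos + len(needle)
        # Reject the hit if a combining mark attaches to its last character.
        if end < len(haystack) and unicodedata.category(decode_one(haystack, end)).startswith("M"):
            start = pos + 1
            continue
        return pos


haystack = "a fiance\u0301 and a fiance".encode("utf-8")
naive = haystack.find(b"fiance")               # hits the accented word
checked = find_whole_match(haystack, b"fiance")  # skips it
```

Only one code point is decoded per candidate match, so the common case stays close to a raw byte search. Case folding and the cultural subtleties mentioned above are not handled here.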

I agree that simple searches using only byte (or dchar) comparison do not work, and that a correct search can be optimized based on several factors. The easiest thing is to find a code unit sequence that only has one valid form, then search for that without decoding. Then when found, decode the characters around it. Or if that isn't possible, create all the un-normalized forms for one grapheme (based on how likely it is to occur), and search for one of those in the undecoded stream.
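
A rough sketch of the second idea, in Python for illustration. It is a deliberate simplification: only the NFC and NFD spellings of the needle are generated, not every possible un-normalized sequence, and the names `encoded_variants` and `find_any_form` are hypothetical:

```python
import unicodedata


def encoded_variants(needle: str) -> set:
    """UTF-8 encodings of the needle in its composed (NFC) and decomposed (NFD) forms."""
    return {unicodedata.normalize(form, needle).encode("utf-8") for form in ("NFC", "NFD")}


def find_any_form(haystack: bytes, needle: str) -> int:
    """Search the undecoded byte stream for any variant; return the earliest offset or -1."""
    hits = [pos for pos in (haystack.find(v) for v in encoded_variants(needle)) if pos >= 0]
    return min(hits) if hits else -1


composed_text = "my fianc\u00e9".encode("utf-8")
decomposed_text = "my fiance\u0301".encode("utf-8")

in_composed = find_any_form(composed_text, "fianc\u00e9")
in_decomposed = find_any_form(decomposed_text, "fianc\u00e9")
in_plain = find_any_form(b"my fiance", "fianc\u00e9")
```

The haystack is never decoded or normalized; all the extra work is done once on the (usually short) needle.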

This can all be built into a specialized string type. There are actually some really interesting problems to solve in this space I think. I've created a basic string type that has languished in my unfinished pile of stuff to do. I think it can be done in a way that is close to as efficient as arrays for the most common operations (slicing, indexing, etc.), but is *correct* before it is efficient. You should always be able to drop into "array mode" and deal with the code units.

Or if fiancé uses a
precomposed é, it won't match. So two valid representations of the word
either match or they don't. It's just a complete mess without proper
unicode decoding.


It's a complete mess even with proper decoding ;)

Yes, all the more reason to solve the problem correctly so the hapless unicode novice user doesn't have to!


-Steve
