On 26.10.2011 16:12, Steven Schveighoffer wrote:
<snip>

I agree the string type needs to be fleshed out before the language
adopts it. I think we could legitimately define a string type that
auto-assigns from its respective char[] array type. Once it's shown
that the type is nearly as fast as a char[] array, it can be used as the
default type for string literals (the only real place where the language
gets in the way).


That's it.

D-language: Here, use this search algorithm, it works most of the time,
but may not work correctly in some cases. If you run into one of those
cases, you'll have to run a specialized search algorithm for strings.
User: How do I know I hit one of those cases?
D-language: You'll have to run the specialized version to find out.
User: Why wouldn't I just run the specialized version first?
D-language: Well, because it's slower!
User: But don't I have to use both algorithms to make sure I find the
data?
D-language: Only if you "care" about accuracy!

Call me ludicrous, but is this really what we want to push on someone as
a "unicode-aware" language?


No, but we might want to fix string search to do a little more -
namely, check whether it split a grapheme (assuming normalizations match).

That's a big assumption. Valid unicode is valid unicode, even if it's
not normalized.

BTW adding normalization to std.uni is another thing to do right now.
That should be a good enough start, and looking at the Unicode standard,
things are rather fluid there, meaning that a "unicode-aware" language
should be suffixed by a version number :)

I agree we need normalization, and it is not necessary to involve the
compiler in this. I'd suggest a two- to three-phase approach:

1. Leave Phobos' schizo view of "char arrays are not arrays" for now,
and build the necessary framework to get a string type that actually works.
2. Remove the schizo view.
3. (Optional) Make the compiler use the library-defined string type for
string literals.


Something like that. I may lend a hand here, as I plan to merge some unicode stuff between std.regex and std.uni. As extras, probably striding by grapheme and support for various degrees of case-insensitive string comparison.



Plus, a combining character (such as an umlaut or accent) is part of a
character, but may be a separate code point. If that's on the last
character in a word such as fiancé, then searching for fiance will
result in a match without proper decoding!
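
A quick runnable illustration of that mismatch (just a sketch; it only
relies on Phobos' canFind doing a plain substring search with no
normalization):

import std.algorithm.searching : canFind;
import std.stdio : writeln;

void main()
{
    // "fiancé" spelled with a combining acute accent (U+0301) after 'e'
    string decomposed = "fiance\u0301";
    writeln(canFind(decomposed, "fiance"));            // true  -- spurious match inside "fiancé"
    // searching the decomposed spelling in precomposed text
    writeln(canFind("fianc\u00E9", "fiance\u0301"));   // false -- same word, no match
}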

Now if you are going to do real characters... If source/needle are
normalized, you can still avoid lots of work by searching without
decoding. All you need to decode is one code point on each successful
match, to see if there is a modifier at the end of the matched portion.
It depends on how you want to match: a case-insensitive search will be
a lot more complicated. But anyway, it boils down to this (see the
sketch further below):
1) do an inexact search and get a likely match (false positives are OK,
false negatives are not) - no decoding here
2) once found, check it (or parts of it) with proper decoding

There are cultural subtleties that complicate these steps if you take
them into account, but it's doable.
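
For concreteness, a minimal sketch of that two-step search, assuming both
sides are normalized the same way and that a combining-mark predicate
(std.uni.isMark here) is available; a real implementation would be more
careful about grapheme boundaries:

import std.string : indexOf;
import std.uni : isMark;   // assumed combining-mark predicate
import std.utf : decode;

// Step 1: raw (undecoded) substring scan.
// Step 2: decode one code point after each candidate and reject the match
// if it would stop in front of a combining mark.
ptrdiff_t findChecked(string haystack, string needle)
{
    size_t from = 0;
    while (from < haystack.length)
    {
        auto hit = indexOf(haystack[from .. $], needle);
        if (hit < 0)
            return -1;
        size_t start = from + hit;
        size_t end = start + needle.length;
        if (end == haystack.length)
            return cast(ptrdiff_t) start;   // nothing follows the match
        size_t tmp = end;
        dchar next = decode(haystack, tmp); // the only decode per candidate
        if (!isMark(next))
            return cast(ptrdiff_t) start;   // did not cut into a grapheme
        from = start + 1;                   // false positive, keep scanning
    }
    return -1;
}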

I agree with you that simple searches using only byte (or dchar)
comparison do not work, and can be optimized based on several factors.
The easiest thing is to find a code unit sequence that only has one
valid form, then search for that without decoding. Then when found,
decode the characters around it. Or if that isn't possible, create all
the un-normalized forms for one grapheme (based on how likely it is to
occur), and search for one of those in the undecoded stream.
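
As a toy illustration of that second idea (the helper name and the
hard-coded needle list are invented; a real implementation would derive
the variants from Unicode decomposition data, likeliest form first):

import std.algorithm.searching : canFind;

// search the undecoded stream for any spelling of the same grapheme cluster
bool findAnyForm(string haystack)
{
    // precomposed and decomposed spellings of "fiancé"
    string[2] needles = ["fianc\u00E9", "fiance\u0301"];
    foreach (n; needles)
        if (canFind(haystack, n))
            return true;
    return false;
}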


Yes, it all revolves around finding all the variations of a substring.
If the user is willing to spend some time upfront, this could be done in
one pass, e.g. using the trick I employed in the new std.regex.
For a rough idea see the Shift-Or string search, which is an easy & fast
inexact search:
http://igm.univ-mlv.fr/~lecroq/string/node6.html#SECTION0060

There are even better variations of this stuff in the wild.
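
For reference, a bare-bones Shift-Or exact matcher over raw UTF-8 code
units, in the spirit of the linked description; a sketch only, limited to
patterns of at most 64 bytes (the bit-parallel word width):

import std.string : representation;

ptrdiff_t shiftOr(string text, string pattern)
{
    auto p = pattern.representation;   // work on ubyte[], no decoding at all
    auto t = text.representation;
    assert(p.length > 0 && p.length <= 64);

    // per-byte masks: bit i is cleared where pattern[i] == b
    ulong[256] mask;
    mask[] = ~0UL;
    foreach (i, b; p)
        mask[b] &= ~(1UL << i);

    ulong state = ~0UL;
    immutable matchBit = 1UL << (p.length - 1);
    foreach (i, b; t)
    {
        state = (state << 1) | mask[b];
        if ((state & matchBit) == 0)
            return cast(ptrdiff_t)(i + 1 - p.length);  // match start, in code units
    }
    return -1;
}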

This sounds good. I'll have a look at it when I have time.


This can all be built into a specialized string type. There are actually
some really interesting problems to solve in this space, I think. I've
created a basic string type that has languished in my unfinished pile of
stuff to do. I think it can be done in a way that is close to as
efficient as arrays for the most common operations (slicing, indexing,
etc.), but is *correct* before it is efficient. You should always be
able to drop into "array mode" and deal with the code-units.
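
As one possible shape for such a type (the name, fields, and normalization
tag are all invented for illustration; a sketch, not a proposal):

struct UniString
{
    enum Norm { unknown, nfc, nfd, nfkc, nfkd }

    private immutable(char)[] data;    // raw UTF-8 code units
    private Norm norm = Norm.unknown;  // meta info carried by the type

    this(immutable(char)[] s, Norm n = Norm.unknown) { data = s; norm = n; }

    // cheap array-like operations stay O(1) on code units
    size_t length() const { return data.length; }

    UniString opSlice(size_t a, size_t b) const
    {
        // a real implementation would verify that [a, b) does not split
        // a code-unit sequence or a grapheme
        return UniString(data[a .. b], norm);
    }

    // "array mode": the escape hatch down to the raw code units
    immutable(char)[] codeUnits() const { return data; }
}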


Here is where we probably differ: to me there is no simple, safe, fast
string type. Why? Because if you need anything "not pedantic and safe
above all", you work with an implicit data type of e.g. "a chunk of UTF-8
bytes that are normalized in NFC", ...

Specialized types which dictate normalization could be created. We do
not have to always assume the worst if we are optimizing.

But in general, when you read a UTF-8 file, it could have anything in
it. It may not even be normalized.


There should be a way of doing e.g.:
auto my_string = normalize(readText("file.txt"), Normalize.NFKC);
// my_string is typed as normalized UTF-8 or somesuch

Yes, that should keep the general, non-assuming case. Though it can still
do some clever things, since, as the W3C points out, >90% of text on the
web is already normalized in NFC (I can search for a proof link if hard
pressed).

I'd rather see them as a sealed array with a bunch of meta info, and
string functions that specialize on them in order to do some cool speed
hacks. Of course, drilling down to the raw data should be possible for
the user as well.

This is exactly what I am envisioning ;) Except I'd build the meta-info
into the type.


Yeah, I meant that the type carries this info.
Looks like we are more in agreement than I suspected ;)

Or if fiancé uses a
precomposed é, it won't match. So two valid representations of the
word
either match or they don't. It's just a complete mess without proper
unicode decoding.

It's a complete mess even with proper decoding ;)

Yes, all the more reason to solve the problem correctly so the hapless
unicode novice user doesn't have to!



The hapless novice unicode user doesn't stand a chance.
I'm serious. One needs to know a lot of these "basics" to e.g. know
why indexing by "true" character could be so slow, about encodings,
normalizations, etc. Yet we can save a metric ton of special-cased crap
from ever hitting his eyes.

I disagree. 99% of the time I use strings, I don't care about indexes. I
just want to deal with whole strings and substrings. I rarely use
arbitrary indexes, and when I do, I'm assuming ASCII data.

The hapless novice unicode user doesn't need to know unicode to do:

writefln("hello, world!");

Or to do:

if(find(str, "xyz")) {...}

Or to even use regex!


To print chars you don't need to know unicode, your OS needs to :) Though that wasn't string processing. However, find/match is indeed a good example. You are probably right: if we cover enough of the common operations, most people might not have to "look inside".


--
Dmitry Olshansky
