On Friday, 7 March 2014 at 22:27:35 UTC, H. S. Teoh wrote:
On Fri, Mar 07, 2014 at 09:58:39PM +0000, Vladimir Panteleev wrote:
On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
>On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
>>No, it doesn't.
>>
>>import std.algorithm;
>>
>>void main()
>>{
>>   auto s = "cassé";
>>   assert(s.canFind('é'));
>>}
>>
>
>Hm, I'm not following? Works perfectly fine on my system?

Probably because your browser is normalizing the unicode string when you
copy-n-paste Vladimir's message? See below:


Something's messing with your Unicode. Try downloading and compiling
this file:
http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d

I downloaded the file and looked at it through `od -ctx1`: the first é is encoded as the byte sequence 65 cc 81, that is, [U+0065, U+0301] (small letter e + combining diacritic acute accent), whereas the second é is encoded as c3 a9, that is, U+00E9 (precomposed small letter e with acute accent).

This illustrates one of my objections to Andrei's post: by auto-decoding behind the user's back and hiding the intricacies of Unicode from him, it has masked the fact that codepoint-for-codepoint comparison of a Unicode string is not guaranteed to always return the correct results, due to the possibility of non-normalized strings.

Basically, to have correct behaviour in all cases, the user must be aware of, and use, the Unicode collation / normalization algorithms prescribed by the Unicode standard. What we have in std.algorithm right now is an incomplete implementation with non-working edge cases (like Vladimir's example) that has poor performance to start with. Its only redeeming factor is that the auto-decoding hack has given it the illusion of being correct, when actually it's not correct according to the Unicode standard. I don't see how this is necessarily superior to Walter's proposal.


T
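
To make the problem concrete, here's a small sketch of my own (not from Teoh's post) using std.uni.normalize: the two encodings of é compare unequal code point for code point until both sides are normalized.

import std.algorithm : canFind, equal;
import std.uni : NFC, normalize;

void main()
{
    string composed   = "cass\u00E9";  // é stored as c3 a9 (precomposed U+00E9)
    string decomposed = "casse\u0301"; // é stored as 65 cc 81 (e + combining U+0301)

    // Auto-decoding compares code point for code point, so these differ:
    assert(!equal(composed, decomposed));
    assert(!decomposed.canFind('\u00E9'));

    // Normalizing both sides to NFC reconciles them:
    assert(equal(normalize!NFC(composed), normalize!NFC(decomposed)));
    assert(normalize!NFC(decomposed).canFind('\u00E9'));
}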

To me, the status quo feels like an OK compromise between performance and correctness. Everyone is pointing out that working at the code point level is bad because it's not correct; but working at the code unit level, as Walter proposes, is correct even less often, so that's not really an argument for moving to it. It is, however, an argument for forcing the user to decide what level of correctness and performance they need.

Walter's idea (code unit level) would be fastest but least correct.
The current behaviour (code point level) is somewhat fast and somewhat correct.
The next level, graphemes, would be slowest of all but most correct.
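
For example (my own illustration, not from the discussion above; byGrapheme lives in std.uni in newer Phobos), the same string measures differently at each level:

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "casse\u0301"; // "cassé" with a decomposed é

    // Code units: 5 ASCII bytes plus 2 bytes for U+0301.
    assert(s.length == 7);

    // Code points: auto-decoding sees 'e' and U+0301 separately.
    assert(s.walkLength == 6);

    // Graphemes: 'e' + U+0301 form a single cluster.
    assert(s.byGrapheme.walkLength == 5);
}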

It seems there is just no way to avoid the tradeoff between speed and correctness, so we shouldn't try to avoid it; we should only try to force the user to make a decision.

Maybe some more string types are in order (hrm). Ordered from fastest to most correct:

 string, wstring (code units)
 dstring         (code points)
+gstring         (graphemes)

(does grapheme-level comparison fully handle normalization? If not, we'd probably need another level, say, nstring)

Then if a user needs correctness over performance, they just work with gstrings. If they need performance over correctness, they work with strings (assuming some of Walter's idea happens; otherwise they'd work with string.representation).
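
Until such types exist, here's roughly what picking a level looks like (a sketch of my own under today's Phobos; representation is std.string.representation):

import std.algorithm : canFind;
import std.string : representation;
import std.uni : NFC, normalize;

void main()
{
    string s = "casse\u0301"; // non-normalized "cassé"

    // Performance over correctness: scan raw code units (immutable(ubyte)[]),
    // no auto-decoding, but only byte-exact matches are found.
    assert(s.representation.canFind(cast(ubyte) 's'));

    // Correctness over performance: normalize first, then search code points.
    assert(normalize!NFC(s).canFind('\u00E9'));
}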
