On 01/17/2011 01:44 PM, Steven Schveighoffer wrote:
On Sun, 16 Jan 2011 13:06:16 -0500, Andrei Alexandrescu
<seewebsiteforem...@erdani.org> wrote:

On 1/15/11 9:25 PM, Jonathan M Davis wrote:
Considering that strings are already dealt with specially in order to
have an element type of dchar, I wouldn't think that it would be all
that disruptive to make it so that they had an element type of Grapheme
instead. Wouldn't that then fix all of std.algorithm and the like
without really disrupting anything?

It would make everything related a lot (a TON) slower, and it would
break all client code that uses dchar as the element type, or is
otherwise unprepared to use Graphemes explicitly. There is no question
there will be disruption.

I would have agreed with you last week. Now I understand that using
dchar is just as useless for Unicode as using char.
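
To make that concrete (a minimal illustration, not anyone's proposed
API): even after decoding to dchar, one user-perceived character can
span several elements.

import std.stdio;

void main()
{
    // "e" + U+0301 COMBINING ACUTE ACCENT: one character on screen
    string s = "e\u0301";

    size_t n;
    foreach (dchar c; s)   // decodes UTF-8 into code points, not characters
        ++n;

    writeln(n);   // prints 2: two dchars for a single visible character
}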

Will it be slower? Perhaps. A TON slower? Probably not.

But it will be correct. Correct and slow is better than incorrect and
fast. If I showed you a shortest-path algorithm that ran in O(V) time,
but didn't always find the shortest path, would you call it a success?

We need to get some real numbers together. I'll see what I can create
for a type, but someone else needs to supply the input :) I'm short on
Unicode data, and any attempts I've made to create some have ended in
failure. I have one example of one composed character in this thread
that I can cling to, but in order to supply some real numbers, we need a
large amount of data.

-Steve

Hello Steve & Andrei,


I see 2 questions: (1) should we provide Unicode correctness by default, together with the related points of level of abstraction and normalisation? (2) what is the best way to implement such correctness? Let us put (1) aside for a while; nothing prevents us from experimenting while waiting for an agreement, and such experiments would in fact feed the debate with real facts instead of "airy" ideas.

It seems there are 2 opposite approaches to Unicode correctness. Mine was to build a type that systematically abstracts away the issues created by UCS: that real, whole characters are coded by mini-arrays of code points I call "code piles", that those piles have variable lengths, _and_ that characters may even have several representations. My wild guess was that every text-manipulation method on such a type should then obviously be "flash fast", actually faster than any on-the-fly algorithm by several orders of magnitude. But Michel has made me doubt that point.
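
For concreteness, here is a minimal sketch of the piling idea -- not
Text's actual code. It uses std.uni.isMark as the combining-mark test
(substitute your own table if your Phobos lacks it) and leaves
normalisation as a comment:

import std.uni : isMark;   // combining-mark test; assumption, see above
import std.utf : decode;

// Eager approach: pay the piling cost once at construction, then every
// operation works on whole characters. A real implementation would also
// normalise (NFD or NFC) each pile here.
struct Text
{
    dstring[] piles;   // one entry per user-perceived character

    this(string s)
    {
        size_t i = 0;
        dchar[] cur;
        while (i < s.length)
        {
            dchar c = decode(s, i);
            if (cur.length && !isMark(c))   // next base char: flush pile
            {
                piles ~= cur.idup;
                cur = null;
            }
            cur ~= c;
        }
        if (cur.length) piles ~= cur.idup;
    }

    dstring opIndex(size_t i) const { return piles[i]; }
    @property size_t length() const { return piles.length; }
}

With this, indexing is a plain array lookup, and comparing two
characters compares whole piles at once.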

The other approach is precisely to provide the needed abstraction ("piling" and normalisation) on the fly, as Michel proposed, and as Objective-C does, IIUC. This approach seems to me closer to a kind of redesign of Steven's new String type and/or Andrei's VLERange.
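
The same piling can be done lazily. A sketch (again, real grapheme
segmentation has more rules than "base plus trailing marks"):

import std.uni : isMark;   // same assumption as above
import std.utf : decode;

// On-the-fly approach: extract the pile at the front of a UTF-8 string,
// building nothing up front. Each pass re-does the decoding work.
dstring frontPile(string s)
{
    assert(s.length, "empty input");
    size_t i = 0;
    dchar[] pile;
    pile ~= decode(s, i);          // base code point
    while (i < s.length)
    {
        size_t j = i;
        dchar c = decode(s, j);
        if (!isMark(c)) break;     // next base character starts here
        pile ~= c;                 // combining mark joins the current pile
        i = j;
    }
    return pile.idup;
}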

As you say, we need real timing numbers to decide. I think we should measure at least 2 routines (a rough harness follows below):
* indexing (or better, iteration?), which only requires "piling"
* counting occurrences of a given character or slice, which requires both piling and normalisation
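
Such a harness could look like this. It assumes std.uni.byGrapheme
exists in your Phobos; otherwise substitute a hand-rolled walker like
frontPile above:

import std.datetime.stopwatch : benchmark;
import std.file : readText;
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;   // assumption: available in your Phobos

void main()
{
    string data = readText("unicode.txt");   // the sample text linked below

    auto r = benchmark!(
        () => data.walkLength,               // code-point iteration only
        () => data.byGrapheme.walkLength     // piling on the fly
    )(100);

    writeln("dchar iteration:    ", r[0]);
    writeln("grapheme iteration: ", r[1]);
}

For the counting test one would additionally normalise each pile before
comparing; std.uni has a normalize function for that, if your version
carries it.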

I do not feel like implementing these routines for the on-the-fly version, and have no time for it in the coming days; but if anyone volunteers, feel free to rip code and data from Text's current implementation if it helps.

As source text, we can use the one at https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/data/unicode.txt (already my source for perf measures). Its only merit is that it is a text (about Unicode!) in twelve rather different languages.

[My intuitive guess is that Michel is wrong by orders of magnitude -- but then again, I know next to nothing about code performance.]


Denis
_________________
life is strange
spir.wikidot.com
