On 01/17/2011 01:44 PM, Steven Schveighoffer wrote:
On Sun, 16 Jan 2011 13:06:16 -0500, Andrei Alexandrescu
<seewebsiteforem...@erdani.org> wrote:

On 1/15/11 9:25 PM, Jonathan M Davis wrote:
Considering that strings are already dealt with specially in order to
have an element type of dchar, I wouldn't think that it would be all
that disruptive to make it so that they had an element type of Grapheme
instead. Wouldn't that then fix all of std.algorithm and the like
without really disrupting anything?

It would make everything related a lot (a TON) slower, and it would
break all client code that uses dchar as the element type, or is
otherwise unprepared to use Graphemes explicitly. There is no question
there will be disruption.

I would have agreed with you last week. Now I understand that using
dchar is just as useless for Unicode as using char.
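
To make that concrete (a minimal illustration, not anyone's proposed
API): even after decoding to dchar, one user-perceived character can
span several elements.

import std.stdio;

void main()
{
    // "e" + U+0301 COMBINING ACUTE ACCENT: one character on screen
    string s = "e\u0301";

    size_t n;
    foreach (dchar c; s)   // decodes UTF-8 into code points, not characters
        ++n;

    writeln(n);   // prints 2: two dchars for a single visible character
}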

Will it be slower? Perhaps. A TON slower? Probably not.

But it will be correct. Correct and slow is better than incorrect and
fast. If I showed you a shortest-path algorithm that ran in O(V) time,
but didn't always find the shortest path, would you call it a success?

We need to get some real numbers together. I'll see what I can create
for a type, but someone else needs to supply the input :) I'm short on
Unicode data, and any attempts I've made to create some have ended in
failure. I have one example of one composed character in this thread
that I can cling to, but in order to supply some real numbers, we need a
large amount of data.

-Steve

Hello Steve & Andrei,


I see 2 questions: (1) should we provide Unicode correctness by default, together with the related points of level of abstraction and normalisation? (2) what is the best way to implement such correctness? Let us put (1) aside for a while; nothing prevents us from experimenting while waiting for an agreement, and such experiments would in fact feed the debate with real facts instead of "airy" ideas.

It seems there are 2 opposite approaches to Unicode correctness. Mine was to build a type that systematically abstracts away the issues created by UCS: that real, whole characters are coded by mini-arrays of code points I call "code piles", that those piles have variable lengths, _and_ that characters may even have several representations. My wild guess was that every text-manipulation method on such a type should then obviously be "flash fast", actually faster than any on-the-fly algorithm by several orders of magnitude. But Michel has made me doubt that point.
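
For concreteness, here is a minimal sketch of the piling idea -- not
Text's actual code. It uses std.uni.isMark as the combining-mark test
(substitute your own table if your Phobos lacks it) and leaves
normalisation as a comment:

import std.uni : isMark;   // combining-mark test; assumption, see above
import std.utf : decode;

// Eager approach: pay the piling cost once at construction, then every
// operation works on whole characters. A real implementation would also
// normalise (NFD or NFC) each pile here.
struct Text
{
    dstring[] piles;   // one entry per user-perceived character

    this(string s)
    {
        size_t i = 0;
        dchar[] cur;
        while (i < s.length)
        {
            dchar c = decode(s, i);
            if (cur.length && !isMark(c))   // next base char: flush pile
            {
                piles ~= cur.idup;
                cur = null;
            }
            cur ~= c;
        }
        if (cur.length) piles ~= cur.idup;
    }

    dstring opIndex(size_t i) const { return piles[i]; }
    @property size_t length() const { return piles.length; }
}

With this, indexing is a plain array lookup, and comparing two
characters compares whole piles at once.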

The other approach is precisely to provide the needed abstraction ("piling" and normalisation) on the fly, as Michel proposed, and as Objective-C does, IIUC. This approach seems to me closer to a kind of redesign of Steven's new String type and/or Andrei's VLERange.
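
The same piling can be done lazily. A sketch (again, real grapheme
segmentation has more rules than "base plus trailing marks"):

import std.uni : isMark;   // same assumption as above
import std.utf : decode;

// On-the-fly approach: extract the pile at the front of a UTF-8 string,
// building nothing up front. Each pass re-does the decoding work.
dstring frontPile(string s)
{
    assert(s.length, "empty input");
    size_t i = 0;
    dchar[] pile;
    pile ~= decode(s, i);          // base code point
    while (i < s.length)
    {
        size_t j = i;
        dchar c = decode(s, j);
        if (!isMark(c)) break;     // next base character starts here
        pile ~= c;                 // combining mark joins the current pile
        i = j;
    }
    return pile.idup;
}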

As you say, we need real timing numbers to decide. I think we should measure at least 2 routines (a rough harness follows below):
* indexing (or better, iteration?), which only requires "piling"
* counting occurrences of a given character or slice, which requires both piling and normalisation
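
Such a harness could look like this. It assumes std.uni.byGrapheme
exists in your Phobos; otherwise substitute a hand-rolled walker like
frontPile above:

import std.datetime.stopwatch : benchmark;
import std.file : readText;
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;   // assumption: available in your Phobos

void main()
{
    string data = readText("unicode.txt");   // the sample text linked below

    auto r = benchmark!(
        () => data.walkLength,               // code-point iteration only
        () => data.byGrapheme.walkLength     // piling on the fly
    )(100);

    writeln("dchar iteration:    ", r[0]);
    writeln("grapheme iteration: ", r[1]);
}

For the counting test one would additionally normalise each pile before
comparing; std.uni has a normalize function for that, if your version
carries it.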

I do not feel like implementing these routines for the on-the-fly version, and have no time for it in the coming days; but if anyone volunteers, feel free to rip code and data from Text's current implementation if it helps.

As source text, we can use the one at https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/data/unicode.txt (already my source for perf measures). Its only merit is that it is a text (about Unicode!) in twelve rather different languages.

[My intuitive guess is that Michel is wrong by orders of magnitude -- but then again, I know next to nothing about code performance.]


Denis
_________________
life is strange
spir.wikidot.com
