On 01/19/2011 08:43 AM, Ali Çehreli wrote:
Michel Fortin wrote:
 > On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.for...@michelf.com>
 > said:

 > So perhaps the best interface for strings would be to provide multiple
 > range-like interfaces that you can use at the level you want.

That's what I've been thinking. The users can choose whether they want
random access or not. A grapheme-aware string can provide random access
at a space cost, or no random access for efficient space use.

I see 5 layers in string processing. Layers 1 and 2 are currently
handled by D, sometimes in an unclear way. e.g. char[] may be used as an
array of code units or an array of code points depending on the type of
iteration.

This is very good and helpful summary. But you do not list all relevant aspects of the question, I guess. Defining which codes belong to a given grapheme (what I call "piling") is necessary for true O(1) random-access, but not only. More importantly, all operations involving equality comparison (find, count, replace,...) require normalisation --in addition to piling.
A few notes:

1) Code units: This is what D provides with its string types

This layers models RandomAccessRange

This level is pure implementation artifact that simply cannot make any sense. (from user and thus programmer points of view) Any kind of text manipulation (slice, find, replace...) may lead to random incorrectness, except when source texts can be guaranteed to hold plain ASCII (which may be hard to prove). Conversely, pieces of text only passed around by an app do not require any more costly representation, in terms of time (decoding) or space. In addition, concat works provided all pieces share the same encoding (ASCII beeing a subset of most historic charsets and of UTF-8).

2) Code points: This is what D and Phobos provide for example with
foreach(d; stride(s, 1))

dchar[] models RandomAccessRange at this layer

char[] and wchar[] model ForwardRange at this layer

(If I understand it correctly, Steven Schveighoffer is trying to provide
a pseudo-RandomAccessRange to char[] and wchar[] with his string type.)

This level is also a kind of implementation artifact, compared to historic charsets, but actually based on a real fact of natural languages: they hold composite characters that can thus be coded by combining lower-level codes which represent "scripting marks" (base & combining ones). For this reason, this level can have some sense. My latest guess is that apps that consider text as a study object (read linguistic apps), instead of a means, may regurarly need operating at this level, in addition to the next one. Normalisation can be applied at this level --and is necessary for the above kind of use case. But using it for operations requiring compare will typically also require "piling", that is the next level, if only to determine what is to be compared.

3) Graphemes: This is what the string type that spir is working on.
There could be at least two types:

This is the meaningful level for, probably, nearly all applications.

3a) RandomAccessGraphemeRange: Has random access but the data type is large

I guess this is Text's approach? Text is "flash fast" indeed for any operation benefiting from random-access. But not only: since it normalises its input, it should be far faster for any operation using compare (rough evaluations suggest a speed ratio of 1 to 2 orders of magnitude). The cost is high in terms of space, which in turn certainly reduces its speed gain in the general case, because to cache (miss) effects. (Thank you Michel for making this clear.)

3b) ForwardGraphemeRange: space-efficient but does not provide random
access

Is it what Andrei expects, namely a Grapheme type with a corresponding ByGrapheme iterator IIUC?
Time efficiency of operations?

3) metadata RandomAccessGraphemeRange

Michel Fortin suggested (off list) an alternative approach to Text: instead of actually "piling" at construction time, just store metadata upon grapheme bounds. The core benefit is indeed to keep "normal" text storage (meaning *char[], for modification): would this point please Andrei better? I let you evaluate various consequences of this change (mostly positive, I guess). The same metadata principle could certainly be used for further optimisations, but this is another story. I'm motivated to implement this variant, looke like best of both worlds tome. (support welcome ;-)

I think the programmers would be happy to be able to choose.

4) Letters: Uses either 3a or 3b. This is the layer where the idea of a
writing system enters the picture: lower/upper case transformations and
sorting happen at this layer. (I have a library that tries to handle
this layer but is ignorant of graphemes; I am waiting for spir's string
type. ;))

4a) Models RandomAccessRange if based on a RandomAccessGraphemeRange

4b) Models ForwardRange if based on a ForwardGraphemeRange

I do not understand what this level means. For me, letters are, precisely, archetypical true characters, meaning level 3.

[Note: "grapheme", used by Unicode to denote the common sense of "character", is simply wrong: "sh" and "ti" are graphemes in english (for the same phoneme /ʃ/), not characters; and tab, §, or © are probalby not considered graphemes by linguists, while they are characters. This is the reason why I try to avoid this term and use "character", like ICU's doc, to avoid even more confusion.]

5) Text: Collection of Letters. This is where a name like "ali & tim" is
correctly capitalized as "ALİ & TIM" because the text consists of two
separate writing systems. (The same library that I mentioned in 4 tries
to handle this layer as well.)

This is an immensely complicated field. Note that it has nothing to do with text & character representation issues: whatever the character set, one has to confront problems like uppercase of 'i', 'ss' vs 'ß', definiton of "letter" or "character", matching, sorting order... Text does not even try to address natural language issues. Instead it deals onl,y but hopefully clearly & correctly, with restoring simple and safe representation for client apps.

Ali

Denis
_________________
vita es estrany
spir.wikidot.com

Reply via email to