string repr & range levels [was: Re: VLERange: ...]

spir Wed, 19 Jan 2011 02:39:54 -0800

On 01/19/2011 08:43 AM, Ali Çehreli wrote:

Michel Fortin wrote:
 > On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.for...@michelf.com>
 > said:


 > So perhaps the best interface for strings would be to provide multiple
 > range-like interfaces that you can use at the level you want.

That's what I've been thinking. The users can choose whether they want
random access or not. A grapheme-aware string can provide random access
at a space cost, or no random access for efficient space use.

I see 5 layers in string processing. Layers 1 and 2 are currently
handled by D, sometimes in an unclear way. e.g. char[] may be used as an
array of code units or an array of code points depending on the type of
iteration.

This is a very good and helpful summary. But you do not list all
relevant aspects of the question, I guess. Defining which codes belong
to a given grapheme (what I call "piling") is necessary for true O(1)
random access, but not only for that. More importantly, all operations
involving equality comparison (find, count, replace, ...) require
normalisation, in addition to piling.
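To make the comparison point concrete, here is a minimal sketch. It is
written in Python rather than D, only because Python's standard
unicodedata module keeps the demonstration short; the helper name nfc
and the sample strings are assumptions chosen for illustration.

```python
import unicodedata

# Two codings of the same user-perceived character "é".
precomposed = "\u00e9"   # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # base letter + COMBINING ACUTE ACCENT


def nfc(text: str) -> str:
    """Hypothetical helper: bring text to Unicode normalisation form NFC."""
    return unicodedata.normalize("NFC", text)


# Without normalisation, equality and search both miss the match.
raw_equal = precomposed == decomposed                    # False
haystack = "caf" + decomposed
raw_find = haystack.find(precomposed)                    # -1

# After normalising both sides, they agree.
normalised_equal = nfc(precomposed) == nfc(decomposed)   # True
normalised_find = nfc(haystack).find(nfc(precomposed))   # 3
```

Any of find, count or replace would show the same behaviour, since all
of them rest on the same equality test.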

A few notes:

1) Code units: This is what D provides with its string types

This layer models RandomAccessRange

This level is a pure implementation artifact that simply cannot make any
sense (from the user's, and thus the programmer's, point of view).
Any kind of text manipulation (slice, find, replace...) may lead to
random incorrectness, except when source texts can be guaranteed to hold
plain ASCII (which may be hard to prove).
Conversely, pieces of text that are only passed around by an app do not
require any more costly representation, in terms of time (decoding) or
space. In addition, concat works provided all pieces share the same
encoding (ASCII being a subset of most historic charsets and of UTF-8).
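A small sketch of the slicing hazard at the code unit level, again in
Python for brevity. Here bytes stands in for an array of UTF-8 code
units; the helper is_valid_utf8 and the sample words are assumptions
made for the example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes as well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


units = "na\u00efve".encode("utf-8")   # "naïve": the ï takes 2 code units
unit_count = len(units)                # 6 code units for 5 characters

# Slicing by code unit index can cut a multi-unit sequence in half.
broken_slice_ok = is_valid_utf8(units[:3])                    # False

# The same slice on plain ASCII is harmless.
ascii_slice_ok = is_valid_utf8("naive".encode("utf-8")[:3])   # True

# Concatenating whole pieces that share one encoding stays valid.
joined = "caf".encode("utf-8") + "\u00e9".encode("utf-8")
concat_ok = joined.decode("utf-8") == "caf\u00e9"             # True
```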

2) Code points: This is what D and Phobos provide for example with
foreach(d; stride(s, 1))

dchar[] models RandomAccessRange at this layer

char[] and wchar[] model ForwardRange at this layer

(If I understand it correctly, Steven Schveighoffer is trying to provide
a pseudo-RandomAccessRange to char[] and wchar[] with his string type.)

This level is also a kind of implementation artifact, compared to
historic charsets, but it is actually based on a real fact of natural
languages: they hold composite characters, which can thus be coded by
combining lower-level codes that represent "scripting marks" (base &
combining ones).
For this reason, this level can make some sense. My latest guess is that
apps that consider text as an object of study (read: linguistic apps),
instead of a means, may regularly need to operate at this level, in
addition to the next one.
Normalisation can be applied at this level, and is necessary for the
above kind of use case. But using it for operations requiring comparison
will typically also require "piling", that is the next level, if only to
determine what is to be compared.
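The base & combining structure can be inspected directly. The sketch
below uses Python, whose str is a sequence of code points; the sample
text is an assumption chosen for illustration.

```python
import unicodedata

text = "e\u0301a"   # "é" coded as base + combining mark, followed by "a"

count = len(text)   # 3 code points for 2 user-perceived characters
names = [unicodedata.name(c) for c in text]

# Canonical combining class: 0 for base codes, non-zero for most marks.
classes = [unicodedata.combining(c) for c in text]   # [0, 230, 0]

# Working code point by code point is still unsafe for characters:
# reversing moves the accent onto the wrong base.
reversed_text = unicodedata.normalize("NFC", text[::-1])   # "áe"
```

The reversal shows why operations on characters need to know which
codes belong together before they touch anything.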

3) Graphemes: This is what the string type that spir is working on
provides. There could be at least two types:


This is the meaningful level for, probably, nearly all applications.

3a) RandomAccessGraphemeRange: Has random access but the data type is large

I guess this is Text's approach? Text is "flash fast" indeed for any
operation benefiting from random access. But not only that: since it
normalises its input, it should be far faster for any operation using
comparison (rough evaluations suggest a speed ratio of 1 to 2 orders of
magnitude).
The cost is high in terms of space, which in turn certainly reduces its
speed gain in the general case, because of cache (miss) effects. (Thank
you Michel for making this clear.)
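A rough sketch of what normalising and piling at construction time can
look like, in Python. This is not Text's actual code: the function name
pile is hypothetical, NFD is picked arbitrarily as the normal form, and
the grouping rule (attach every code with a non-zero combining class to
the preceding base) is a deliberate simplification of the full Unicode
grapheme cluster rules.

```python
import unicodedata
from typing import List


def pile(text: str) -> List[str]:
    """Simplified piling: normalise, then group each base code with the
    combining marks that follow it. Returns one string per character."""
    piles: List[str] = []
    for code in unicodedata.normalize("NFD", text):
        if piles and unicodedata.combining(code) != 0:
            piles[-1] += code      # mark: attach to the current pile
        else:
            piles.append(code)     # base code: start a new pile
    return piles


a = pile("Zo\u00eb!")     # "Zoë!" with precomposed ë
b = pile("Zoe\u0308!")    # "Zoë!" with e + COMBINING DIAERESIS

third = a[2]              # O(1) access to the third character
same = a == b             # True: both inputs were normalised
```

The space cost is visible here too: one separately stored element per
character instead of a flat array of codes.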

3b) ForwardGraphemeRange: space-efficient but does not provide random
access

Is this what Andrei expects, namely a Grapheme type with a corresponding
ByGrapheme iterator, IIUC?

Time efficiency of operations?

3c) Metadata RandomAccessGraphemeRange

Michel Fortin suggested (off list) an alternative approach to Text:
instead of actually "piling" at construction time, just store metadata
about grapheme bounds. The core benefit is indeed to keep "normal" text
storage (meaning *char[], for modification): would this point please
Andrei better?
I let you evaluate the various consequences of this change (mostly
positive, I guess). The same metadata principle could certainly be used
for further optimisations, but this is another story.
I'm motivated to implement this variant; it looks like the best of both
worlds to me. (support welcome ;-)
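One possible shape for the metadata idea, sketched in Python under
stated assumptions: the class name BoundedText and its fields are
hypothetical, UTF-8 bytes stand in for char[] storage, and grapheme
bounds are found with the same simplified rule as above (a code with a
non-zero combining class continues the current grapheme). No
normalisation is done here, so comparison would still need it.

```python
import unicodedata
from typing import List


class BoundedText:
    """Hypothetical sketch: untouched UTF-8 storage plus a table of the
    code unit offset at which each grapheme starts."""

    def __init__(self, text: str) -> None:
        self.storage: bytes = text.encode("utf-8")   # "normal" storage
        self.starts: List[int] = []
        offset = 0
        for code in text:
            # A base code (or the very first code) opens a new grapheme.
            if offset == 0 or unicodedata.combining(code) == 0:
                self.starts.append(offset)
            offset += len(code.encode("utf-8"))

    def __len__(self) -> int:
        return len(self.starts)

    def __getitem__(self, index: int) -> str:
        if not 0 <= index < len(self.starts):
            raise IndexError("grapheme index out of range")
        start = self.starts[index]
        last = index + 1 == len(self.starts)
        end = len(self.storage) if last else self.starts[index + 1]
        return self.storage[start:end].decode("utf-8")


t = BoundedText("Zoe\u0308!")
```

With this layout t[2] yields the whole "e" + diaeresis pair in O(1),
while t.storage remains the plain encoded text.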

I think the programmers would be happy to be able to choose.

4) Letters: Uses either 3a or 3b. This is the layer where the idea of a
writing system enters the picture: lower/upper case transformations and
sorting happen at this layer. (I have a library that tries to handle
this layer but is ignorant of graphemes; I am waiting for spir's string
type. ;))

4a) Models RandomAccessRange if based on a RandomAccessGraphemeRange

4b) Models ForwardRange if based on a ForwardGraphemeRange

I do not understand what this level means. For me, letters are,
precisely, archetypical true characters, meaning level 3.

[Note: "grapheme", used by Unicode to denote the common sense of
"character", is simply wrong: "sh" and "ti" are graphemes in English
(for the same phoneme /ʃ/), not characters; and tab, §, or © are
probably not considered graphemes by linguists, while they are
characters. This is the reason why I try to avoid this term and use
"character", like ICU's doc, to avoid even more confusion.]

5) Text: Collection of Letters. This is where a name like "ali & tim" is
correctly capitalized as "ALİ & TIM" because the text consists of two
separate writing systems. (The same library that I mentioned in 4 tries
to handle this layer as well.)

This is an immensely complicated field. Note that it has nothing to do
with text & character representation issues: whatever the character set,
one has to confront problems like the uppercase of 'i', 'ss' vs 'ß', the
definition of "letter" or "character", matching, sorting order...
Text does not even try to address natural language issues. Instead it
deals only, but hopefully clearly & correctly, with restoring a simple
and safe representation for client apps.
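The casing problems mentioned above show up even with a flawless
representation. A short illustration using Python's built-in,
locale-independent case mappings (the variable names are only for the
example):

```python
# Default Unicode case mapping knows nothing about the writing system.
upper_name = "ali & tim".upper()     # "ALI & TIM": no dotted capital İ
upper_eszett = "\u00df".upper()      # 'ß' becomes the two letters "SS"

# Mappings are not always reversible, nor one code to one code.
round_trip = "\u00df".upper().lower()    # "ss", not 'ß'
lower_dotted_i = "\u0130".lower()        # 'İ' gives "i" + COMBINING DOT ABOVE
```

Getting "ALİ & TIM" requires knowing which writing system each word
belongs to, which is information the codes alone do not carry.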

Ali


Denis
_________________
vita es estrany
spir.wikidot.com
