Re: VLERange: a range in between BidirectionalRange and RandomAccessRange

Steven Schveighoffer Fri, 14 Jan 2011 04:50:28 -0800

On Fri, 14 Jan 2011 01:44:19 -0500, Nick Sabalausky <a@a.a> wrote:

"Andrei Alexandrescu" <seewebsiteforem...@erdani.org> wrote in message
news:igoqrm$1n5r$1...@digitalmars.com...

On 1/13/11 10:26 PM, Nick Sabalausky wrote:
[snip]

[ 'f', {u with the umlaut}, 'n', 'f' ]

Or:

[ 'f', 'u', {umlaut combining character}, 'n', 'f' ]

Those *both* get rendered exactly the same, and both represent the same
four-letter sequence. In the second example, the 'u' and the {umlaut
combining character} combine to form one grapheme. The f's and n's just
happen to be single-code-point graphemes.
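
To make the two encodings concrete, here is a minimal sketch. It is written in Python only because its standard unicodedata module makes the point easy to check; nothing in it is D or Phobos API, and the helper name code_points is made up for illustration.

```python
import unicodedata

# Two encodings of the same four-letter word.
composed = "f\u00fcnf"      # U+00FC: precomposed u-with-umlaut
decomposed = "fu\u0308nf"   # 'u' followed by U+0308 COMBINING DIAERESIS

def code_points(s):
    """List the code points of s as U+XXXX strings (illustrative helper)."""
    return [f"U+{ord(c):04X}" for c in s]

print(code_points(composed))    # 4 code points
print(code_points(decomposed))  # 5 code points

# Compared code point by code point, the two strings differ...
print(composed == decomposed)

# ...but normalizing both to the same form makes them equal.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed, nfd == decomposed)
```

Both strings render as the same word, but a naive element-by-element comparison treats them as different until they are brought to a common normalization form.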

Note that while some characters exist in pre-combined form (such as the
{u with the umlaut} above), legend has it there are others that can only be
represented using a combining character.

It's also my understanding, though I'm not certain, that sometimes multiple
combining characters can be used together on the same "root" character.

Thanks. One further question is: in the above example with u-with-umlaut,
there is one code point that corresponds to the entire combination. Are
there combinations that do not have a unique code point?

My understanding is "yes". At least that's what I've heard, and I've never
heard any claims of "no". I don't know of any specific ones offhand, though.
Actually, it might be possible to use any combining character with any old
letter or number (like maybe a 7 with an umlaut), though I'm not certain.
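
The "7 with an umlaut" guess can be checked with a short sketch (again Python's standard unicodedata module, used purely as a convenient test bench; the variable names are illustrative). Composing normalization collapses a pair only when a precomposed code point exists for it.

```python
import unicodedata

u_umlaut = "u\u0308"       # 'u' + U+0308 COMBINING DIAERESIS
seven_umlaut = "7\u0308"   # '7' + U+0308 COMBINING DIAERESIS

composed_u = unicodedata.normalize("NFC", u_umlaut)
composed_seven = unicodedata.normalize("NFC", seven_umlaut)

# 'u' + diaeresis has a precomposed form (U+00FC), so it becomes 1 code point.
print(len(composed_u))

# '7' + diaeresis has no precomposed form, so it stays as 2 code points
# even after composition.
print(len(composed_seven))
```

So the sequence is valid and renders as one grapheme, but there is no single code point for it.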

FWIW, the Wikipedia article might help, or at least link to other things
that might help: http://en.wikipedia.org/wiki/Combining_character


http://en.wikipedia.org/wiki/Unicode_normalization

Linked from that page, the normalization process is probably something we
need to look at. Using decomposed canonical form would mean we need more
state than just which code unit we are on, plus it creates more likelihood
that a match will be found with part of a grapheme (spir or Michel brought
it up earlier). So I think the correct choice is to use composed canonical
form. This is after just reading that page, so maybe I'm missing
something.
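
A minimal sketch of the partial-match problem, in Python for easy checking (standard unicodedata module only; not a statement about how any D library behaves). Searching for a bare 'u' succeeds in the decomposed string even though no standalone 'u' grapheme is present, and fails in the composed one.

```python
import unicodedata

decomposed = "fu\u0308nf"                          # 'u' + combining diaeresis
composed = unicodedata.normalize("NFC", decomposed)  # u-with-umlaut as U+00FC

# In decomposed form the base letter is its own element, so a search for
# plain 'u' matches the first half of the u-with-umlaut grapheme.
found_in_decomposed = "u" in decomposed

# In composed form the grapheme is a single code point, so plain 'u'
# is not found.
found_in_composed = "u" in composed

print(found_in_decomposed, found_in_composed)
```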

Non-composable combinations would be a problem. The string range is
formed on the basis that the element type is a dchar. If there are
combinations that cannot be composed into a single dchar, then the element
type has to be a dchar array (or some other type which contains all the
info). The other option is to simply leave them decomposed. Then you
risk things like partial matches.

I'm leaning towards a solution like this: While iterating a string, it
should output dchars in normalized composed form. But a specialized
comparison function should be used when doing things like searches or
regex, because it might not be possible to compose two combining
characters.
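
One possible shape for such a specialized search, sketched in Python under stated assumptions: the function name find_whole is hypothetical, both inputs are normalized to composed form first, and a match is rejected if the code point right after it is a combining mark (nonzero canonical combining class). That test is a simplification of real grapheme boundary rules and does not check the start of the match.

```python
import unicodedata

def find_whole(haystack, needle):
    """Return the index of needle in NFC-normalized haystack, skipping
    matches that stop in the middle of a grapheme; -1 if none.
    Hypothetical helper, not an existing library function."""
    h = unicodedata.normalize("NFC", haystack)
    n = unicodedata.normalize("NFC", needle)
    start = h.find(n)
    while start != -1:
        end = start + len(n)
        # Accept only if the match is not followed by a combining mark.
        if end == len(h) or unicodedata.combining(h[end]) == 0:
            return start
        start = h.find(n, start + 1)
    return -1

text = "7\u0308 up"  # '7' + combining diaeresis cannot be composed

print(find_whole(text, "7"))                 # bare '7' is only part of a grapheme
print(find_whole(text, "7\u0308"))           # the whole grapheme
print(find_whole("fu\u0308nf", "\u00fc"))    # composed needle, decomposed haystack
```

Note that the returned index refers to the normalized string, which may be shorter than the original input.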

The drawback to this is that a dchar might not be able to represent a
grapheme (only if it cannot be composed), but I think it's too much of a
hit in complexity and performance to make the element type of a string
larger than a dchar.

Those who wish to work with a more comprehensive string type can use a
more complex string type such as the one created by spir.


Does that sound reasonable?

-Steve
