On 2011-01-11 20:28:26 -0500, Steven Wawryk <stev...@acres.com.au> said:

Sorry if I'm jumping in here without the appropriate background, but I don't understand why jumping through these hoops is necessary. Please let me know if I'm missing anything.

Many problems can be solved by another layer of indirection. Isn't a string essentially a bidirectional range of code points built on top of a random access range of code units?

Actually, displaying a UTF-8/UTF-16 string involves a range of glyphs layered over a range of graphemes layered over a range of code points layered over a range of code units. Glyphs represent the visual characters you can get from a font; they often map one-to-one with graphemes, but not always (ligatures, for instance). Graphemes are what people generally reason about when they see text (the so-called "user-perceived characters"); they often map one-to-one with code points, but not always (combining marks, for instance). Code points are standardized codes representing the various elements of a string, and code units basically encode the code points.
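To make the lower three layers concrete, here is a minimal sketch in Python (chosen only for illustration; it is not D and not the range design under discussion). The helper names `code_units`, `code_points`, and `approx_graphemes` are hypothetical, and the grapheme step is a deliberate simplification that only attaches combining marks to the preceding base character. It is not the full Unicode segmentation algorithm, and it says nothing about glyphs, which depend on the font.

```python
import unicodedata

def code_units(text):
    """UTF-8 code units (bytes) that encode the string."""
    return list(text.encode("utf-8"))

def code_points(text):
    """Code points, one integer per Python str element."""
    return [ord(ch) for ch in text]

def approx_graphemes(text):
    """Rough user-perceived characters (assumption: only combining
    marks extend a cluster; real segmentation has more rules)."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch   # mark joins the previous base character
        else:
            clusters.append(ch)  # a new cluster starts here
    return clusters

# 'e' + COMBINING ACUTE ACCENT + 'a'
text = "e\u0301a"

print(len(code_units(text)))        # number of UTF-8 code units
print(len(code_points(text)))       # number of code points
print(len(approx_graphemes(text)))  # number of approximate graphemes
```

For this input the three counts all differ, which is the point: each layer is a different sequence over the same underlying data.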

If you're writing a parser for XML, JSON, or some other format, you'll probably care about code points. If you're advancing the insertion point in a text field or counting the number of user-perceived characters, you'll probably want to deal with graphemes. For searching for a substring inside a string, or comparing strings, you'll probably want to deal with either graphemes or collation elements (collation elements are layered on top of code points). To print a string, you'll need to map graphemes to the glyphs of a particular font.

Reducing string operations to code point manipulations will only work as long as all your graphemes, collation elements, or glyphs map one-to-one with code points.
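A small illustration of where that one-to-one assumption breaks, again sketched in Python rather than D. The function names `same_code_points` and `canonically_equal` are hypothetical, and NFC normalization is used here only as a stand-in for a proper grapheme-aware or collation-aware comparison; it handles canonical equivalence, not full collation.

```python
import unicodedata

precomposed = "\u00e9"  # 'é' as a single code point
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

def same_code_points(a, b):
    """Naive comparison: equal only if the code point sequences match."""
    return a == b

def canonically_equal(a, b):
    """Compare after NFC normalization (assumption: canonical
    equivalence is the notion of equality the caller wants)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(same_code_points(precomposed, decomposed))   # False
print(canonically_equal(precomposed, decomposed))  # True
```

Both strings show the same user-perceived character, yet a search or comparison done purely on code points treats them as different.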


It seems to me that each abstraction separately already fits within the existing D range framework and all the difficulties arise as a consequence of trying to lump them into a single abstraction.

It's true that each of these abstractions can fit within the existing range framework.


Why not choose which of these abstractions is most appropriate in a given situation instead of trying to shoe-horn both concepts into a single abstraction, and provide for easy conversion between them? When character representation is the primary requirement then make it a bidirectional range of code points. When storage representation and random access is required then make it a random access range of code units.

I think you're right. The need for a new concept isn't that great, and it gets complicated really fast.


--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/
