Vitaly Magerya scripsit:

> But there would be no `substring'; at least not with integer range,
> right? If chars are gone, then `string-length' as well as integer
> indexes are also gone. So the general way to decompose a string
> would be to use functions that split strings at occurrences
> of given substrings (or patterns).

I've backed off a bit from my original expansive claims.  The codepoint
level is indeed the lowest level that should be accessible: if you want
a lower level than that, convert the string to a bytevector using a
specified encoding.  So codepoints have a special status that can be
used by substring and string-length. [*]
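
A minimal sketch of the two levels, using Python 3 rather than Scheme purely
for illustration: its str type happens to be indexed by codepoint, and going
lower requires naming an encoding. The variable names are illustrative only.

```python
# 'naive' with a diaeresis, written so that U+00EF is one codepoint.
s = "na\u00efve"

# Codepoint level: length and substring take integer codepoint indices.
length = len(s)          # counts codepoints, not bytes
middle = s[2:4]          # codepoints 2 and 3

# Lower level: convert to bytes under an explicitly specified encoding.
as_utf8 = s.encode("utf-8")
as_utf16 = s.encode("utf-16-le")

# The same five codepoints occupy different numbers of bytes per encoding.
utf8_size = len(as_utf8)     # U+00EF needs two bytes in UTF-8
utf16_size = len(as_utf16)   # five 16-bit code units
```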

However, characters are now identified with single-character strings and
represented as such on the Scheme level.  At the implementation level,
single-character strings may well have a special representation, provided
-- as I also think appropriate -- that strings are made immutable.
(If you want mutable sequences of codepoints, vectors are your friend.)
The #\ syntax could be retained for backward compatibility.
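
Python 3 is one existing language that works roughly this way, so it can
serve as a sketch of the idea (it is an analogy, not the proposed Scheme
design): indexing a string yields a one-character string, strings cannot be
modified in place, and a mutable list of codepoints stands in for a vector.

```python
s = "cat"

# A "character" is just a string of length one.
first = s[0]
is_single_char_string = isinstance(first, str) and len(first) == 1

# Strings are immutable: in-place assignment is rejected.
try:
    s[0] = "b"
    mutation_allowed = True
except TypeError:
    mutation_allowed = False

# For a mutable sequence of codepoints, use a separate container.
codepoints = [ord(c) for c in s]
codepoints[0] = ord("b")
rebuilt = "".join(chr(cp) for cp in codepoints)
```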

[*] Riastradh thinks allowing user-chosen indices restricts implementers
too much, and thinks we should have an opaque "string position" type,
maybe integers and maybe not, with external iterators for moving backward
and forward in strings by codepoint.  I'm sympathetic to this view but
haven't adopted it.
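
A rough sketch of what an opaque position type with external iterators could
look like, assuming an implementation that stores strings as UTF-8. The names
Position and Utf8String and their methods are hypothetical, invented here for
illustration. With this storage, stepping a position by one codepoint is
cheap, whereas resolving an arbitrary integer index would need a scan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    """Opaque cursor; callers are not meant to compute with _offset."""
    _offset: int


class Utf8String:
    """Hypothetical immutable string stored as UTF-8 bytes."""

    def __init__(self, text):
        self._data = text.encode("utf-8")

    def start(self):
        return Position(0)

    def end(self):
        return Position(len(self._data))

    def next(self, pos):
        """Move forward by one codepoint."""
        if pos._offset >= len(self._data):
            raise IndexError("already at end of string")
        i = pos._offset + 1
        # Skip UTF-8 continuation bytes (bit pattern 10xxxxxx).
        while i < len(self._data) and (self._data[i] & 0xC0) == 0x80:
            i += 1
        return Position(i)

    def prev(self, pos):
        """Move backward by one codepoint."""
        if pos._offset <= 0:
            raise IndexError("already at start of string")
        i = pos._offset - 1
        while i > 0 and (self._data[i] & 0xC0) == 0x80:
            i -= 1
        return Position(i)

    def ref(self, pos):
        """Return the single-character string at pos."""
        stop = self.next(pos)
        return self._data[pos._offset:stop._offset].decode("utf-8")


# Walk a string containing 1-, 2-, and 3-byte codepoints.
s = Utf8String("a\u00e9\u20ac")
chars = []
pos = s.start()
while pos != s.end():
    chars.append(s.ref(pos))
    pos = s.next(pos)
```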

> Can you split a string into some logical parts that map to characters
> in en-us text, but possibly to other things in other languages?

I assume you are talking about things like Spanish "ll" and Welsh "ngh",
which are considered single letters.  String iterators can be provided
for all sorts of units, some locale-independent (codepoints, default
grapheme clusters or DGCs), others locale-dependent.
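
As a sketch of a locale-dependent iterator, the splitter below takes the
locale's multi-character letters as a parameter and prefers the longest
match. The function name split_letters and the idea of passing the table as
an argument are assumptions made for illustration; a real design would take
the table from locale data and would also need to handle case.

```python
def split_letters(text, multigraphs):
    """Split text into letters, treating each multigraph as one letter."""
    # Try longer multigraphs first; ignore empty entries so we always advance.
    ordered = sorted((m for m in multigraphs if m), key=len, reverse=True)
    letters = []
    i = 0
    while i < len(text):
        for m in ordered:
            if text.startswith(m, i):
                letters.append(m)
                i += len(m)
                break
        else:
            # No multigraph starts here: fall back to a single codepoint.
            letters.append(text[i])
            i += 1
    return letters


# With "ll" as a letter versus with no multigraphs at all.
with_ll = split_letters("llama", ["ll"])
plain = split_letters("llama", [])
```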

> Are there such splittings that make sense for all languages (locales)?

DGCs do: a DGC is either a Korean syllable or a base character with any
associated diacritics.  Unicode defines the details.  Roughly, a DGC is
"what a user thinks of as a character".
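
A deliberately rough approximation of this grouping, in Python: attach every
codepoint whose general category is a mark (Mn, Mc, Me) to the preceding
base. The function name rough_clusters is made up for illustration, and this
is not the full Unicode algorithm; in particular it handles only precomposed
Korean syllables, not sequences of conjoining jamo, and it ignores the other
rules Unicode specifies.

```python
import unicodedata


def rough_clusters(text):
    """Group each base codepoint with the combining marks that follow it."""
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if clusters and is_mark:
            clusters[-1] += ch      # extend the current cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters


# "e" followed by U+0301 COMBINING ACUTE ACCENT is two codepoints
# but one user-perceived character.
accented = rough_clusters("e\u0301a")

# Precomposed Korean syllables are single codepoints already.
korean = rough_clusters("\ud55c\uae00")
```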

-- 
Long-short-short, long-short-short / Dactyls in dimeter,     John Cowan
Verse form with choriambs / (Masculine rhyme):           [email protected]
One sentence (two stanzas) / Hexasyllabically
Challenges poets who / Don't have the time.     --robison who's at texas dot net

_______________________________________________
r6rs-discuss mailing list
[email protected]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss
