Christopher Chittleborough scripsit:
> Perl 6 will support this view of strings (along with lots of
> other fancy stuff which I'll ignore here -- the details can be
> found at http://perlcabal.org/syn/S02.html).
Yes, Perl 6 does support the default grapheme cluster (DCG) way of life.
But it also supports the byte, codepoint, and characters-of-a-language
ways of life, and allows you to set whichever one you like in the
current lexical scope, just like declaring a variable. There are deeper
consequences. For instance, for most languages it suffices to have the
string-length function return an integer, although they may disagree on
which integer it is.

But not Perl 6. No, in that language, the string-length function
returns a StrLen, an opaque object which is automatically coercible to an
integer whose value depends on the current lexical scope! As the text you
pointed to says: "A given StrLen may know that it represents 18 bytes, 7
codepoints, 3 graphemes, and 1 letter in Malayalam" (a fictional example;
no Malayalam letter is so extravagant). Is this really what we want,
an integer (or integeroid) whose value depends on the lexical scope it
is evaluated in? Auuuughh!
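
To make the disagreement concrete, here is a minimal Python sketch (not
Perl 6, and not taken from the synopsis) that counts one string three ways.
The name `lengths` is made up for illustration, and the grapheme count is
only an approximation: it assumes every non-mark codepoint starts a cluster,
which is far cruder than the real default grapheme cluster rules.

```python
import unicodedata

def lengths(s: str) -> dict:
    """Count a string in bytes, codepoints, and (roughly) grapheme clusters."""
    # Approximation (assumption): a cluster starts at each codepoint whose
    # general category is not a mark (Mn, Mc, Me). Real segmentation has
    # many more rules (Hangul jamo, joiner sequences, CR LF, and so on).
    starts = sum(1 for ch in s if not unicodedata.category(ch).startswith("M"))
    return {
        "bytes": len(s.encode("utf-8")),
        "codepoints": len(s),
        "graphemes_approx": starts,
    }

# "A with circumflex and dot below", spelled with combining marks.
sample = "A\u0302\u0323"
print(lengths(sample))
# {'bytes': 5, 'codepoints': 3, 'graphemes_approx': 1}
```

Three honest answers to "how long is it?", and ordinary languages simply
pick one of them rather than returning all three in a trench coat.
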
This is what happens when a language implementation group has no need for
compatibility with anything, not even previous versions of the language
("the Perl5 to Perl6 compiler^Wtranslator will deal with that"), and has
neither commercial nor academic constraints on its schedule. But then,
why should Perl 6 ever reach 1.0? Natural languages, which are its
models, certainly never do.

> The character==grapheme approach makes strings simple and
> characters complicated.

It certainly does the latter, but (alas) without achieving the former.
Such a simple question as "What is the lowercase equivalent of capital
sigma?" cannot be answered in the DCG framework. The simplest answer
is "Final sigma at the end of a word, regular sigma elsewhere", which
requires the word rather than the DCG view of strings. However, this
will not quite do either. For example, when /filos/ appears with a period
after it, this may be the word 'friend' at the end of a sentence, in which
case it gets a final sigma, φιλος. Or it may be an abbreviation
for /filosofia/, in which case it gets a regular sigma, φιλοσ.
(In R6RS, char-downcase returns regular sigma always, and string-downcase
is expected to get the word issue correct, but not the subtlety about
abbreviations, for which the implementation would have to, like, *actually
understand Greek*.)
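
The same split between the character level and the string level can be
seen in Python, used here only as a convenient illustration. This sketch
assumes CPython's str.lower() applies the Unicode final-sigma context rule,
which to my knowledge it has done since 3.3; the helper name `sigma_kind` is
hypothetical.

```python
FINAL_SIGMA = "\u03c2"    # final sigma
REGULAR_SIGMA = "\u03c3"  # regular sigma

FILOS = "\u03a6\u0399\u039b\u039f\u03a3"              # capitalized /filos/
FILOSOFIA = FILOS + "\u039f\u03a6\u0399\u0391"        # capitalized /filosofia/

def sigma_kind(upper: str) -> str:
    """Lowercase the string and report which sigma its last sigma became."""
    for ch in reversed(upper.lower()):
        if ch == FINAL_SIGMA:
            return "final"
        if ch == REGULAR_SIGMA:
            return "regular"
    return "none"

print(sigma_kind("\u03a3"))     # lone capital sigma, no word context: regular
print(sigma_kind(FILOS))        # sigma at the end of a word: final
print(sigma_kind(FILOSOFIA))    # sigma inside a word: regular
print(sigma_kind(FILOS + "."))  # word or abbreviation? the rule says final
```

The last line is the interesting one: the rule sees a word followed by a
period and picks final sigma, with no way of knowing whether an
abbreviation was meant.
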

> So we have an alternative to the character==codepoint approach.
> Is it a good alternative? I don't know.

It's one you can build over the codepoint approach. Using it as the
base level makes characters much more complicated, doesn't remove the
complications of strings, and forces you to smuggle in artifacts from the
codepoint level anyhow in order to get done what you need to get done.
For example, "A with circumflex and dot below" is the same DCG as
"A with dot below and circumflex", but "A with circumflex and grave"
is *not* the same as "A with grave and circumflex", because the first
has the grave atop the circumflex and the second vice versa. Vietnamese
typography insists on the first, never the second. Knowing this requires
understanding not merely what is a base and what is a modifier character,
but which canonical combining class each modifier character belongs to.
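
A short Python sketch of the point, using the standard unicodedata module.
The helper `same_after_nfd` is a made-up name; the comparison relies on
canonical decomposition (NFD), which reorders adjacent marks only when their
combining classes differ.

```python
import unicodedata

A = "A"
CIRCUMFLEX = "\u0302"  # combining circumflex accent (attaches above)
DOT_BELOW = "\u0323"   # combining dot below (attaches below)
GRAVE = "\u0300"       # combining grave accent (attaches above)

def same_after_nfd(x: str, y: str) -> bool:
    """True if the two strings are canonically equivalent."""
    return unicodedata.normalize("NFD", x) == unicodedata.normalize("NFD", y)

# Canonical combining class of each mark.
for mark in (CIRCUMFLEX, DOT_BELOW, GRAVE):
    print(f"U+{ord(mark):04X} ccc={unicodedata.combining(mark)}")

# Different classes (230 vs 220): order does not matter.
print(same_after_nfd(A + CIRCUMFLEX + DOT_BELOW, A + DOT_BELOW + CIRCUMFLEX))  # True

# Same class (230 and 230): order is significant.
print(same_after_nfd(A + CIRCUMFLEX + GRAVE, A + GRAVE + CIRCUMFLEX))  # False
```

So even a grapheme-level comparison has to reach down to a per-codepoint
property to decide whether two clusters are the same.
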
And then there's Korean, whose DCGs are the syllabic clumps in which Korean
is written in an attempt to make what is basically a simple alphabet work
well in a typographical tradition designed for squarish Chinese characters
set on a grid. I've been there. I don't recommend you follow me.
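
For the curious, a tiny Python sketch of one such clump. The syllable
chosen is an arbitrary example of mine; canonical decomposition splits it
into the alphabetic letters (jamo) it is built from.

```python
import unicodedata

syllable = "\ud55c"  # one precomposed Hangul syllable (U+D55C)
jamo = unicodedata.normalize("NFD", syllable)

# One codepoint as written, three alphabetic letters underneath.
print(len(syllable), len(jamo))  # 1 3
for ch in jamo:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Recomposition gives back the original single codepoint.
print(unicodedata.normalize("NFC", jamo) == syllable)  # True
```
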

> While I'm here, let me get a few personal opinions off my chest:

I 100% agree with all of these.

--
John Cowan          http://www.ccil.org/~cowan
I don't know half of you half as well as I should like, and I like less
than half of you half as well as you deserve.  --Bilbo