Suggested magic for "a" .. "b"

Aaron Sherman Fri, 16 Jul 2010 09:41:08 -0700

Oh bother, I wrote this up last night, but forgot to send it. Here y'all go:

I've been testing ".." recently, and in Rakudo it seems to behave as it does
in Perl 5. That is, the magic auto-increment for "a" .. "z" works very wonkily
for any range that falls outside some very strict constraints (identical
Unicode general category, increasing order, etc.). So the following:

"A" .. "z"

produces very odd results.

I'd like to suggest that we re-define this operator on strings as follows:

RESTRICTIONS:

First off, if either argument contains combining, modifying, undefined,
reserved or other codepoints which either cannot be treated as a single,
independent "character" or whose Unicode properties are not firmly
established in the Unicode specification, then an exception is immediately
raised. This must be done in order to ensure that each character index can
be compared to each corresponding character index without the typical
Unicode ambiguities. Ligatures and other decomposable sequences are treated
by their codepoint in the current encoding, only.

Treatment of strings whose encodings differ should be possible, as all
comparisons are performed on codepoints.

If either argument is zero length, an exception is raised.

If either argument is *, then it is assumed to stand for the largest
(RHS) or smallest (LHS) codepoint with the same Unicode general properties
as the opposite side (for each character index, if the other value is a
string of length > 1).
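
As a rough illustration of the codepoint and zero-length restrictions above,
here is a minimal Python sketch. The function name check_endpoint and the
exact set of rejected general categories are my assumptions about how the
wording might map onto Unicode categories; the "*" case is not handled.

```python
import unicodedata

# Assumed mapping from the wording above to general categories:
#   M*      combining marks
#   Lm      modifier letters
#   Cn      unassigned or reserved codepoints
#   Cs, Co  surrogates and private use (no firmly established properties)
REJECTED = {"Lm", "Cn", "Cs", "Co"}

def check_endpoint(s):
    """Return s unchanged, or raise ValueError if it cannot be a range endpoint."""
    if not s:
        raise ValueError("range endpoint must not be zero length")
    for ch in s:
        cat = unicodedata.category(ch)
        if cat.startswith("M") or cat in REJECTED:
            raise ValueError(
                f"U+{ord(ch):04X} ({cat}) is not allowed in a range endpoint"
            )
    return s
```

With this sketch, a string containing a combining accent (such as "e"
followed by U+0301) or the unassigned U+03A2 is rejected, while plain
"apple" passes through unchanged.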

ALGORITHM:

If both arguments are strings of non-zero length, ".." will first determine
which is the shorter. This length is the "significant length". Any
characters after this length in the longer sequence are ignored (return
value might be an unthrown exception in this case?)
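
The truncation to the "significant length" could be sketched in Python as
follows. The helper name paired_characters is hypothetical, and returning the
ignored tail alongside the pairs is only one possible way to surface the
truncation to the caller.

```python
def paired_characters(lhs, rhs):
    """Pair up characters by index, up to the length of the shorter string.

    Returns (pairs, ignored), where ignored is the tail of the longer
    string that plays no part in the range.
    """
    if not lhs or not rhs:
        raise ValueError("range endpoints must not be zero length")
    n = min(len(lhs), len(rhs))      # the "significant length"
    ignored = lhs[n:] or rhs[n:]     # at most one of these is non-empty
    return list(zip(lhs[:n], rhs[:n])), ignored
```

For "apple" and "orange" this gives the pairs (a, o), (p, r), (p, a),
(l, n), (e, g), with the trailing "e" of "orange" ignored.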

For all remaining characters, each character is considered with respect to
its correspondingly indexed character in the other string, and the following
algorithm is applied to determine the range that they represent (the LHS
character is referred to as "A" below, and the RHS as "B").

The binary Unicode general category properties of A and B are considered
from the set of major category classes:

L, M, N, P, S, Z, C

Thus the Lu property or Pe property would be considered. The total range
consists of all codepoints lying between the lower of the two codepoints and
the higher of the two, inclusive, which share the major and minor Unicode
general category of either A or B (if an endpoint has no minor subclass,
then codepoints without a minor subclass are considered with respect to that
endpoint). The ordering is determined by the ordering of A and B.

The range is then restricted to codepoints which share the same script as A
or B.

Thus, Latin "a" and Greek lowercase pi would define a range which included
all lower-case letters from the Latin and Greek scripts that fell between
their codepoints.
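
The single-character rule above might be sketched in Python like this. Note
the assumptions: Python's standard library does not expose the Unicode Script
property, so script_of is a hypothetical stand-in that guesses the script
from the first word of the character name, and KNOWN_SCRIPTS is a small
illustrative list, not a complete one.

```python
import unicodedata

# Assumption: approximate the Script property from the character name.
# Anything whose name does not start with a listed script is "Common".
KNOWN_SCRIPTS = {"LATIN", "GREEK", "CYRILLIC", "HEBREW", "ARABIC"}

def script_of(ch):
    first_word = unicodedata.name(ch, "").split(" ", 1)[0]
    return first_word if first_word in KNOWN_SCRIPTS else "Common"

def char_range(a, b):
    """Codepoints between a and b sharing the category and script of a or b."""
    categories = {unicodedata.category(a), unicodedata.category(b)}
    scripts = {script_of(a), script_of(b)}
    low, high = sorted((ord(a), ord(b)))
    members = [
        chr(cp)
        for cp in range(low, high + 1)
        if unicodedata.category(chr(cp)) in categories
        and script_of(chr(cp)) in scripts
    ]
    # The ordering follows the ordering of the two endpoints.
    return members if ord(a) <= ord(b) else members[::-1]
```

Under these assumptions, char_range("A", "z") yields the 52 ASCII letters
and skips the punctuation that sits between "Z" and "a".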

Having established this range for each correspondingly indexed letter, the
range for multi-character strings is defined by a left-significant counting
sequence. For example:

"Ab" .. "Be"

defines the ranges:

<A B> and <b c d e>

This results in a counting sequence (with the most significant character on
the left) as follows:

<Ab Ac Ad Ae Bb Bc Bd Be>
Currently, Rakudo produces this:

"Ab", "Ac", "Ad", "Ae", "Af", "Ag", "Ah", "Ai", "Aj", "Ak", "Al", "Am",
"An", "Ao", "Ap", "Aq", "Ar", "As", "At", "Au", "Av", "Aw", "Ax", "Ay",
"Az", "Ba", "Bb", "Bc", "Bd", "Be"

which I don't think is terribly useful.
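
For comparison, the left-significant counting sequence proposed above can be
sketched in Python. The per-index ranges are passed in already computed, and
the function name counting_sequence is hypothetical.

```python
from itertools import product

def counting_sequence(ranges):
    """Combine per-index ranges, with the leftmost index most significant.

    itertools.product varies the rightmost position fastest, which is
    the counting order described above.
    """
    return ["".join(chars) for chars in product(*ranges)]

# The per-index ranges for "Ab" .. "Be": "A" .. "B" and "b" .. "e".
print(counting_sequence([["A", "B"], ["b", "c", "d", "e"]]))
# → ['Ab', 'Ac', 'Ad', 'Ae', 'Bb', 'Bc', 'Bd', 'Be']
```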

Many useful results follow from this suggested change:

"C" .. "A" = <C B A> (Rakudo: <>)

"(" .. "}" = <( ) [ ] { }> (because open-paren is Ps and close-brace is Pe,
therefore all Ps and Pe codepoints in the range are included).

"Α" .. "Ω" = <Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ Ω> (notice that
codepoint U+03A2 is gracefully skipped, as it is undefined and thus has no
properties).

"apple" .. "orange" = the counting sequence defined by the ranges "a" ..
"o", "p" .. "r", "p" .. "a", "l" .. "n", "e" .. "g" (notice that the string
"orang" will be part of the result set, but "orange" will not.)

In addition:

One alternative to truncation of strings of differing lengths is to extend
the sequence. For example, if we ask for "a" .. "bc", then we might produce
<a b ac bc>, where the extension is the original range plus the same range
with the extra characters of the longer string appended to each element.
This might even be iterated for every additional codepoint in the longer
string. For example: "a" .. "bcd" = <a b ac bc acd bcd>
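
A small Python sketch of this extension idea, assuming the base range for the
shared indexes has already been computed. The function name extended_range
and its arguments are hypothetical.

```python
def extended_range(base, extra):
    """Extend a base range with the trailing characters of the longer string.

    base  -- the range over the significant length, e.g. ["a", "b"]
    extra -- the characters of the longer string past that length, e.g. "cd"
    """
    result = list(base)
    # One additional pass per extra codepoint, each with a longer suffix.
    for k in range(1, len(extra) + 1):
        result.extend(item + extra[:k] for item in base)
    return result

print(extended_range(["a", "b"], "c"))   # → ['a', 'b', 'ac', 'bc']
print(extended_range(["a", "b"], "cd"))  # → ['a', 'b', 'ac', 'bc', 'acd', 'bcd']
```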

"..." could have similar semantics. In the case of A, B ... C, for length 1
strings, the range A .. B is simply projected forward until x ge C (if
A..B is increasing, le otherwise). C's properties probably should not be
considered at all. In the case of length > 1 strings each character index is
projected forward independently until any one character index is ge the
corresponding index in the terminator, and there is no "counting":

"AAA", "BCD" ... "GGG" = <AAA BCD CEG>

If any index in the sequence does not increment (e.g. "AA", "AB" ... "ZZ")
then there is an implication that counting is required. You should be able,
in this case, to imply incrementing the left or right side as most
significant (e.g. "AA", "BA" ... "ZZ" is also valid). It is, however, an
error to try to increment indexes in any other ordering (e.g. "AAA", "ABA"
... "ZZZ"). Once a counting sequence has been established, lookahead must be
employed to determine the extent of the range (e.g. "A", "B" can continue
through all "Latin" Lu codepoints, so in order to know when to cycle, you
must determine how many codepoints lie in the full range). This implies that
length > 1 strings in "..." operations which imply a counting sequence are
not strictly evaluated lazily, though some laziness may still be employed.
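
The simpler, non-counting projection described earlier for "AAA", "BCD" ...
"GGG" can be sketched in Python as below. This is only an illustration under
several assumptions: it steps by raw codepoint difference rather than through
the category-filtered range, it handles increasing sequences only, it expects
all three strings to have the same length, and it stops after emitting the
first term in which any index reaches or passes the terminator. The name
project is hypothetical.

```python
def project(first, second, last):
    """Project each character index forward independently (no counting)."""
    steps = [ord(b) - ord(a) for a, b in zip(first, second)]
    if any(step <= 0 for step in steps):
        # Non-incrementing indexes imply counting, which this sketch omits.
        raise ValueError("every index must increase in this sketch")
    terms = []
    current = first
    while True:
        terms.append(current)
        # Stop once any one index is ge the corresponding terminator index.
        if any(c >= t for c, t in zip(current, last)):
            break
        current = "".join(chr(ord(c) + s) for c, s in zip(current, steps))
    return terms

print(project("AAA", "BCD", "GGG"))  # → ['AAA', 'BCD', 'CEG']
```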

--
Aaron Sherman
Email or GTalk: a...@ajs.com
http://www.ajs.com/~ajs
