Re: Suggested magic for "a" .. "b"

Aaron Sherman Tue, 20 Jul 2010 16:31:29 -0700

This is a long reply, but I read it over a few times, and I don't see any
fat to trim. This isn't a simple issue for which intuition is going to be a
sufficient guide, though I fully agree that being intuitive needs to be at
or near the top of the list.

On Sun, Jul 18, 2010 at 6:26 AM, Moritz Lenz <mor...@faui2k3.org> wrote:

> In general, stuffing more complex behaviour into something that feels
> unintuitive is rarely (if ever) a good solution.

Walk with me a bit, and let's explore the concept of intuitive character
ranges. This was my suggestion, which seems pretty basic to me:

"x .. y", for all strings x and y which are each composed of a single, valid
codepoint which is neither combining nor modifying, yields the range of all
valid, non-combining/modifying codepoints between x and y, inclusive, which
share the Unicode script, general category major property and general
category minor property of either x or y (lack of a minor property is a
valid value).

In general we have four problems with the current specification and
implementation on the Perl 6 and Perl 5 sides:

1) Perl 5 and Rakudo have a fundamental difference of opinion about what
some ranges produce ("A" .. "z", "X" .. "T", etc) and yet we've never really
articulated why we want that.

2) We deny that a range whose LHS is "larger" than its RHS makes sense, but
we also don't provide an easy way to construct such ranges lazily otherwise.
That would merely be annoying, except that we have declared ranges to be the
right way to construct basic loops. So for (1..1e10).reverse -> $i {...} is
not lazy (it blows up your machine) and feels awfully clunky next to
for 1e10..1 -> $i {...}, which would not blow up your machine, or even make
it break a sweat, if it worked.

3) We've never had a clear-cut goal in allowing string ranges (as opposed to
character ranges, which Perl 5 and 6 both muddy a bit), so "intuitive"
becomes sketchy at best past the first grapheme, and even muddier when only
considering codepoints (thus that wing of my proposal and the current
behavior are on much shakier ground, except insofar as it asserts that we
might want to think about it more).

4) Many ranges involving single characters on LHS and RHS result in null
or infinite output, which is deeply non-intuitive to me and, I expect, to
many others.

Solve those (and I tried in my suggestion) and I think you will be able to
apply intuition to character ranges, but only in so far as a human being is
likely to be able to intuit anything related to Unicode.

> The current behaviour of the range operator is (if I recall correctly):
>
> 1) if both sides are single characters, make a range by incrementing
> codepoints

Sadly, you can't do that reasonably. Here are some examples of why, using
only Latin and Greek as examples (not the most convoluted Unicode sections
to be sure):

   - "Α" (capital Greek alpha, not Latin A) .. "Ω" would result in a range
   that contains an invalid codepoint (rakudo: drops the invalid codepoint,
   which you may have meant to imply, but I'm being pedantic because I want to
   come to a specification, not just a sense of the right solution)
   - "Ā" .. "Ē" would be "ĀāĂăĄąĆćĈĉĊċČčĎďĐđĒ" which is really not what
   you're likely to expect! (rakudo: Ā, infinitely repeating, which is an even
   larger problem for Katakana, where "オ" .. "ヺ" seems a very intuitive way to
   say "all Katakana non-cased letters" but fails because the range contains
   both cased and uncased; Perl 5 just prints "オ", and I think it also sneers
   at you)
   - "A" .. "z" comes out really odd because it contains punctuation (mind
   you, your suggestion is saner than Rakudo's current behavior on "A" .. "z"
   which is an infinite progression of capital-letter-only sequences of 1 or
   more characters! Intuitive, it's not.)
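To make the three cases above concrete, here is a minimal sketch of plain codepoint incrementing. It is written in Python rather than Perl 6 purely for illustration; `naive_range` is a hypothetical helper, not anything from Rakudo or Perl 5, and the exact set of unassigned codepoints depends on the Unicode version behind Python's `unicodedata` module.

```python
import unicodedata

def naive_range(first, last):
    """Every codepoint from first to last, inclusive, with no filtering."""
    return [chr(cp) for cp in range(ord(first), ord(last) + 1)]

# Greek capital alpha .. capital omega: look for unassigned codepoints (Cn).
greek = naive_range("\u0391", "\u03a9")
unassigned = ["U+%04X" % ord(c) for c in greek if unicodedata.category(c) == "Cn"]

# Latin Extended-A: upper and lower case alternate codepoint by codepoint.
latin_ext = "".join(naive_range("\u0100", "\u0112"))

# ASCII "A" .. "z": collect everything in the span that is not a letter.
punctuation = [c for c in naive_range("A", "z") if not c.isalpha()]

print(unassigned)   # ['U+03A2']
print(latin_ext)    # ĀāĂăĄąĆćĈĉĊċČčĎďĐđĒ
print(punctuation)  # ['[', '\\', ']', '^', '_', '`']
```

The hole in the Greek block, the interleaved case pairs, and the six ASCII symbols between "Z" and "a" are exactly the surprises described in the list above.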

My point was that, if you want simple and intuitive out of Unicode, you're
kind of screwed. The closest you can get is to build your range using
properties and script. The way I suggested doing that was the simplest I
could think of. Speak up if you have a simpler one.

For most simple ranges, our results will be identical (e.g. "A" .. "Q").

For the above examples, I would end up producing:

1: Alpha through Omega greek capital letters
2: ĀĂĄĆĈĊČĎĐĒ
(and オカガキギクグケゲコゴサザシジスズセゼソゾタダチヂツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモヤユヨラリルレロワヰヱヲンヴヷヸヹヺ
for the Katakana)
3: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

That seems pretty darned intuitive to me. Mind you, "A" .. "ž" is still ugly
as sin in terms of ordering once you listify, and I can't reasonably fix
that without re-defining Unicode or having a really, really convoluted and
special-case rule, but without getting convoluted, even that ugly example
does something useful and, I dare say, intuitive for testing membership.

Here's the pseudo-code for my suggestion:

  class SingleCharAlphaRange {
     has $.start;
     has $.end;
     # Verify that this is a single-character string which is valid,
     # non-combining/non-modifying, and represented by one and only
     # one codepoint.
     method valid(Str $s --> Bool) {
        # Assert that this is a valid Unicode, 1-codepoint string which
        # is not a combining or modifying codepoint....
     }
     # Is $s in this range?
     method in-range(Str $s --> Bool) {
        return fail() unless self.valid($s);     # "abc" ~~ "a" .. "z"
        return True if self.start eq $s or self.end eq $s;
                                                 # "a" ~~ "a" .. "z"
        return False if $s.ord < self.start.ord; # "a" ~~ "b" .. "z"
        return False if $s.ord > self.end.ord;   # "z" ~~ "a" .. "y"
        my @props-a = self.props_start; # script and properties for $.start
        my @props-b = self.props_end;   # ' ' $.end
        my @props-s = self.props($s);   # ' ' $s
        if @props-a ~~ @props-s or @props-b ~~ @props-s {
           return True;                          # "b" ~~ "a" .. "z"
        }
        return False;                            # "_" ~~ "A" .. "z"
     }
     ...
     method list() {
       gather do {
         for self.start.ord .. self.end.ord -> $i {
           take chr($i) if self.in-range(chr($i));
         }
       }
     }
  }
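For anyone who wants to poke at the rule, the pseudo-code can be approximated in runnable form. This is only a sketch under stated assumptions, in Python rather than Perl 6: Python's standard `unicodedata` module exposes the general category but not the script property, so the script check is left out entirely; "modifying" is read here as the categories Lm and Sk, which is my interpretation; and invalid input simply returns False instead of failing. The names `is_valid_char`, `in_range` and `char_range` are hypothetical.

```python
import string
import unicodedata

def is_valid_char(s):
    """True for one assigned codepoint that is not combining or modifying."""
    if len(s) != 1:
        return False
    category = unicodedata.category(s)
    # Cn = unassigned, Cs = surrogate, M* = combining marks.
    # Lm/Sk (modifier letters/symbols) is an assumed reading of "modifying".
    if category in ("Cn", "Cs", "Lm", "Sk"):
        return False
    return not category.startswith("M")

def in_range(start, end, s):
    """Membership test: between the endpoints and sharing a category with one."""
    if not is_valid_char(s):
        return False
    if not ord(start) <= ord(s) <= ord(end):
        return False
    # ASSUMPTION: only the general category is compared; script is omitted.
    wanted = {unicodedata.category(start), unicodedata.category(end)}
    return unicodedata.category(s) in wanted

def char_range(start, end):
    """Lazily yield the members of start .. end in codepoint order."""
    for cp in range(ord(start), ord(end) + 1):
        c = chr(cp)
        if in_range(start, end, c):
            yield c

print("".join(char_range("\u0100", "\u0112")))  # ĀĂĄĆĈĊČĎĐĒ
print("".join(char_range("A", "z")) ==
      string.ascii_uppercase + string.ascii_lowercase)  # True
print(in_range("A", "z", "_"))  # False
```

Because the script check is missing, this sketch would happily cross script boundaries wherever two scripts share a general category, so it only approximates the proposal for ranges that stay inside one script.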

> 2) otherwise, call .succ on the LHS. Stop before the generated values
> exceed the RHS.

Isn't that what you do when you try to listify a range? A range doesn't do
any of that unless you try to walk it. What happens when you ask if a value
is in the range "AA" .. "zz"? Do you iterate through every possible value
and then return false if nothing matched?

> I'm not convinced it should be any more complicated than that.

If you have an idea that makes it simpler, I'm all ears. But I don't see
anything that makes it simpler than my suggestion FOR THE USER. For us, it
can be absurdly complex, but I would hope we could keep it simple for the
user.

PS: Notice that 5..1 would have to be 5,4,3,2,1 for this proposal to really
make sense, which I believe it needs to. After all, currently
(1..1e10).reverse will blow up, and there's no really good way around that.
It would be much simpler to just be able to say "1e10 .. 1", and I can't
think of a reason not to allow it that doesn't boil down to "people expect
that to fail because of Perl 5."
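The point about laziness can be illustrated with a small sketch, again in Python rather than Perl 6 and only as an analogy: a descending range produced one value at a time never needs the whole sequence in memory, whereas building the ascending list and reversing it does. `countdown` is a hypothetical helper name.

```python
from itertools import islice

def countdown(high, low):
    """Yield high, high - 1, ..., low, one value at a time."""
    current = high
    while current >= low:
        yield current
        current -= 1

# Only the first five values are ever computed; nothing is materialized.
first_five = list(islice(countdown(10**10, 1), 5))
print(first_five)
# [10000000000, 9999999999, 9999999998, 9999999997, 9999999996]

print(list(countdown(5, 1)))  # [5, 4, 3, 2, 1]
```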

PPS: Other unexpected results in Rakudo, all stemming from the same
behavior: when Rakudo doesn't think the endpoints form a legitimate range,
it repeats the LHS infinitely:

"䷀" .. "䷿" - expected: all hexagram characters; got: first character,
infinitely repeating.
"鐀" .. "鐅" - expected: all CJK Unified Ideographs between U+9400 and U+9405;
got: first character, infinitely repeating.
"٠" .. "٩" - expected: all Arabic-Indic digits zero through nine; got: first
digit (zero) repeating (note: bidi may confuse display in this email)
"א" .. "ת" - expected: all Hebrew letters; got: first character (א)
repeating (note: bidi may confuse display in this email)
"Ａ" .. "Ｅ" - expected: all full width, capital letters A through E; got:
full width A repeating.

-- 
Aaron Sherman
Email or GTalk: a...@ajs.com
http://www.ajs.com/~ajs
