Re: S28ish [was: [Pugs] A couple of string interpolation edge cases]

Larry Wall Sat, 26 Mar 2005 19:29:29 -0800

On Sat, Mar 26, 2005 at 02:37:24PM -0600, Rod Adams wrote:
: Larry Wall wrote:
: 
: >%+ and %- are gone.  $0, $1, $2,  etc. are all objects that know
: >where they .start and .end.  (Mind you, those methods return magical
: >positions that are Unicode level independent.)
: >
: How can you have a level independent position?


By not confusing positions with numbers.  They're just pointers into
a particular string.

: The matching itself happens at a specified level. (Note that which level 
: the match happens at can change what is matched.) So it makes sense that 
: all the positions that come out of it are in terms of that level.

When we're dealing with mostly variable length encodings, it makes
more sense that the positions come out as string pointers that only
convert to numbers grudgingly under duress.  If you're just going to
feed a position back into a substr() or as the start position of the
next index(), there's no reason to translate it to a number and back
to a pointer.  It's a lot more efficient if you don't.

: Now, that position can be translated to a lower level, but not to an 
: upper level, since you can happily land in the middle of a char.

I talked about this problem in one of the As.  I think the fail soft
approach is to round to the next "ceiling" boundary and issue a warning.

: This is part of what I'm having trouble with your concept of a Str being 
: at several levels at once: There's no reliable way to have a notion of 
: "position", except to have it as attached to the highest possible level, 
: and the second someone does something at lower level, you void the 
: position, and possibly the ability to remain at that high level.

A position that is a pointer can be true for all levels simultaneously.
It has the additional benefit of a type that is subtype constrained
to operate with other values from the same string, so if you subtract
two pointers from different strings, you can actually detect the error.
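
To make that concrete, here is a minimal sketch in Python rather than
Perl 6.  The class name StrPos and its methods are invented for
illustration and are not part of any spec.  The position remembers its
owning string, turns into a number only when asked, and refuses
arithmetic with a position from some other string:

```python
class StrPos:
    """A position bound to one particular string (hypothetical name)."""

    def __init__(self, owner: str, code_index: int):
        if not 0 <= code_index <= len(owner):
            raise IndexError("position lies outside the string")
        self._owner = owner
        self._code_index = code_index

    def code_offset(self) -> int:
        """The position as a count of code points from the start."""
        return self._code_index

    def byte_offset(self, encoding: str = "utf-8") -> int:
        """The position as a count of encoded bytes from the start."""
        return len(self._owner[:self._code_index].encode(encoding))

    def __sub__(self, other: "StrPos") -> int:
        # Positions are only comparable within the same string object.
        if other._owner is not self._owner:
            raise TypeError("positions belong to different strings")
        return self._code_index - other._code_index
```

The sketch stores a code point index internally only because that is
what Python strings offer; the point is that callers never see it unless
they ask for a number at a specific level.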

: I still see my notion of a Str having only one level and encoding at a 
: time as being preferable. Having the ability to recast a string to other 
: levels/encoding should be easy, and many builtins should do that 
: recasting for you.

And I still see that you can have your view if you install a pragma
that forces all incoming strings to a single level.  But I think we
can do that lazily, or not at all, in many cases.

The basic underlying problem is that there is no simple mapping from
math to Unicode.  The language that lets people express their solution
in terms of Unicode instead of in terms of math is going to have a leg
up on the future, at least in the Unicode problem space.  Strings were
never arrays in Perl, and they're only getting further apart as the
world makes greater demands on strings to represent human language.

So I'd much rather introduce an abstraction like "string position"
now that is not a number.  It's a dimensional value, where the scaling
of the dimensionality is bound to a particular string.  You can have
a pragma that says, "Untyped numbers are assumed to be meters, kilograms,
and seconds", and a different lexical scope might have a pragma that
says "Untyped numbers are assumed to be centimeters, grams, and seconds."
These scopes can get along as long as they don't try to exchange untyped
integers.  Or if they do, they have some way of ascertaining what an
untyped integer meant when it was generated.
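
As a rough illustration in Python (the Length class, the tag() helper
and the scale table are all made up for this sketch), two scopes with
different default units can exchange values safely once each number
carries the unit its scope assumed:

```python
from dataclasses import dataclass

# Scale of each unit in centimetres; a hypothetical table for the sketch.
IN_CENTIMETRES = {"m": 100, "cm": 1}


@dataclass(frozen=True)
class Length:
    value: float
    unit: str

    def to(self, unit: str) -> "Length":
        """Re-express this length in another unit."""
        scaled = self.value * IN_CENTIMETRES[self.unit] / IN_CENTIMETRES[unit]
        return Length(scaled, unit)


def tag(number: float, scope_default_unit: str) -> Length:
    """What a scope's pragma would do implicitly to an untyped number."""
    return Length(number, scope_default_unit)
```

A value tagged in the "m" scope can then be handed to the "cm" scope via
tag(5, "m").to("cm") without either side guessing what the bare 5 meant.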

: I do _not_ see $/ & friends getting ported across a recasting. .pos can 
: be translated if new level <= old level, otherwise gets set to undef.

The interesting thing about a pointer is that you can pass it through
a higher level transparently as long as you don't actually try to
use it.  But if you do try to use it, I think undef is overkill.
Just as a float stuffed into an int truncates, we should just pick
a direction to find the next boundary and go from there, maybe with
a loss of precision warning.  The right way to suppress the warning
would be to install an explicit function that rounds up or down.
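
Here is one way that could look, sketched in Python for UTF-8 bytes
versus code points only.  The function names are invented, and the
sketch assumes the bytes are valid UTF-8.  The implicit path rounds up
and warns; the explicit rounding functions stay silent:

```python
import warnings


def _is_continuation(byte: int) -> bool:
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (byte & 0xC0) == 0x80


def ceil_boundary(data: bytes, offset: int) -> int:
    """Explicitly round a byte offset up to a code point boundary."""
    while offset < len(data) and _is_continuation(data[offset]):
        offset += 1
    return offset


def floor_boundary(data: bytes, offset: int) -> int:
    """Explicitly round a byte offset down to a code point boundary."""
    while 0 < offset < len(data) and _is_continuation(data[offset]):
        offset -= 1
    return offset


def use_position(data: bytes, offset: int) -> int:
    """Implicit use of a position: round up, and warn if it moved."""
    snapped = ceil_boundary(data, offset)
    if snapped != offset:
        warnings.warn("position rounded up to a code point boundary",
                      stacklevel=2)
    return snapped
```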

: Please convince me your view works in practice. I'm not seeing it work 
: well when I attempt to define the relevant parts of S29. But I might 
: just be dense on this.

Well, let's work through an example.

    multi method substr(Str $s: Ptr $start, PtrDiff ?$len, Str ?$repl)

Depending on the typology of Ptr and PtrDiff, we can either coerce
various dimensionalities into an appropriate Ptr and PtrDiff type
within those classes, or we could rely on MMD to dispatch to a suite
of substr implementations with more explicit classes.  Interestingly,
since Ptrs aren't integers, we might also allow

    multi method substr(Str $s: Ptr $start, Ptr ?$end, Str ?$repl)

which might be a more natural way to deal with variable length encodings,
and we just leave the "lengthy" version in there for old times' sake.

We could go as far as to allow a range as the second argument:

    $x = substr($a, $start..^$end);

or its evil twin:

    $x = $a[$start..^$end];

Of course, with the evil twin notation you lose the "repl" facility.

But in fact, we probably can't actually allow both of

    multi method substr(Str $s: Ptr $start, PtrDiff ?$len, Str ?$repl)
    multi method substr(Str $s: Ptr $start, Ptr ?$end, Str ?$repl)

since MMD might well tie on who to dispatch

    substr($a, 5, 10)

to, unless we forced it to Perl 5 interpretation in case of tie.  So I'm
guessing we put in

    $x = $a[$start..^$end];

for the non-destructive slicing of a string, and leave substr() with
Perl 5 semantics, in which case it's just a SMOP to coerce the user's

    substr($a, 5, 10);

to something that effectively means

    substr($a, Ptr.new($a, 5, $?UNI_LEVEL), PtrDiff.new(10, $?UNI_LEVEL));

Actually, in this case, I expect we're actually calling into

    multi method substr(Str $s: PtrDiff $start, PtrDiff ?$len, Str ?$repl)

where $start will be counted from the beginning of the string, so the
call is effectively

    substr($a, PtrDiff.new(5, $?UNI_LEVEL), PtrDiff.new(10, $?UNI_LEVEL));

Okay, that looks scary, but if as in my previous message we define
"chars" as the highest Unicode level allowed by the context and the
string, then we can just write that in some notation resembling:

    substr($a, 5`Chars, 10`Chars);

or whatever notation we end up with for labeling units on numbers.
Even if we don't define "chars" that way, they just end up labeled with
the current level (here assuming "Codes"):

    substr($a, 5`Codes, 10`Codes);

or whatever.

But this is all implicit, which is why you can just write

    substr($a, 5, 10);

and have it DWYM.
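
A rough Python sketch of that coercion follows.  Count and DEFAULT_LEVEL
are stand-ins invented here for PtrDiff and $?UNI_LEVEL, and only the
"Bytes" and "Codes" levels are handled.  A bare integer silently picks
up the current level, while a labeled count is converted as needed:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Count:
    n: int
    level: str  # "Bytes" or "Codes" in this sketch


DEFAULT_LEVEL = "Codes"  # stand-in for $?UNI_LEVEL


def _to_code_index(s: str, count: Count, origin: int = 0) -> int:
    """Turn a count measured from code index `origin` into a code index."""
    if count.level == "Codes":
        return origin + count.n
    if count.level == "Bytes":
        prefix = s[origin:].encode("utf-8")[:count.n]
        # decode() raises UnicodeDecodeError if the count ends mid-character.
        return origin + len(prefix.decode("utf-8"))
    raise ValueError(f"unknown level: {count.level}")


def substr(s: str, start, length) -> str:
    # Bare integers are labeled with the current default level.
    if isinstance(start, int):
        start = Count(start, DEFAULT_LEVEL)
    if isinstance(length, int):
        length = Count(length, DEFAULT_LEVEL)
    begin = _to_code_index(s, start)
    end = _to_code_index(s, length, origin=begin)
    return s[begin:end]
```

This sketch fails hard on a count that lands inside a character; the
fail-soft rounding discussed earlier would replace that exception.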

Now, I admit that I've handwaved the tricksy bit, which is, "How do
you know, Larry, that substr() wants 5`Codes rather than 5`Meters?
It's all very well if you have a single predeclared subroutine and
can look at the signature at compile time, but you wrote those as multi
methods up above, so we don't know the signature at compile time."

Well, that's correct, we don't know it at compile time.  But what
*do* we know?  We know we have a number, and that it was generated
in a context where, if you use it like a string position, it should
turn into a number of code points, and if you use it like a weight,
it should turn into a number of kilograms (or pounds, if you're NASA).

In other words, the effective type of that literal 5 is not "Int",
but "Int|Codes|Meters|Kilograms|Seconds|Bogomips" or some such.
And if MMD can handle that in an argument type and match up Codes as
a subtype of Ptr, and if we write our method signature only in the
abstract types like Ptr, we're pretty much home free.  That certainly
simplifies how you write S29, though I don't know if the MMD folks
will be terribly happy with the notion of dispatching arguments with
junctional types.  But it does fall out of the design rather naturally.
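
To show the shape of the idea, here is a toy model in Python, which has
neither MMD nor junctions, so this only imitates the lookup.  Literal,
resolve() and the SCOPE_UNITS table are all hypothetical.  The literal
carries every reading its scope allows, and each routine asks only for
the abstract type it declared:

```python
class Literal:
    """A bare number that remembers every unit its scope would allow."""

    def __init__(self, value, units):
        self.value = value
        self.units = units  # abstract type name -> concrete unit name


def resolve(arg: Literal, abstract_type: str):
    """Pick the reading that matches the type a parameter asks for."""
    try:
        return (arg.value, arg.units[abstract_type])
    except KeyError:
        raise TypeError(
            f"literal has no reading as a {abstract_type}") from None


# What a lexical scope's pragmas might establish (assumed for the sketch).
SCOPE_UNITS = {"Ptr": "Codes", "Length": "Meters", "Mass": "Kilograms"}


def substr_start(start: Literal):
    # The signature is written only in terms of the abstract type "Ptr".
    return resolve(start, "Ptr")


def weigh(mass: Literal):
    return resolve(mass, "Mass")
```

The same Literal(5, SCOPE_UNITS) comes out as 5 Codes in substr_start()
and 5 Kilograms in weigh(), and fails cleanly where no reading exists.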

I know that offhand this seems like a lot of needless complication,
and it's a little bit hard to explain, but I believe it's very close
to how we actually use context in real language, so I think people
will find it intuitively obvious once they start using it.  More to
the point, it will produce very *clean* code, uncluttered with most of
the crufty conversions and coercions you find in singly typed languages
that compiler writers tend to love, and programmers tend to hate.

And you can still put in all that cruft if you want to.  You can even
force yourself to have to do it.  But to me, it feels a bit like slavery,
so I'm still looking for a land flowing with milk and honey, even if
there are a few giants in it.

Er, sorry for waxing poetic.

Larry
