Larry Wall wrote:

On Sat, Mar 26, 2005 at 02:37:24PM -0600, Rod Adams wrote:

: Please convince me your view works in practice. I'm not seeing it work
: well when I attempt to define the relevant parts of S29. But I might
: just be dense on this.

Well, let's work through an example.

   multi method substr(Str $s: Ptr $start, PtrDiff ?$len, Str ?$repl)

Depending on the typology of Ptr and PtrDiff, we can either coerce
various dimensionalities into an appropriate Ptr and PtrDiff type
within those classes, or we could rely on MMD to dispatch to a suite
of substr implementations with more explicit classes.  Interestingly,
since Ptrs aren't integers, we might also allow

   multi method substr(Str $s: Ptr $start, Ptr ?$end, Str ?$repl)

which might be a more natural way to deal with variable length encodings,
and we just leave the "lengthy" version in there for old times' sake.
...snip...
for the non-destructive slicing of a string, and leave substr() with
Perl 5 semantics, in which case it's just a SMOP to coerce the user's

   substr($a, 5, 10);

to something that effectively means

   substr($a, Ptr.new($a, 5, $?UNI_LEVEL), PtrDiff.new(10, $?UNI_LEVEL));

Actually, in this case I expect we're calling into

   multi method substr(Str $s: PtrDiff $start, PtrDiff ?$len, Str ?$repl)

where $start will be counted from the beginning of the string, so the
call is effectively

   substr($a, PtrDiff.new(5, $?UNI_LEVEL), PtrDiff.new(10, $?UNI_LEVEL));

Okay, that looks scary, but if as in my previous message we define
"chars" as the highest Unicode level allowed by the context and the
string, then we can just write that in some notation resembling:

   substr($a, 5`Chars, 10`Chars);

or whatever notation we end up with for labeling units on numbers.
Even if we don't define "chars" that way, they just end up labeled with
the current level (here assuming "Codes"):

   substr($a, 5`Codes, 10`Codes);

or whatever.

But this is all implicit, which is why you can just write

   substr($a, 5, 10);

and have it DWYM.
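To make that DWIM step concrete, here is a rough Python sketch of the implicit unit labeling being described. All names here are invented for illustration (`Units` stands in for a unit-labeled number, `UNI_LEVEL` for `$?UNI_LEVEL`); this is a sketch of the idea, not a proposed implementation.

```python
from dataclasses import dataclass

# Stand-in for the current Unicode level ($?UNI_LEVEL in the message).
UNI_LEVEL = "Codes"

@dataclass
class Units:
    """A number labeled with a unit, e.g. the moral equivalent of 5`Codes."""
    n: int
    units: str

def coerce_pos(x):
    """DWIM coercion: a bare number used in a string-position context
    picks up the current Unicode level as its unit."""
    if isinstance(x, Units):
        return x
    return Units(x, UNI_LEVEL)        # plain 5 becomes 5`Codes

def substr(s, start, length):
    start, length = coerce_pos(start), coerce_pos(length)
    if start.units != UNI_LEVEL or length.units != UNI_LEVEL:
        raise TypeError("unit mismatch: expected " + UNI_LEVEL)
    return s[start.n : start.n + length.n]
```

Under this sketch, `substr($a, 5, 10)` and `substr($a, 5`Codes, 10`Codes)` take the same path, while a stray `5`Meters` is rejected at the unit check instead of being silently treated as a position.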


I see some danger here. In particular, there is a huge difference between a Ptr (position) and a PtrDiff (length). I'm going to rename these classes StrPos and StrLen for the time being.

A StrPos can have multiple char units associated with it, and has the ability to morph between them. However, it is also strictly bound to a given string.

A StrLen can only have one char unit associated with it, since there is no binding string and anchors with which to reliably map how many code points there are to so many lchars.

I see the following operations being possible at a logical level:

StrPos = StrPos + StrLen
StrLen = StrPos - StrPos # must specify units (else implied), and must be same base Str
StrLen = StrLen + StrLen # if same units.
StrLen = StrLen + Int
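The four operations above can be sketched with Python operator overloading. The StrPos/StrLen names come from this message; everything else (the fields, the default "Codes" unit) is assumed for illustration.

```python
class StrLen:
    """A length in some char unit; not bound to any string."""
    def __init__(self, n, units="Codes"):
        self.n, self.units = n, units

    def __add__(self, other):
        if isinstance(other, StrLen):            # StrLen = StrLen + StrLen
            if other.units != self.units:        # ... only if same units
                raise TypeError("StrLen + StrLen requires matching units")
            return StrLen(self.n + other.n, self.units)
        if isinstance(other, int):               # StrLen = StrLen + Int
            return StrLen(self.n + other, self.units)
        return NotImplemented

class StrPos:
    """A position strictly bound to a base string."""
    def __init__(self, base, index):
        self.base, self.index = base, index

    def __add__(self, length):                   # StrPos = StrPos + StrLen
        if isinstance(length, StrLen):
            return StrPos(self.base, self.index + length.n)
        return NotImplemented

    def __sub__(self, other):                    # StrLen = StrPos - StrPos
        if isinstance(other, StrPos):
            if other.base is not self.base:      # must share the base Str
                raise TypeError("positions must share the same base Str")
            return StrLen(self.index - other.index)
        return NotImplemented
```

Note that StrPos deliberately has no way to become a bare number; you can only get a StrLen out of it by subtracting another StrPos on the same string.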
So I see the following cases of Substr happening:


 multi sub substr(Str $s, StrPos $start  : StrPos ?$end,     ?$replace)

Where $start and $end must be anchored to $s

 multi sub substr(Str $s, StrPos $start,   StrLen $length  : ?$replace)

Same restriction on $start,

 multi sub substr(Str $s, StrLen $offset : StrLen ?$length,  ?$replace)

Where $offset gets used as C<$s.start + $offset> and kicked over to case #2.
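The three cases, including the reduction of case #3 to case #2, can be sketched as a single Python dispatcher. The StrPos/StrLen stand-ins here are deliberately minimal (units omitted), and everything beyond the names from this message is assumed.

```python
from dataclasses import dataclass

@dataclass
class StrPos:
    base: str      # the Str this position is anchored to
    index: int

@dataclass
class StrLen:
    n: int

def substr(s, a, b=None):
    # Case 1: substr(Str, StrPos, StrPos) -- both anchored to $s
    if isinstance(a, StrPos) and isinstance(b, StrPos):
        if a.base is not s or b.base is not s:
            raise TypeError("$start and $end must be anchored to $s")
        return s[a.index : b.index]
    # Case 2: substr(Str, StrPos, StrLen) -- $length optional
    if isinstance(a, StrPos) and (isinstance(b, StrLen) or b is None):
        if a.base is not s:
            raise TypeError("$start must be anchored to $s")
        end = len(s) if b is None else a.index + b.n
        return s[a.index : end]
    # Case 3: substr(Str, StrLen, StrLen) -- $offset from $s.start,
    # kicked over to case 2
    if isinstance(a, StrLen):
        return substr(s, StrPos(s, a.n), b)
    raise TypeError("no applicable substr candidate")
```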

Hmm. Okay, that's not dangerous, just a lot to look at.


What gets dangerous is letting users think of a StrPos as a number, since it's not. Only StrLens get to pretend to be numbers. StrPos should have some nifty methods to return StrLens relative to its base Str's .start, and those StrLens can look like a number, but the StrPos never gets to look like a number.


Make it so StrLen "does Int", and there's a C«coerce:<as>(Int,StrLen)» whose default units are your "Chars" (the highest level supported by the string it's applied to), and I think we're getting somewhere.

We need to define what happens to a StrPos when its base Str goes away. Having it assume some nifty flavor of undef would do the trick. This implies that a Str knows all the StrPos's hanging off it, so the destructor can undef them. But that shouldn't pose a problem for p6c.
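One way to get that behavior without the Str having to track every position hanging off it is to invert the dependency: the position holds only a weak reference to its base, so it reads as undef (None here) once the base is collected. A Python sketch, with all names illustrative:

```python
import weakref

class Str:
    """Trivial stand-in for a garbage-collectable string object."""
    def __init__(self, text):
        self.text = text

class StrPos:
    def __init__(self, base, index):
        self._base = weakref.ref(base)   # does not keep the base Str alive
        self.index = index

    @property
    def base(self):
        """The base Str, or None ("some nifty flavor of undef") once it's gone."""
        return self._base()
```

(In CPython the weak reference dies as soon as the last strong reference does; a runtime with deferred collection would need the destructor-driven scheme described above instead.)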

Now, I admit that I've handwaved the tricksy bit, which is, "How do
you know, Larry, that substr() wants 5`Codes rather than 5`Meters?
It's all very well if you have a single predeclared subroutine and
can look at the signature at compile time, but you wrote those as multi
methods up above, so we don't know the signature at compile time."

Well, that's correct, we don't know it at compile time. But what
*do* we know? We know we have a number, and that it was generated
in a context where, if you use it like a string position, it should
turn into a number of code points, and if you use it like a weight,
it should turn into a number of kilograms (or pounds, if you're NASA).


I don't see the need for all this. Make a C«coerce:<as>(Int,StrLen)» as mentioned above, and the MMD should be able to figure out that it can take the Int peg and hammer it into the StrLen hole. Then leave it up to the coerce sub to complain if the Int happens to have units that make the peg not fit.

I know that offhand this seems like a lot of needless complication,
and it is a little bit hard to explain, but I believe it's very close
to how we actually use context in real language, so I think people
will find it intuitively obvious once they start using it.

As one who is seriously thinking about diving head first into the world of Natural Language Processing (aka getting a PhD in it), I can tell you that determining how we humans actually infer context out of language is a mind-bogglingly complex task, and that there are no simple rules to it. It is still very much the case that it's easier to teach humans to talk like a computer than to have a computer understand human language (where "easier" is defined as simply being possible). We can make the computer language look and feel a lot more like a "regular" language, but in the end you're still training the human, not the computer. (Fixing this problem is what draws me to the NLP arena.)

That said, I think that most any solution we pick that lets us easily dance up and down the Unicode tree with something significantly less onerous than Java will become near-intuitive once people start using it. Consider that anyone on this list has what amounts to an intuitive understanding of what

   next if /^\s*$/;

means, even though it looks little like the pure English equivalent: "Skip blank lines." (Though we did put the verb first in each case).

What counts is how much translation has to be done from what's in the programmer's head into something the computer can grok unambiguously. This is not necessarily the same thing as matching how we use language, but I will agree there are often corollaries. The Perl programmer thinking something like "Skip blank lines" will translate that to C<next if /^\s*$/;>, whereas the one thinking "When I get a blank line, skip it" will generate C<if /^\s*$/ {next}>.

Where I'm going with this: your statement that "it's very close to how we actually use context in real language" is better put as "it's very close to one of the common ways we actually use context in real language". The OO method is another common way, where the Direct Object of the sentence is everything. OOP also happens to be a useful way of mapping certain tasks into a computer language. There are many other context models that we use, none significantly better than the rest in general, but each can stomp any other way in specific cases.

Therefore, TMTOWTDI is a Good Thing.

Since strings are so fundamental to what Perl is, we should be able to support several context models and WTDI at once, without prejudice or having to declare what we're doing too much. Now all that's left is figuring out which contexts are meaningful, and how to get them all at once.


-- Rod Adams

PS - I don't think we're that far away from each other on this stuff. We're just looking at it from different sides.
