On Sat, Mar 26, 2005 at 02:37:24PM -0600, Rod Adams wrote:
: Larry Wall wrote:
:
: >%+ and %- are gone. $0, $1, $2, etc. are all objects that know
: >where they .start and .end. (Mind you, those methods return magical
: >positions that are Unicode level independent.)
: >
: How can you have a level independent position?

By not confusing positions with numbers. They're just pointers into
a particular string.

: The matching itself happens at a specified level. (Note that which level
: the match happens at can change what is matched.) So it makes sense that
: all the positions that come out of it are in terms of that level.

When we're dealing with mostly variable length encodings, it makes
more sense that the positions come out as string pointers that only
convert to numbers grudgingly under duress. If you're just going to
feed a position back into a substr() or as the start position of the
next index(), there's no reason to translate it to a number and back
to a pointer. It's a lot more efficient if you don't.

: Now, that position can be translated to a lower level, but not to an
: upper level, since you can happily land in the middle of a char.

I talked about this problem in one of the As. I think the fail soft
approach is to round to the next "ceiling" boundary and issue a warning.

: This is part of what I'm having trouble with your concept of a Str being
: at several levels at once: There's no reliable way to have a notion of
: "position", except to have it as attached to the highest possible level,
: and the second someone does something at lower level, you void the
: position, and possibly the ability to remain at that high level.

A position that is a pointer can be true for all levels simultaneously.
It has the additional benefit of a type that is subtype constrained to
operate with other values from the same string, so if you subtract two
pointers from different strings, you can actually detect the error.

: I still see my notion of a Str having only one level and encoding at a
: time as being preferable. Having the ability to recast a string to other
: levels/encoding should be easy, and many builtins should do that
: recasting for you.

And I still see that you can have your view if you install a pragma
that forces all incoming strings to a single level.
But I think we can do that lazily, or not at all, in many cases.

The basic underlying problem is that there is no simple mapping from
math to Unicode. The language that lets people express their solution
in terms of Unicode instead of in terms of math is going to have a leg
up on the future, at least in the Unicode problem space. Strings were
never arrays in Perl, and they're only getting further apart as the
world makes greater demands on strings to represent human language.

So I'd much rather introduce an abstraction like "string position"
now that is not a number. It's a dimensional value, where the scaling
of the dimensionality is bound to a particular string. You can have
a pragma that says, "Untyped numbers are assumed to be meters,
kilograms, and seconds", and a different lexical scope might have a
pragma that says "Untyped numbers are assumed to be centimeters, grams,
and seconds." These scopes can get along as long as they don't try to
exchange untyped integers. Or if they do, they have some way of
ascertaining what an untyped integer meant when it was generated.

: I do _not_ see $/ & friends getting ported across a recasting. .pos can
: be translated if new level <= old level, otherwise gets set to undef.

The interesting thing about a pointer is that you can pass it through
a higher level transparently as long as you don't actually try to use it.
But if you do try to use it, I think undef is overkill. Just as a float
stuffed into an int truncates, we should just pick a direction to find
the next boundary and go from there, maybe with a loss of precision
warning. The right way to suppress the warning would be to install
an explicit function that rounds up or down.

: Please convince me your view works in practice. I'm not seeing it work
: well when I attempt to define the relevant parts of S29. But I might
: just be dense on this.

Well, let's work through an example.

    multi method substr(Str $s: Ptr $start, PtrDiff ?$len, Str ?$repl)

Depending on the typology of Ptr and PtrDiff, we can either coerce
various dimensionalities into an appropriate Ptr and PtrDiff type
within those classes, or we could rely on MMD to dispatch to a suite
of substr implementations with more explicit classes. Interestingly,
since Ptrs aren't integers, we might also allow

    multi method substr(Str $s: Ptr $start, Ptr ?$end, Str ?$repl)

which might be a more natural way to deal with variable length
encodings, and we just leave the "lengthy" version in there for old
times' sake.

We could go as far as to allow a range as the second argument:

    $x = substr($a, $start..^$end);

or its evil twin:

    $x = $a[$start..^$end];

Of course, with the evil twin notation you lose the "repl" facility.
But in fact, we probably can't actually allow both of

    multi method substr(Str $s: Ptr $start, PtrDiff ?$len, Str ?$repl)
    multi method substr(Str $s: Ptr $start, Ptr ?$end, Str ?$repl)

since MMD might well tie on who to dispatch

    substr($a, 5, 10)

to, unless we forced it to Perl 5 interpretation in case of tie.

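As a rough illustration of the pointer-style positions discussed so far,
here is a minimal sketch in Python rather than Perl 6. Everything in it
is hypothetical: the names StrPos, Text, pos_codes, pos_bytes and slice
are invented for the example, only two levels (bytes and code points)
are modeled, and storing the position as a UTF-8 byte offset is an
assumption of the sketch, not something the design above prescribes.
It shows a position that is bound to one string, a subtraction that
detects positions from different strings, and fail-soft rounding up to
the next boundary with a warning.

```python
import warnings


class StrPos:
    """A position tied to one particular string (hypothetical name).

    Stored as a byte offset into the string's UTF-8 form, so the same
    position can be read at the byte level or the code point level.
    """

    def __init__(self, owner, byte_offset):
        self.owner = owner              # the Text this position belongs to
        self.byte_offset = byte_offset

    def __sub__(self, other):
        # Positions taken from different strings cannot be subtracted.
        if self.owner is not other.owner:
            raise TypeError("positions belong to different strings")
        return self.byte_offset - other.byte_offset  # distance in bytes


class Text:
    """A string that hands out StrPos objects instead of integers."""

    def __init__(self, s):
        self.s = s
        self.raw = s.encode("utf-8")

    def pos_codes(self, n):
        """Position after the first n code points."""
        return StrPos(self, len(self.s[:n].encode("utf-8")))

    def pos_bytes(self, n):
        """Position after the first n bytes; may fall inside a character."""
        return StrPos(self, n)

    def _ceil(self, offset):
        """Round a byte offset up to the next code point boundary."""
        rounded = offset
        # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
        while rounded < len(self.raw) and (self.raw[rounded] & 0xC0) == 0x80:
            rounded += 1
        if rounded != offset:
            warnings.warn("position rounded up to a character boundary")
        return rounded

    def slice(self, start, end):
        """Non-destructive slice between two positions of this string."""
        if start.owner is not self or end.owner is not self:
            raise TypeError("position belongs to a different string")
        lo = self._ceil(start.byte_offset)
        hi = self._ceil(end.byte_offset)
        return self.raw[lo:hi].decode("utf-8")


a = Text("na\u00efve caf\u00e9")
print(a.slice(a.pos_codes(2), a.pos_codes(5)))   # code points 2 up to 5
```

Note that the caller never sees an integer unless it asks for one (here,
only by subtracting two positions of the same string), which is the
property the message argues for.
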
So I'm guessing we put in

    $x = $a[$start..^$end];

for the non-destructive slicing of a string, and leave substr() with
Perl 5 semantics, in which case it's just a SMOP to coerce the user's

    substr($a, 5, 10);

to something that effectively means

    substr($a, Ptr.new($a, 5, $?UNI_LEVEL), PtrDiff.new(10, $?UNI_LEVEL));

Actually, in this case, I expect we're actually calling into

    multi method substr(Str $s: PtrDiff $start, PtrDiff ?$len, Str ?$repl)

where $start will be counted from the beginning of the string, so the
call is effectively

    substr($a, PtrDiff.new(5, $?UNI_LEVEL), PtrDiff.new(10, $?UNI_LEVEL));

Okay, that looks scary, but if as in my previous message we define
"chars" as the highest Unicode level allowed by the context and the
string, then we can just write that in some notation resembling:

    substr($a, 5`Chars, 10`Chars);

or whatever notation we end up with for labeling units on numbers.
Even if we don't define "chars" that way, they just end up labeled with
the current level (here assuming "Codes"):

    substr($a, 5`Codes, 10`Codes);

or whatever.

But this is all implicit, which is why you can just write

    substr($a, 5, 10);

and have it DWYM.

Now, I admit that I've handwaved the tricksy bit, which is, "How do
you know, Larry, that substr() wants 5`Codes rather than 5`Meters?
It's all very well if you have a single predeclared subroutine and
can look at the signature at compile time, but you wrote those as multi
methods up above, so we don't know the signature at compile time."

Well, that's correct, we don't know it at compile time. But what
*do* we know? We know we have a number, and that it was generated
in a context where, if you use it like a string position, it should
turn into a number of code points, and if you use it like a weight,
it should turn into a number of kilograms (or pounds, if you're NASA).

In other words, the effective type of that literal 5 is not "Int",
but "Int|Codes|Meters|Kilograms|Seconds|Bogomips" or some such.
And if MMD can handle that in an argument type and match up Codes as
a subtype of Ptr, and if we write our method signature only in the
abstract types like Ptr, we're pretty much home free. That certainly
simplifies how you write S29, though I don't know if the MMD folks will
be terribly happy with the notion of dispatching arguments with
junctional types. But it does fall out of the design rather naturally.

I know that offhand this seems like a lot of needless complication,
and it's a little bit hard to explain, but I believe it's very close
to how we actually use context in real language, so I think people
will find it intuitively obvious once they start using it. More to
the point, it will produce very *clean* code, uncluttered with most
of the crufty conversions and coercions you find in singly typed
languages that compiler writers tend to love, and programmers tend
to hate.

And you can still put in all that cruft if you want to. You can even
force yourself to have to do it. But to me, it feels a bit like
slavery, so I'm still looking for a land flowing with milk and honey,
even if there are a few giants in it.

Er, sorry for waxing poetic.

Larry
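
The idea in the last part of the message, that a bare number picks up
its unit from the type the callee asks for while an explicitly labeled
number is checked, can be sketched as follows. This is a Python
approximation under stated assumptions, not the MMD or junctional-type
mechanism described above: Codes, Meters and as_unit are hypothetical
names, and the coercion happens inside the function at call time rather
than in a dispatcher.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Codes:
    """A count of code points (stand-in for the 5`Codes notation)."""
    n: int


@dataclass(frozen=True)
class Meters:
    """A length in meters, included only as an unrelated second unit."""
    n: int


def as_unit(value, unit):
    """Read a bare int in the requested unit; reject other labeled values."""
    if isinstance(value, unit):
        return value
    if isinstance(value, int) and not isinstance(value, bool):
        return unit(value)      # untyped literal: adopt the requested unit
    raise TypeError(f"expected {unit.__name__}, got {type(value).__name__}")


def substr(s, start, length):
    """Perl 5 style substr whose arguments are code point counts."""
    start = as_unit(start, Codes)
    length = as_unit(length, Codes)
    # Python str indexing already counts code points.
    return s[start.n:start.n + length.n]


print(substr("hello, world", 7, 5))                 # bare numbers
print(substr("hello, world", Codes(7), Codes(5)))   # labeled numbers
```

Passing Meters(7) as the start raises a TypeError in this sketch, which
is the "scopes must not exchange mismatched units" check in miniature.
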