Austin Hastings <[EMAIL PROTECTED]> writes:

> A couple of alternatives:
>
> substr.bytes($string, 2, 4) = $substitute;

Well, that's arguably better than bsubstr.

> substr($string.bytes, 2, 4) = $substitute;

I could live with that, although it doesn't allow mixing units.
(Someone will pop in here and say that's to be construed as a
feature.)

> # Make it a pragma
> use String(bytes);
> substr($string, 2, 4) = substitute;

I think a pragma should set the default unit for the current lexical
scope, at least.  (The default, in the absence of the pragma, is an
open question; at worst the default could be to throw an exception if
units aren't specified; personally I think throwing exceptions willy
nilly is unPerlish.)

> # Make it a global mode
> set_string_mode(bytes);
> substr($string, 2, 4) = substitute;

I don't like this.  It's no more useful than the pragma but has bigger
caveats.

> # Make it an object mode
> $string.access_mode(bytes);
> substr($string, 2, 4) = $substitute;

Wouldn't this add extra operations all over the place?

>> The word "bytes" is clearly much too long, though, much less
>> "graphemes" or "codepoints".  I thought about this:
>>
>> substr($string, 2b, 4b) = $substitute;
>
> Problems with:
>
> substr($string, 0b, 1b) = $substitute;
>
> Is that binary or bytes?  Also:

I figured it would conflict with something.

> substr($string, $start b, $end b) = $substitute;
>
> Looks unintuitive.

*shrug*.  I chose it because I thought the other way around looked
unintuitive:

substr($string, b $start, b $end) = $substitute;

That looks like calling a function -- which *is* what's going on,
under the hood, but the other way around looks like tagging on units,
which seems more natural to me.

>> With presumably g and c for graphemes and codepoints, but I rather
>> suspect that might conflict with some other existing syntax (though I
>> can't think of anything in particular).
>
> 0c?  0x16c ?

Ick, yes, I missed that.  (I was thinking only of numbers specified in
decimal.)  I knew there'd be something.

>> codes and graphs is better than codepoints and graphemes, at least.
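
To make the three units concrete, here is a rough sketch in Python
(only because the Perl 6 spelling is exactly what's undecided).  The
names units() and substr() and the unit labels are made up for
illustration, UTF-8 is assumed for the byte view, and the grapheme
handling is an approximation that just glues combining marks onto the
preceding character rather than following the full Unicode
segmentation rules.

```python
import unicodedata


def units(s, unit):
    """Split s into a list of pieces, one per requested unit (hypothetical helper)."""
    if unit == "bytes":
        data = s.encode("utf-8")  # assumption: UTF-8 is the byte representation
        return [data[i:i + 1] for i in range(len(data))]
    if unit == "codes":
        return list(s)  # a Python str iterates by codepoint
    if unit == "graphs":
        # Approximation only: attach each combining mark to the preceding piece.
        clusters = []
        for ch in s:
            if clusters and unicodedata.combining(ch):
                clusters[-1] += ch
            else:
                clusters.append(ch)
        return clusters
    raise ValueError("unknown unit: %r" % (unit,))


def substr(s, start, length, unit="codes"):
    """Return `length` units of s starting at `start`, counted in `unit`."""
    pieces = units(s, unit)[start:start + length]
    if unit == "bytes":
        return b"".join(pieces)  # a byte slice may cut a character in half
    return "".join(pieces)


# "e" + COMBINING ACUTE ACCENT, "t", "e" + COMBINING ACUTE ACCENT
s = "e\u0301te\u0301"
counts = {u: len(units(s, u)) for u in ("bytes", "codes", "graphs")}
```

The same offsets pick out different text depending on the unit, which
is why the default matters so much: one "graphs" unit at offset 0 is
the accented letter, one "codes" unit is the bare "e", and a byte
count can stop in the middle of the accent.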
>
> In certain (IMO large) sectors of the Perl community, string
> processing is just about all the work there is.  I submit that there
> needs to be a way to drive the token length to 0: either a pragma,
> or a global mode, or a type definition.

A pragma should set the default, IMO.  I think what we're talking
about here is what the syntax would be for using a unit other than the
default, or for specifying the units if you haven't used the pragma to
set the default.

>> You could coin the abbreviation ligs, for Language Independent
>> Graphemes.  Then some ingenious rascal can create a pragma or
>> whatever that allows $str.b, $str.c, $str.g, and $str.l for
>> fans of terseness.
>
> As opposed to 'ligs' meaning ligatures?  Fraught with peril. :-)

I thought about that, but figured it wasn't a big deal; there are
*lots* of abbreviations with more than one possible interpretation,
and you just deal with having to know which one is meant.  However, it
was then pointed out that it would actually be ldgs, which IMO is
unpronounceable and ugly.  So something else is needed for those.
*shrug*.  Make up a word.  Call them woohickies for all I care and
abbreviate it woo or just w.

> I like graphemes for the default because I hate and fear
> graphemes.  The whole *code thing just crawls right in my ear, so
> having the language transparently support it would be a win.

I can see the logic in that.  Personally I don't care what the default
is.  Almost none of my code will need to care one way or the other,
and that which does can use the pragma.

Have the implications of the bytes/codepoints/graphemes/woohickies
distinction for the regular expression engine been discussed already?

-- 
$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}}
split//,"[EMAIL PROTECTED]/ --";$\=$ ;-> ();print$/