Austin Hastings <[EMAIL PROTECTED]> writes:

> A couple of alternatives:
>
>   substr.bytes($string, 2, 4) = $substitute;

Well, that's arguably better than bsubstr.

>   substr($string.bytes, 2, 4) = $substitute;

I could live with that, although it doesn't allow mixing units.
(Someone will pop in here and say that's to be construed as a
feature.)
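
To make the unit question concrete, here is a rough sketch of how the same (start, length) pair selects different text depending on the unit. It is in Python rather than Perl 6, since the Perl 6 syntax is exactly what is under debate. The names split_units and substr are made up for illustration, and the "graphemes" branch is only an approximation (a base character plus any combining marks that follow it), not full grapheme-cluster segmentation.

```python
import unicodedata


def split_units(text, unit):
    """Split text into a list of pieces, one piece per unit.

    'bytes'      -> one UTF-8 byte per piece (bytes objects)
    'codepoints' -> one Unicode code point per piece
    'graphemes'  -> ASSUMPTION: approximated as a base character plus
                    any combining marks that immediately follow it
    """
    if unit == "bytes":
        data = text.encode("utf-8")
        return [data[i:i + 1] for i in range(len(data))]
    if unit == "codepoints":
        return list(text)
    if unit == "graphemes":
        pieces = []
        for ch in text:
            # combining() is non-zero for combining marks such as U+0301
            if pieces and unicodedata.combining(ch):
                pieces[-1] += ch
            else:
                pieces.append(ch)
        return pieces
    raise ValueError(f"unknown unit: {unit}")


def substr(text, start, length, unit="codepoints"):
    """Hypothetical substr: start and length are both counted in `unit`."""
    pieces = split_units(text, unit)[start:start + length]
    empty = b"" if unit == "bytes" else ""
    return empty.join(pieces)


# "résumé" spelled with combining accents: 10 bytes, 8 code points,
# 6 approximate graphemes.
word = "re\u0301sume\u0301"

by_bytes = substr(word, 1, 2, "bytes")            # cuts the accent's encoding in half
by_codepoints = substr(word, 1, 2, "codepoints")  # "e" plus its accent
by_graphemes = substr(word, 1, 2, "graphemes")    # accented "e" plus "s"
```

Because this substr takes a single unit for both arguments, it has the same limitation as the substr($string.bytes, ...) form: you cannot give the start in one unit and the length in another.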

>   # Make it a pragma
>   use String(bytes);
>   substr($string, 2, 4) = $substitute;

I think a pragma should set the default unit for the current lexical
scope, at least.  (The default, in the absence of the pragma, is an
open question; at worst the default could be to throw an exception if
units aren't specified; personally I think throwing exceptions
willy-nilly is unPerlish.)

>   # Make it a global mode
>   set_string_mode(bytes);
>   substr($string, 2, 4) = $substitute;

I don't like this.  It's no more useful than the pragma but has bigger
caveats.

>   # Make it an object mode
>   $string.access_mode(bytes);
>   substr($string, 2, 4) = $substitute;

Wouldn't this add extra operations all over the place?

>> The word "bytes" is clearly much too long, though, much less
>> "graphemes" or "codepoints".  I thought about this:
>> 
>> substr($string, 2b, 4b) = $substitute;
>
> Problems with:
>  
>   substr($string, 0b, 1b) = $substitute;
>
> Is that binary or bytes? Also:

I figured it would conflict with something.

>   substr($string, $start b, $end b) = $substitute;
>
> Looks unintuitive.

*shrug*.  I chose it because I thought the other way around looked
unintuitive:

  substr($string, b $start, b $end) = $substitute;

That looks like calling a function -- which *is* what's going on,
under the hood, but the other way around looks like tagging on units,
which seems more natural to me.

>> With presumably g and c for graphemes and codepoints, but I rather
>> suspect that might conflict with some other existing syntax (though I
>> can't think of anything in particular).
>
> 0c? 0x16c ?

Ick, yes, I missed that.  (I was thinking only of numbers specified in
decimal.)  I knew there'd be something.

>> codes and graphs is better than codepoints and graphemes, at least.
>
> In certain (IMO large) sectors of the Perl community, string
> processing is just about all the work there is. I submit that there
> needs to be a way to drive the token length to 0: either a pragma,
> or a global mode, or a type definition.

A pragma should set the default, IMO.  I think what we're talking
about here is what the syntax would be for using a unit other than the
default, or for specifying the units if you haven't used the pragma to
set the default.

>> You could coin the abbreviation ligs, for Language Independent
>> Graphemes.  Then some ingenious rascal can create a pragma or
>> whatever that allows $str.b, $str.c, $str.g, and $str.l for 
>> fans of terseness.
>
> As opposed to 'ligs' meaning ligatures? Fraught with peril. :-)

I thought about that, but figured it wasn't a big deal; there are
*lots* of abbreviations with more than one possible interpretation,
and you just deal with having to know which one is meant.  However, it
was then pointed out that it would actually be ldgs, which IMO is
unpronounceable and ugly.  So something else is needed for those.

*shrug*.  Make up a word.  Call them woohickies for all I care and
abbreviate it woo or just w.

> I like graphemes for the default because I hate and fear
> graphemes. The whole *code thing just crawls right in my ear, so
> having the language transparently support it would be a win.

I can see the logic in that.  Personally I don't care what the default
is.  Almost none of my code will need to care one way or the other,
and that which does can use the pragma.

Have the implications of the bytes/codepoints/graphemes/woohickies
distinction for the regular expression engine been discussed already?
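
For a sense of why that matters, here is a small sketch using Python's re module (again Python, not Perl 6) that counts how many times "." matches against the same accented letter. The helper count_dot_matches is a made-up name for illustration. Python's engine matches one code point per "." on text and one byte per "." on bytes, and it has no grapheme mode, so the grapheme answer (one match in both cases) is not shown.

```python
import re

precomposed = "\u00e9"   # e-acute as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT


def count_dot_matches(subject):
    """Count matches of '.' against subject, which may be str or bytes."""
    pattern = b"." if isinstance(subject, bytes) else "."
    return len(re.findall(pattern, subject))


# One '.' per code point: the two spellings of the same letter differ.
codepoint_counts = (
    count_dot_matches(precomposed),
    count_dot_matches(decomposed),
)

# One '.' per UTF-8 byte: both counts change again.
byte_counts = (
    count_dot_matches(precomposed.encode("utf-8")),
    count_dot_matches(decomposed.encode("utf-8")),
)
```

So the same pattern sees one, two, or three "characters" in what a reader would call a single letter, depending on the unit the engine works in.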

-- 
$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}}
split//,"[EMAIL PROTECTED]/ --";$\=$ ;-> ();print$/
