Dan Sugalski wrote:
> >The string API should be sufficiently smart to be able to convert data
> >from one encoding to another as it's more convenient.
>
> No, the vtable functions for the variables should know how to convert from
> and to perl's preferred string representations, and can do whatever
> Bizarre Magic they care to internally.
>

I don't see why Perl couldn't deal with multiple representations internally.
Conversion could be done on the way in, internally for efficiency on certain
operations, and on the way out, again.


> >On the other side, for a string that is matched against regexps, it
> >doesn't matter much if it has variable character length, since regexps
> >normally read all the string anyway, and indexing characters isn't much
> >of a concern.
>
> You underestimate the impact of variable-length data, I think. Regexes
> should go rather faster on fixed-length than variable length data. How
> much so depends on your processor. (I can guarantee that Alphas will run
> a darned sight faster on UTF-32 than UTF-8...)
>

Agreed. It should go faster. But maybe I don't need it that fast!
(I really don't think it would be much slower than doing it on an ASCII
string with the same total buffer size; it would only have to fetch another
byte under certain conditions and build the extended character
representation, which isn't hard either.)
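
To make the "fetch another byte" point concrete, here is a minimal sketch
in Python of decoding UTF-8 one character at a time. It is only an
illustration, not how Perl does or would do it internally; the function
name decode_utf8 is made up for this example, and it checks only the
lead/continuation byte structure (a real decoder must also reject overlong
forms and surrogates).

```python
def decode_utf8(buf: bytes) -> list:
    """Decode UTF-8 bytes into a list of integer code points.

    Illustrative only: validates lead/continuation byte structure,
    but does not reject overlong encodings or surrogates.
    """
    out = []
    i, n = 0, len(buf)
    while i < n:
        b = buf[i]
        if b < 0x80:                 # plain ASCII: no extra bytes to fetch
            cp, extra = b, 0
        elif (b >> 5) == 0b110:      # 110xxxxx: one continuation byte
            cp, extra = b & 0x1F, 1
        elif (b >> 4) == 0b1110:     # 1110xxxx: two continuation bytes
            cp, extra = b & 0x0F, 2
        elif (b >> 3) == 0b11110:    # 11110xxx: three continuation bytes
            cp, extra = b & 0x07, 3
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + extra >= n:
            raise ValueError(f"truncated sequence at offset {i}")
        # Fetch the extra bytes and fold 6 payload bits from each into cp.
        for c in buf[i + 1 : i + 1 + extra]:
            if (c & 0xC0) != 0x80:
                raise ValueError(f"invalid continuation byte after offset {i}")
            cp = (cp << 6) | (c & 0x3F)
        out.append(cp)
        i += 1 + extra
    return out


print(decode_utf8("h\u00e9".encode("utf-8")))  # [104, 233]
```

The ASCII branch does no extra work, so the added cost shows up only on
bytes with the high bit set. How much that branching costs in practice
depends on the processor, as noted in the quoted text.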


> >It would be nice if the user had some control to this, for example by
> >saying "I don't care this string will be used by substr, leave it in
> >UTF-8 since it's too big and I don't want to waste memory!", or "This
> >string isn't too big, so I should convert it to bloated UTF-32 at
> >once!", or even "use less 'memory';".
>
> That would be:
>    my str $foo : utf8 : fixed;
> or possibly
>    use less qw(memory);
>

Probably not "my str $foo :utf8 :fixed", since then if I have $bar = $foo it
would convert the string value from $foo into some other representation,
right?


> Generally speaking you probably don't want to do this. Odds are if you
> think you know what's going on better than the compiler, you're wrong.
> (Not always, but in a non-trivial number of cases, in my experience)
>

I can't beat the compiler, that's for sure. But I really don't think I want
to read a 100KB file into a variable all at once and end up with 400KB
memory usage only for that file. And I really don't care if `regexps' go
slower on that, I can live with it...
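
The arithmetic behind the 100KB versus 400KB figure can be checked with a
small Python sketch. It assumes the file is pure ASCII text, and the helper
name encoded_sizes is made up for this example. The "utf-32-le" codec is
used so that no byte-order mark is added to the count.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes the text occupies in each encoding."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # Explicit little-endian variant: no BOM, exactly 4 bytes per character.
        "utf-32": len(text.encode("utf-32-le")),
    }


# Stand-in for a 100KB file made up only of ASCII characters.
ascii_file = "x" * (100 * 1024)
print(encoded_sizes(ascii_file))  # {'utf-8': 102400, 'utf-32': 409600}
```

For text that is mostly non-ASCII the gap narrows, since UTF-8 then needs
two to four bytes per character while UTF-32 always needs four.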


> >And I believe 8-bit ASCII will always be an option, for who doesn't care
> >about extended characters and want the best of both worlds on speed and
> >memory usage.
>
> 8-bit characters in general, yep. (ASCII is really 7-bit) ASCII, EBCDIC,
> or raw byte buffers.
>

That includes Latin-1, Latin-etc. (I believe there are 10 or 12 of them),
which are the same as ISO-8859-1, ISO-8859-(etc).

- Branden

