On Mon, Mar 28, 2005 at 11:53:07AM -0500, Chip Salzenberg wrote:
: According to Larry Wall:
: > On Fri, Mar 25, 2005 at 07:38:10PM -0000, Chip Salzenberg wrote:
: > : And might I also ask why in Perl 6 (if not Parrot) there seems to be
: > : no type support for strings with known encodings which are not subsets
: > : of Unicode?
: >
: > Well, because the main point of Unicode is that there *are* no encodings
: > that cannot be considered subsets of Unicode.
:
: Certainly the Unicode standard makes such a claim about itself.  There
: are people who remain unpersuaded by Unicode's advertising.  I conclude
: that they will find Perl 6 somewhat disappointing.
If it turns out to be a Real Problem, we'll fix it.  Right now I think
it's a Fake Problem, and we have more important things to worry about.
Most of the carping about Unicode is with regard to CJK unifications
that can't be represented in any one existing character set anyway.
Unicode has at least done pretty well with the round-trip guarantee
for any single existing character set.

There are certainly localization issues with regard to default input
and output transformations, and things like changing the default
collation order from Unicodian to SJISian or Big5ian or whatever.  But
those are good things to make explicit in any event, and that's what
the language-dependent level is for.  And people who are trying to
write programs across language boundaries are already basically
screwed over by their national character sets.  You can't even go back
and forth between Japanese and English without getting all fouled up
between ¥ and \.  Unicode distinguishes them, so it's a distinction
that Perl 6 *always makes*.

That being said, there's no reason in the current design that a string
that is viewed at the language level as, say, French couldn't actually
be encoded in Morse code or some such.  It's *only* the abstract
semantics at the current Unicode level that are required to be Unicode
semantics by default.  And it's as lazy as we care to make it--when
you do s/foo/bar/ on a string, it's not required to convert the string
from any particular encoding to any other.  It only has to have the
same abstract result *as if* you'd translated it to Unicode and then
back to whatever the internal form is.

Even if you don't want to emulate Unicode in the API, there are
options.  For some problems it'd be more efficient to translate
lazily, and for others it's more efficient to just translate
everything once on input and once on output.  (It also tends to be a
little cleaner to isolate "lossy" translations to one spot in the
program.
By the round-trip nature of Unicode, most of the lossy translations
would be on output.)

But anyway, a bit about my own psychology.  I grew up as a preacher's
kid in a fundamentalist setting, and I heard a lot of arguments of the
form, "I'm not offended by this, but I'm afraid someone else might be
offended, so you shouldn't do it."  I eventually learned to discount
such arguments to preserve my own sanity, so saying "someone might be
disappointed" is not quite sufficient to motivate me to action.  Plus
there are a lot of people out there who are never happy unless they
have something to be unhappy about.  If I thought I could design a
language that would never disappoint anyone, I'd be a lot stupider
than I already think I am, I think.

All that being said, you can do whatever you like with Parrot, and if
you give it a decent enough API, someone will link it into Perl 6.  :-)

Larry
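The "as if" semantics and the round-trip/lossy-on-output split above can be sketched concretely.  Here is a minimal illustration in Python (the `LazyStr` class and all its names are invented for this sketch; nothing here reflects Perl 6's or Parrot's actual string internals):

```python
# Hypothetical sketch: a string that keeps its raw bytes plus their
# encoding and decodes lazily.  An operation like s/foo/bar/ need only
# produce the same abstract result "as if" the text were translated to
# Unicode and back to the internal form.
class LazyStr:
    def __init__(self, raw: bytes, encoding: str):
        self.raw = raw
        self.encoding = encoding
        self._text = None            # decoded form, filled in on demand

    def _decode(self) -> str:
        if self._text is None:
            self._text = self.raw.decode(self.encoding)
        return self._text

    def subst(self, old: str, new: str) -> "LazyStr":
        # Semantically: decode to Unicode, substitute, re-encode.
        # An implementation is free to skip the decode entirely when
        # the match can be done on the raw bytes, as long as the
        # abstract result is the same.
        text = self._decode().replace(old, new)
        return LazyStr(text.encode(self.encoding), self.encoding)

s = LazyStr("foo et bar".encode("latin-1"), "latin-1")
assert s.subst("foo", "baz").raw == b"baz et bar"

# The round trip within a single character set is lossless...
assert b"caf\xe9".decode("latin-1").encode("latin-1") == b"caf\xe9"
# ...so lossy translations show up on output, when the target
# character set lacks a character:
assert "café".encode("ascii", errors="replace") == b"caf?"

# And Unicode distinguishes ¥ (U+00A5) from \ (U+005C), so that
# distinction is always available at the abstract level:
assert (ord("¥"), ord("\\")) == (0xA5, 0x5C)
```

Whether the translation happens lazily per-operation or eagerly at the I/O boundary is then purely an efficiency choice; the abstract result is identical either way.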