Jeff:
# This will likely open yet another can of worms, but Unicode has been
# delayed for too long, I think. It's time to add the Unicode libraries
# (In our case, the ICU libraries at <http://oss.software.ibm.com/icu/>,
# which Larry has now blessed) to Parrot. string.c already has
# (admittedly unavoidable, due to the library not being included)
# assumptions such as
# isdigit(). So, I have a few thoughts (that may have already been shot
# down by people wiser than I in such matters) to explicate, and some
# questions to ask.
#
# ICU should be added as a note in the README, and maybe to 'INSTALL' if
# we ever create one. Let's not add it to CVS, as it's not under our
# control. If we have to patch ICU to make it work correctly with Parrot,
# the patches should be submitted back to the ICU team. And I'm joining
# the appropriate mailing lists to keep apprised of development.

I *really* strongly suggest we include ICU in the distribution.  I
recently had to turn off mod_ssl in the Apache 2 distro because I
couldn't get OpenSSL downloaded and configured.

We also need to make sure ICU will work everywhere.  And I do mean
*everywhere*.  Will it work on VMS?  Palm OS?  Crays?

# Before Unicode goes into full swing, I need some idea of how we're going
# to deploy the libraries. On this note, I defer to the Configure master,
# Brent. I've already done some work with ICU, so I'm reasonably
# comfortable with migrating in one Unicode bit at a time, until we're
# ready for full UTF-16 compliance.
#
# The RE engine should (I'm speaking without having recently read the
# source, so feel free to correct me) not need to be migrated, as it's
# already using UTF-32 internally, which leaves just the string internals.
# These can be migrated to using ICU macros fairly easily (I've already
# done some of the work locally), so I think the main focus should be on
# encodings, as we'll have to eventually support the more common
# legacy encodings such as KOI-8 and BIG5.

There are a few things that need to change, but they aren't big issues.
Mostly it's just places where character sets have been presumed.

However, I'm seriously thinking about a major re-architecture of the
regex engine, which would probably help these sorts of issues.
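
To make "places where character sets have been presumed" concrete, here is a
minimal sketch of the isdigit() problem. The names byte_is_digit and
cp_is_digit are made up for illustration and are not in string.c, and the
three ranges are only a small sample of the Unicode decimal digits. As far as
I know, ICU's u_isdigit() in unicode/uchar.h is the real thing we would call.

```c
#include <ctype.h>
#include <stdint.h>

/* Byte-oriented check: only ever sees one byte, so it presumes a
 * single-byte character set. */
int byte_is_digit(unsigned char b)
{
    return isdigit(b) != 0;
}

/* Hypothetical code-point check covering a few decimal-digit ranges.
 * A real implementation would consult the Unicode character database
 * (for example through ICU) instead of hard-coding ranges. */
int cp_is_digit(uint32_t cp)
{
    return (cp >= 0x0030 && cp <= 0x0039)   /* ASCII digits */
        || (cp >= 0x0660 && cp <= 0x0669)   /* Arabic-Indic digits */
        || (cp >= 0x0966 && cp <= 0x096F);  /* Devanagari digits */
}
```

U+0663 (ARABIC-INDIC DIGIT THREE) is 0xD9 0xA3 in UTF-8. Neither byte passes
the byte-oriented check, while the code-point check accepts it.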

# I still have some questions about using UTF-16 internally for string
# representation (as mentioned in
# <http:[EMAIL PROTECTED]/msg07856.html>),
# but I've resolved most of those. It's an excellent match for the ICU
# library, as it uses UTF-16 internally. My only question is if we're
# going to incur a performance hit every time a scalar is transferred to
# the RE engine, as it uses UTF-32 internally.

That can change.  However, UTF-32 seems like the best match, as it would
allow us to reach into a string's guts for speed.  (We don't currently
do that, but if I do redesign the engine, I'll probably be able to.)
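
A rough sketch of why reaching into the guts is cheaper with UTF-32: the Nth
character is a plain array lookup, while UTF-16 has to be scanned from the
start because of surrogate pairs. The function names are hypothetical (this is
not the current engine), and the UTF-16 version assumes well-formed input.

```c
#include <stddef.h>
#include <stdint.h>

/* UTF-32: one unit per code point, so indexing is O(1). */
uint32_t utf32_at(const uint32_t *s, size_t i)
{
    return s[i];
}

/* UTF-16: walk the buffer, combining surrogate pairs, until the i-th
 * code point is reached. Returns 0xFFFFFFFF if i is out of range. */
uint32_t utf16_at(const uint16_t *s, size_t len, size_t i)
{
    size_t u = 0;
    while (u < len) {
        uint32_t c = s[u++];
        if (c >= 0xD800 && c <= 0xDBFF && u < len) {
            uint32_t lo = s[u];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                /* Combine high and low surrogate into one code point. */
                c = 0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00);
                u++;
            }
        }
        if (i == 0)
            return c;
        i--;
    }
    return 0xFFFFFFFFu;
}
```

The scan makes random access O(n) for UTF-16, which matters for an engine that
backtracks a lot.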

# Also, once we have UTF-16 running internally, I'd be interested in
# seeing what memory consumption looks like vs. UTF-32, because I'd like to
# see if it makes sense to add a compile-time switch between UTF-8 and
# UTF-32 to let the installer decide on memory tradeoffs. ICU has an
# internal macro that defines its own internal representation, and that
# could conflict with our intended usage as well.
#
# Performance would suffer in the UTF-8 case, naturally, but the
# difference in memory usage might be significant enough that we'd want to
# leave the decision up to the installer. Having said that, the headache
# of testing multiple versions of Perl6 might not be worth it.
#
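
For a rough feel of the memory tradeoff described above, here is a small
sketch that totals the bytes a run of code points would take in each encoding.
The type and function names are invented for illustration, and it counts only
the encoded payload, not any string header or allocation overhead.

```c
#include <stddef.h>
#include <stdint.h>

/* Encoded size in bytes of one code point in each encoding form. */
size_t utf8_len(uint32_t cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

size_t utf16_len(uint32_t cp)
{
    return cp < 0x10000 ? 2 : 4;   /* surrogate pair above the BMP */
}

size_t utf32_len(uint32_t cp)
{
    (void)cp;                      /* always one 4-byte unit */
    return 4;
}

typedef struct {
    size_t utf8, utf16, utf32;
} enc_sizes;

/* Total payload bytes for n code points in each encoding. */
enc_sizes measure(const uint32_t *cps, size_t n)
{
    enc_sizes t = {0, 0, 0};
    for (size_t i = 0; i < n; i++) {
        t.utf8  += utf8_len(cps[i]);
        t.utf16 += utf16_len(cps[i]);
        t.utf32 += utf32_len(cps[i]);
    }
    return t;
}
```

Mostly-ASCII text comes out at about a quarter of the UTF-32 size in UTF-8 and
half in UTF-16, while text dominated by CJK characters takes three bytes per
character in UTF-8 versus two in UTF-16.
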
# So, to wrap up, I'm soliciting thoughts on how best to start the Unicode
# migration, and deal with the inevitable problems that will come up. I'm
# hoping that most of the magic will be hidden in string.c, where we won't
# have to worry about it, but we'll have to see.
#
# Now, this is admittedly being composed at 2:00 A.M., so my thoughts may
# not be the most coherent, and for that I apologize. Most of my concern
# stems from how best to add build steps to the various platforms without
# ending up with a completely broken Parrot for weeks and developers
# screaming about "What the *HELL* is this error? Where is this library?
# <brane explodes>". If these issues have already been beaten to death and
# we've moved on to more interesting issues, of course I'll be interested
# there as well.

Overall you seem to be pretty on target.  Of course, my brain isn't
really built for character sets and stuff like that.

Also note that I went to bed at one, was rudely awakened by a screaming
toddler at two, didn't fall asleep again till four, and woke up at nine,
so I'm probably not very coherent.  I feel a little dizzy--I'm gonna
take a nap.

--Brent Dax <[EMAIL PROTECTED]>
@roles=map {"Parrot $_"} qw(embedding regexen Configure)

#define private public
    --Spotted in a C++ program just before a #include
