Jeff:
# This will likely open yet another can of worms, but Unicode has been
# delayed for too long, I think. It's time to add the Unicode libraries
# (In our case, the ICU libraries at <http://oss.software.ibm.com/icu/>,
# which Larry has now blessed) to Parrot. string.c already has
# (admittedly unavoidable, due to the library not being included)
# assumptions such as isdigit(). So, I have a few thoughts (that may
# have already been shot down by people wiser than I in such matters)
# to explicate, and some questions to ask.
#
# ICU should be added as a note in the README, and maybe to 'INSTALL' if
# we ever create one. Let's not add it to CVS, as it's not under our
# control. If we have to patch ICU to make it work correctly with
# Parrot, the patches should be submitted back to the ICU team. And I'm
# joining the appropriate mailing lists to keep apprised of development.

I *really* strongly suggest we include ICU in the distribution. I
recently had to turn off mod_ssl in the Apache 2 distro because I
couldn't get OpenSSL downloaded and configured.

We also need to make sure ICU will work everywhere. And I do mean
*everywhere*. Will it work on VMS? Palm OS? Crays?

# Before Unicode goes into full swing, I need some idea of how we're
# going to deploy the libraries. On this note, I defer to the Configure
# master, Brent. I've already done some work with ICU, so I'm reasonably
# comfortable with migrating in one Unicode bit at a time, until we're
# ready for full UTF-16 compliance.
#
# The RE engine should (I'm speaking without having recently read the
# source, so feel free to correct me) not need to be migrated, as it's
# already using UTF-32 internally, which leaves just the string
# internals. These can be migrated to using ICU macros fairly easily
# (I've already done some of the work locally), so I think the main
# focus should be on encodings, as we'll have to eventually support the
# more common wide-character encodings such as KOI-8 and BIG5.

There are a few things that need to change, but they aren't big issues.
Mostly it's just places where character sets have been presumed.
However, I'm seriously thinking about a major re-architecture of the
regex engine, which would probably help these sorts of issues.

# I still have some questions about using UTF-16 internally for string
# representation (as mentioned in
# <http:[EMAIL PROTECTED]/msg07856.html>), but I've resolved most of
# those. It's an excellent match for the ICU library, as it uses UTF-16
# internally. My only question is if we're going to incur a performance
# hit every time a scalar is transferred to the RE engine, as it uses
# UTF-32 internally.

That can change. However, utf32 seems like the best match, as it would
allow us to reach into a string's guts for speed. (We don't currently do
that, but if I do redesign the engine, I'll probably be able to.)
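
To make the transfer cost concrete, here is a minimal sketch of what a UTF-16 to UTF-32 hand-off involves. It is not Parrot or ICU code: utf16_to_utf32() and its error convention are hypothetical names made up for illustration, and a real implementation would more likely lean on ICU's own macros. The point is only that the conversion is a single linear pass that has to pair up surrogates, after which every code point sits in a fixed-width slot and can be indexed directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of Parrot or ICU): decode src_len UTF-16
 * code units into UTF-32 code points. Returns the number of code points
 * written, or (size_t)-1 on an unpaired surrogate or a too-small buffer.
 */
size_t utf16_to_utf32(const uint16_t *src, size_t src_len,
                      uint32_t *dst, size_t dst_cap)
{
    size_t i = 0; /* index into the UTF-16 input  */
    size_t n = 0; /* index into the UTF-32 output */

    while (i < src_len) {
        uint32_t cp = src[i++];

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: it must be followed by a low surrogate. */
            uint32_t lo;
            if (i >= src_len)
                return (size_t)-1;
            lo = src[i];
            if (lo < 0xDC00 || lo > 0xDFFF)
                return (size_t)-1;
            i++;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            /* Low surrogate with no preceding high surrogate. */
            return (size_t)-1;
        }

        if (n >= dst_cap)
            return (size_t)-1;
        dst[n++] = cp;
    }
    return n;
}
```

Once the data is in the uint32_t buffer, the N-th character is simply dst[N], which is the "reach into a string's guts" property argued for above. With UTF-16 the same lookup needs a scan whenever surrogate pairs may be present.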

# Also, once we have UTF-16 running internally, I'd be interested in
# seeing what memory consumption looks like vs. UTF-32, because I'd like
# to see if it makes sense to add a compile-time switch between UTF-8
# and UTF-32 to let the installer decide on memory tradeoffs. ICU has an
# internal macro that defines its own internal representation, and that
# could conflict with our intended usage as well.
#
# Performance would suffer in the UTF-8 case, naturally, but the
# difference in memory usage might be significant enough that we'd want
# to leave the decision up to the installer. Having said that, the
# headache of testing multiple versions of Perl6 might not be worth it.
#
# So, to wrap up, I'm soliciting thoughts on how best to start the
# Unicode migration, and deal with the inevitable problems that will
# come up. I'm hoping that most of the magic will be hidden in string.c,
# where we won't have to worry about it, but we'll have to see.
#
# Now, this is admittedly being composed at 2:00 A.M., so my thoughts
# may not be the most coherent, and for that I apologize. Most of my
# concern stems from how best to add build steps to the various
# platforms without ending up with a completely broken Parrot for weeks
# and developers screaming about "What the *HELL* is this error? Where
# is this library? <brane explodes>". If these issues have already been
# beaten to death and we've moved on to more interesting issues, of
# course I'll be interested there as well.

Overall you seem to be pretty on target. Of course, my brain isn't
really built for character sets and stuff like that. Also note that I
went to bed at one, was rudely awakened by a screaming toddler at two,
didn't fall asleep again till four, and woke up at nine, so I'm probably
not very coherent.

I feel a little dizzy--I'm gonna take a nap.

--Brent Dax
<[EMAIL PROTECTED]>
@roles=map {"Parrot $_"} qw(embedding regexen Configure)

#define private public
    --Spotted in a C++ program just before a #include