Regarding source data for Perl encodings:
I have a lot of experience with encodings and encoding converters on many
platforms, and compatibility between these converters can turn into a huge mess.
Because Perl can be used to create essentially permanent data repositories,
compatibility across platforms matters.
This is why using a single source for code page conversions is important, and why
relying on platform-specific conversions is a bad idea.
One could pick a set of encoding files from many places, but it seems to me that
using the ICU data files has some key advantages: 1) there's someone to complain to
when they are wrong (and I'm sure bugs will be found in them); 2) ICU's data
files are used by real software, so using them isn't quite so bleeding edge; and
3) they are under source control and versioned, so you can say 'this data was
created with ICU data files version N.M' - or at least someday you could.
If there are licensing issues, I think they can be resolved. If you try to contact the
ICU team about this and you don't get anywhere, please let me know and I'll try to
help.
I believe the best thing long term would be to use ICU for all conversions. Given
this, it makes sense to use the ICU data files in the short run so you can at least
hope for controlled incompatibility.
And I would like to point out that although ICU's converter data file is largish, it
doesn't need to be. It isn't hard to trim it back or even to use separate table files.
Also, ICU doesn't load the whole table data file into memory at once or anything silly
like that. It tries to be efficient.
So, to summarize, I think there is a need for built-in conversion tables at this
point, and I believe they should be derived from ICU UCM files. Also, I believe that
only a small set of single-byte encodings should be built in. Multibyte and ISO-2022
related encodings - even just those used on the Internet - are amazingly nasty
and best left to something like ICU. Even ICU doesn't get everything right, but
there's a decent process for fixing things that are wrong.
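To make "derived from ICU UCM files" a bit more concrete, here is a rough Python sketch of pulling single-byte tables out of UCM text. It assumes the usual CHARMAP line shape, `<UXXXX> \xXX |N`, where the trailing flag is 0 for a roundtrip mapping, 1 for a Unicode-to-codepage fallback, 2 for a substitution mapping and 3 for a codepage-to-Unicode-only mapping. The function name `parse_ucm` is mine, not anything in ICU or Perl, and this is an illustration rather than a complete UCM reader: header fields and multibyte lines are simply ignored.

```python
import re

# One CHARMAP line with a single-byte target, e.g. "<U00E9> \xE9 |0".
# The precision flag is treated as optional and defaults to 0 (roundtrip).
_LINE = re.compile(
    r'^<U([0-9A-Fa-f]{4,6})>\s+\\x([0-9A-Fa-f]{2})(?:\s+\|([0-3]))?$'
)

def parse_ucm(text):
    """Return (to_unicode, from_unicode) dicts for a single-byte UCM table."""
    to_unicode = {}    # byte value -> Unicode code point
    from_unicode = {}  # Unicode code point -> byte value
    in_charmap = False
    for raw in text.splitlines():
        line = raw.split('#', 1)[0].strip()  # drop comments
        if line == 'CHARMAP':
            in_charmap = True
            continue
        if line == 'END CHARMAP':
            break
        if not in_charmap:
            continue
        m = _LINE.match(line)
        if m is None:
            continue  # blank, multibyte, or otherwise unsupported line
        cp = int(m.group(1), 16)
        byte = int(m.group(2), 16)
        flag = int(m.group(3) or 0)
        if flag == 0:
            # Roundtrip: valid in both directions, wins over fallbacks.
            to_unicode[byte] = cp
            from_unicode[cp] = byte
        elif flag == 1:
            # Fallback: Unicode -> byte only (the 'close cousins').
            from_unicode.setdefault(cp, byte)
        elif flag == 3:
            # Reverse fallback: byte -> Unicode only.
            to_unicode.setdefault(byte, cp)
        # flag 2 (substitution mapping) is ignored in this sketch
    return to_unicode, from_unicode
```

The two dictionaries are deliberately separate: the Unicode-to-byte direction may hold more entries than the byte-to-Unicode direction, which is what the fallback mappings are for.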
In regard to the built-in converter engine, it needs these key features (assuming
single-byte encodings for Western/Central European languages plus Cyrillic only):
1) Convert a single-byte code point to Unicode.
2) Convert several 'close cousins' from Unicode to the same single-byte code. In other
words, it needs to convert Unicode N to x and Unicode M to x, etc., each as a separate
character. The ICU UCM files have these alternate mappings in them.
3) Allow Perl internals (at least) to specify what to do when a character can't be
translated from Unicode to the single-byte encoding. This is important because you
might introduce interesting bugs/security holes when, for example, a question mark
gets splotched into your regular expression or, if you strip untranslatables, when the
match string becomes empty. This seems like a potentially big mess to me, but
hopefully it really isn't.
4) A non-feature for this language group is converting Unicode combining sequences (or
simply multiple Unicode characters) to single characters in the code page (and vice
versa). This is required for some encodings, but not for those in this language group.
(There are cases where this would be nice, but it isn't a critical feature.)
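Features 1 through 3 could look roughly like the Python sketch below. Everything in it is illustrative: the tiny tables are made up for the example (they are not taken from any ICU file), and the names `encode_single_byte`, `decode_single_byte` and `on_unmappable` are hypothetical, not an existing Perl or ICU interface. The point is only that the caller, not the converter, decides what happens to an untranslatable character.

```python
# Made-up miniature tables, for illustration only.
TO_UNICODE = {0x41: 0x0041, 0xE9: 0x00E9}
FROM_UNICODE = {
    0x0041: 0x41,  # roundtrip
    0x00E9: 0xE9,  # roundtrip
    0xFF21: 0x41,  # 'close cousin': FULLWIDTH LATIN CAPITAL LETTER A -> 'A'
}

def decode_single_byte(data, to_unicode=TO_UNICODE):
    """Feature 1: map each byte to one Unicode character."""
    return "".join(chr(to_unicode[b]) for b in data)

def encode_single_byte(text, from_unicode=FROM_UNICODE,
                       on_unmappable="error", substitute=0x3F):
    """Features 2 and 3: many-to-one mappings plus a caller-chosen policy.

    on_unmappable is one of:
      "error"      - raise ValueError (the safest default)
      "substitute" - emit the substitute byte (0x3F is '?')
      "skip"       - silently drop the character
    """
    out = bytearray()
    for ch in text:
        byte = from_unicode.get(ord(ch))
        if byte is not None:
            out.append(byte)
        elif on_unmappable == "error":
            raise ValueError("cannot encode U+%04X" % ord(ch))
        elif on_unmappable == "substitute":
            out.append(substitute)
        elif on_unmappable == "skip":
            continue
        else:
            raise ValueError("unknown policy: %r" % (on_unmappable,))
    return bytes(out)
```

With these tables, `encode_single_byte("A\uFF21")` gives `b"AA"`, while a character with no mapping either raises, becomes `?`, or disappears depending on the policy. The last two are exactly the cases that can turn into a stray question mark in a regular expression or an empty match string, which is why defaulting to an error seems the least surprising choice.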
Thanks and regards,
=Ed