RE: Source data for perl encodings

Ed Batutis Wed, 10 Jan 2001 10:55:07 -0800
In regard to source data for perl encodings:

I have a lot of experience with encodings and encoding converters on many 
platforms, and compatibility between these converters can turn into a huge mess. 
Because Perl can be used to create essentially permanent data repositories, 
compatibility across platforms is a good thing.

This is why using one single source for code page conversions is important and why 
using platform-specific conversions is a bad thing.

One could pick a set of encoding files from many places, but it seems to me that 
using the ICU data files has some key advantages:

1) there's someone to complain to when they are wrong, and I'm sure there will be 
bugs found in them.
2) ICU's data files are being used by real software, so using them isn't quite so 
bleeding edge.
3) they are under source control and have versioning, so you can say 'this data was 
created with ICU data files version N.M' (or at least someday you could say that).

If there are licensing issues, I think they can be resolved. If you try to contact the 
ICU team about this and you don't get anywhere, please let me know and I'll try to 
help.

I believe the best thing long term would be to use ICU for all conversions. Given 
this, it makes sense to use the ICU data files in the short run, so that any 
incompatibility is at least controlled.

And I would like to point out that although ICU's converter data file is largish, it 
doesn't need to be. It isn't hard to trim it back or to even use separate table files. 
Also, ICU doesn't load the whole table data file into memory at once or some silly 
thing like that. It tries to be efficient.

So, to summarize, I think there is a need for built-in conversion tables at this point 
in time and I believe they should be derived from ICU UCM files. Also, I believe that 
only a small set of single-byte encodings should be built in. Multibyte and ISO-2022 
related encodings - even just those used on the Internet - are amazingly nasty 
and best left to something like ICU. Even ICU doesn't get everything right, but 
there's a decent process to fix things that are wrong.
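
To make "derived from ICU UCM files" a bit more concrete, here is a rough sketch in 
Python of turning the CHARMAP section of a single-byte UCM file into lookup tables. 
It assumes entries of the form `<UXXXX> \xNN |P`, where P is 0 for a round-trip 
mapping, 1 for a Unicode-to-byte fallback and 3 for a byte-to-Unicode fallback. The 
function name `parse_ucm` and the sample data are made up for illustration and are 
not taken from any real ICU file.

```python
import re

# One single-byte CHARMAP entry: <UXXXX> \xNN |P
_ENTRY = re.compile(r"^<U([0-9A-Fa-f]{4,6})>\s+\\x([0-9A-Fa-f]{2})\s+\|(\d)")

def parse_ucm(text):
    """Return (decode, encode) dicts built from the CHARMAP section.

    decode maps byte value -> Unicode code point,
    encode maps Unicode code point -> byte value.
    """
    decode = {}
    encode = {}
    in_charmap = False
    for line in text.splitlines():
        line = line.strip()
        if line == "CHARMAP":
            in_charmap = True
            continue
        if line == "END CHARMAP":
            break
        if not in_charmap or not line or line.startswith("#"):
            continue
        m = _ENTRY.match(line)
        if m is None:
            continue  # multibyte or unrecognized entry: skipped in this sketch
        cp = int(m.group(1), 16)
        byte = int(m.group(2), 16)
        flag = int(m.group(3))
        if flag == 0:      # round trip: valid in both directions
            decode[byte] = cp
            encode[cp] = byte
        elif flag == 1:    # fallback: Unicode -> byte only
            encode.setdefault(cp, byte)
        elif flag == 3:    # reverse fallback: byte -> Unicode only
            decode.setdefault(byte, cp)
    return decode, encode

# Hypothetical sample data, for illustration only.
SAMPLE_UCM = r"""
<code_set_name> "sample"
CHARMAP
<U0041> \x41 |0
<U0027> \x27 |0
<U2018> \x27 |1
<U3000> \x81\x40 |0
END CHARMAP
"""

decode, encode = parse_ucm(SAMPLE_UCM)
```

In this sample both U+0027 and U+2018 end up encoding to byte 0x27, while byte 0x27 
decodes only to U+0027. The two-byte entry is ignored, in line with keeping only 
single-byte encodings built in.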

In regard to the built-in converter engine, it needs these key features (assuming 
single byte and Western/Central European languages plus Cyrillic only):

1) convert a single byte code point to Unicode.
2) convert several 'close cousins' from Unicode to a single character code. In other 
words, it needs to map Unicode N to x and Unicode M to x, etc., even though N and M 
are separate characters. The ICU UCM files have these alternate mappings in them. 
3) allow Perl internals (at least) to specify what to do when you can't translate a 
character from Unicode to the single byte encoding. This is important because you 
might introduce interesting bugs/security holes when, for example, a question mark 
gets splotched into your regular expression or, if you strip untranslatables, when the 
match string becomes empty. This seems like a potentially big mess to me, but 
hopefully it really isn't.
4) a non-feature for this language group is converting Unicode combining sequences (or 
simply multiple Unicode characters) to single characters in the code page (and vice 
versa). This is required for some encodings, but not in this language group. (There 
are cases where this would be nice, but this isn't a critical feature.)
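
Features 1 through 3 could be sketched roughly as follows, in Python for brevity. 
Everything here is hypothetical: the table contents, the function names and the 
policy names ("error", "substitute", "strip") are my own illustration, not an 
existing Perl or ICU interface. The default policy is to raise an error, so that a 
question mark never lands in a pattern silently.

```python
class Untranslatable(Exception):
    """Raised when a character has no mapping in the target encoding."""

# Tiny made-up tables; real ones would be derived from a UCM file.
DECODE = {0x41: 0x41, 0x27: 0x27, 0x3F: 0x3F}           # byte -> code point
ENCODE = {0x41: 0x41, 0x27: 0x27, 0x3F: 0x3F,           # code point -> byte
          0x2018: 0x27, 0x2019: 0x27}                   # 'close cousins' of 0x27

def decode_bytes(data, table=DECODE):
    """Feature 1: map each byte to Unicode (U+FFFD if the byte is unassigned)."""
    return "".join(chr(table.get(b, 0xFFFD)) for b in data)

def encode_text(text, table=ENCODE, on_unmappable="error", substitute=0x3F):
    """Features 2 and 3: many-to-one encoding with a caller-chosen policy."""
    out = bytearray()
    for ch in text:
        byte = table.get(ord(ch))
        if byte is not None:
            out.append(byte)
        elif on_unmappable == "error":
            raise Untranslatable("U+%04X has no mapping" % ord(ch))
        elif on_unmappable == "substitute":
            out.append(substitute)
        elif on_unmappable == "strip":
            continue
        else:
            raise ValueError("unknown policy: %r" % on_unmappable)
    return bytes(out)

quoted = encode_text("A\u2018\u2019")                    # cousins collapse to 0x27
replaced = encode_text("\u4e2d", on_unmappable="substitute")
stripped = encode_text("\u4e2d", on_unmappable="strip")
```

Note that the "strip" policy turns a one-character string into an empty one, which 
is exactly the empty match string case mentioned in 3) above.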

Thanks and regards,

=Ed



