[RFC] Merge charsets and encodings

Nick Wellnhofer Wed, 25 Aug 2010 05:49:24 -0700

I think the separation of charsets and encodings in the string code doesn't make sense. The way I see it, the only charset that's used in Parrot is Unicode. ASCII and ISO-8859-1 are subsets of Unicode, so they could be treated like the other UTF and UCS encodings.

Currently, you have to use trans_charset (to_charset in C) to convert a string to ISO-8859-1 but you have to use trans_encoding (to_encoding) to convert to UTF-16. That looks arbitrary and confusing to me. The encoding:charset combinations right now are:


- fixed8:ascii
- fixed8:iso-8859-1
- fixed8:binary
- utf8:unicode
- utf16:unicode
- ucs2:unicode
- ucs4:unicode

My proposal is to merge all the charset and encoding functions into a single kind of string vtable, eliminating duplicates like hash and find_cclass. I would keep the name "encoding", so there would be seven encodings:


- ascii
- iso-8859-1
- binary
- utf8
- utf16
- ucs2
- ucs4

The fixed8 and unicode encodings would still share many of their functions but it would be much easier to add specialisations. The string code would be simplified and the charset pointer in the string header could be removed.
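
To make the idea concrete, here is a rough sketch of what a merged table could look like. It is not Parrot's actual code: the struct, its fields and the function names (encoding_sketch, fixed8_validate, find_encoding_sketch) are all made up for illustration. It only shows how the three 8-bit encodings could share one function and differ in a single parameter, with no separate charset record.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical merged vtable: one record per encoding, no charset pointer.
 * All names here are assumptions for illustration, not Parrot's API. */
typedef struct encoding_sketch {
    const char *name;
    unsigned    max_byte;   /* highest byte value allowed (8-bit encodings) */
    int       (*validate)(const struct encoding_sketch *enc,
                          const unsigned char *buf, size_t len);
} encoding_sketch;

/* Shared by every fixed-width 8-bit encoding: only the limit differs. */
static int fixed8_validate(const encoding_sketch *enc,
                           const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] > enc->max_byte)
            return 0;
    }
    return 1;
}

/* The seven encodings from the list above. The Unicode entries leave
 * validate as NULL because their functions are omitted from this sketch. */
static const encoding_sketch encoding_table[] = {
    { "ascii",      0x7F, fixed8_validate },
    { "iso-8859-1", 0xFF, fixed8_validate },
    { "binary",     0xFF, fixed8_validate },
    { "utf8",       0,    NULL },
    { "utf16",      0,    NULL },
    { "ucs2",       0,    NULL },
    { "ucs4",       0,    NULL },
};

/* Look an encoding up by name; returns NULL if the name is unknown. */
const encoding_sketch *find_encoding_sketch(const char *name)
{
    size_t n = sizeof encoding_table / sizeof encoding_table[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(encoding_table[i].name, name) == 0)
            return &encoding_table[i];
    }
    return NULL;
}
```

A specialisation for one encoding would then just be a different function pointer in that encoding's row.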

Then the charset opcodes "charset", "charsetname", "find_charset", and "trans_charset" could go away. We can also keep them for a while and map them to the encoding opcodes for backwards compatibility.

We can also keep the encoding:charset:"string" syntax for string literals and simply try to look up both the encoding and the charset name for full backwards compatibility.
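
One possible reading of "look up both", as a sketch only: try the encoding part of the prefix against the seven merged names, and fall back to the charset part if that fails. The function names below (lookup_merged, resolve_literal_prefix) are hypothetical, and the order of the two lookups is my assumption.

```c
#include <stddef.h>
#include <string.h>

/* The seven merged encoding names from the proposal. */
static const char *const merged_names[] = {
    "ascii", "iso-8859-1", "binary", "utf8", "utf16", "ucs2", "ucs4"
};

/* Return the canonical name if it is a merged encoding, else NULL. */
static const char *lookup_merged(const char *name)
{
    if (name == NULL)
        return NULL;
    for (size_t i = 0; i < sizeof merged_names / sizeof merged_names[0]; i++) {
        if (strcmp(merged_names[i], name) == 0)
            return merged_names[i];
    }
    return NULL;
}

/* Hypothetical helper: map a legacy encoding:charset prefix to one
 * merged encoding. charset may be NULL for a single-name prefix. */
const char *resolve_literal_prefix(const char *encoding, const char *charset)
{
    const char *hit = lookup_merged(encoding);
    return hit != NULL ? hit : lookup_merged(charset);
}
```

With this rule fixed8:ascii resolves to ascii and utf16:unicode resolves to utf16, because "fixed8" and "unicode" are no longer names in the table. A mismatched pair such as utf8:ascii would silently resolve to utf8 here; whether that should be an error instead is left open.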


Nick
_______________________________________________
http://lists.parrot.org/mailman/listinfo/parrot-dev
