I think the separation of charsets and encodings in the string code doesn't make sense. The way I see it, the only charset that's used in Parrot is Unicode. ASCII and ISO-8859-1 are subsets of Unicode, so they could be treated like the other UTF and UCS encodings.

Currently, you have to use trans_charset (to_charset in C) to convert a string to ISO-8859-1, but trans_encoding (to_encoding) to convert it to UTF-16. That looks arbitrary and confusing to me. The encoding:charset combinations right now are:

- fixed8:ascii
- fixed8:iso-8859-1
- fixed8:binary
- utf8:unicode
- utf16:unicode
- ucs2:unicode
- ucs4:unicode

My proposal is to merge all the charset and encoding functions into a single kind of string vtable, eliminating duplicates like hash and find_cclass. I would keep the name "encoding", so there would be seven encodings:

- ascii
- iso-8859-1
- binary
- utf8
- utf16
- ucs2
- ucs4

The former fixed8 encodings and the Unicode encodings would each still share many of their functions, but it would be much easier to add specialisations. The string code would be simplified and the charset pointer in the string header could be removed.
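To make the idea more concrete, here is a rough sketch in C of what a merged vtable could look like. None of the names below (str_encoding, fixed8_codepoints, and so on) are taken from the Parrot sources; they are made up for illustration, and only four of the seven encodings are shown. The UTF-8 validator is deliberately a structural check only.

```c
#include <stddef.h>

/* Hypothetical merged vtable: one table per encoding, no separate charset.
 * All identifiers are illustrative, not Parrot's actual API. */
typedef struct str_encoding {
    const char *name;
    /* number of code points stored in buf[0..bytes) */
    size_t (*codepoints)(const unsigned char *buf, size_t bytes);
    /* 1 if buf[0..bytes) is acceptable for this encoding, else 0 */
    int    (*validate)(const unsigned char *buf, size_t bytes);
} str_encoding;

/* Shared by every one-byte-per-character encoding. */
static size_t fixed8_codepoints(const unsigned char *buf, size_t bytes) {
    (void)buf;
    return bytes;
}

/* iso-8859-1 and binary accept any byte sequence. */
static int always_valid(const unsigned char *buf, size_t bytes) {
    (void)buf; (void)bytes;
    return 1;
}

/* Specialisation: ascii only accepts bytes below 0x80. */
static int ascii_validate(const unsigned char *buf, size_t bytes) {
    for (size_t i = 0; i < bytes; i++)
        if (buf[i] >= 0x80) return 0;
    return 1;
}

/* Count every byte that is not a UTF-8 continuation byte. */
static size_t utf8_codepoints(const unsigned char *buf, size_t bytes) {
    size_t n = 0;
    for (size_t i = 0; i < bytes; i++)
        if ((buf[i] & 0xC0) != 0x80) n++;
    return n;
}

/* Structural check only: each lead byte must be followed by the right
 * number of continuation bytes. Overlong forms and surrogates are NOT
 * rejected here. */
static int utf8_validate(const unsigned char *buf, size_t bytes) {
    size_t i = 0;
    while (i < bytes) {
        unsigned char c = buf[i];
        size_t extra;
        if (c < 0x80)                extra = 0;
        else if ((c & 0xE0) == 0xC0) extra = 1;
        else if ((c & 0xF0) == 0xE0) extra = 2;
        else if ((c & 0xF8) == 0xF0) extra = 3;
        else return 0;                        /* stray continuation or bad lead byte */
        if (extra > bytes - i - 1) return 0;  /* truncated sequence */
        for (size_t k = 1; k <= extra; k++)
            if ((buf[i + k] & 0xC0) != 0x80) return 0;
        i += extra + 1;
    }
    return 1;
}

/* utf16, ucs2 and ucs4 are omitted to keep the sketch short. */
static const str_encoding encodings[] = {
    { "ascii",      fixed8_codepoints, ascii_validate },
    { "iso-8859-1", fixed8_codepoints, always_valid   },
    { "binary",     fixed8_codepoints, always_valid   },
    { "utf8",       utf8_codepoints,   utf8_validate  },
};
```

The three one-byte encodings share fixed8_codepoints, while ascii gets its own validate entry. That is the kind of specialisation that becomes a one-line change once there is only a single table per encoding.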

Then the charset opcodes "charset", "charsetname", "find_charset", and "trans_charset" could go away. We can also keep them for a while and map them to the encoding opcodes for backwards compatibility.

We can also keep the encoding:charset:"string" syntax for string literals and simply try to look up both the encoding name and the charset name among the merged encodings, for full backwards compatibility.
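One possible reading of that lookup, sketched in C: given the prefix of a literal (the part before the quoted string, without the trailing colon), try the old charset part first and fall back to the old encoding part. The function names and the order of the two lookups are my assumptions, not existing Parrot code.

```c
#include <stddef.h>
#include <string.h>

/* The seven merged encoding names from the proposal. */
static const char *const merged_encodings[] = {
    "ascii", "iso-8859-1", "binary", "utf8", "utf16", "ucs2", "ucs4"
};

/* Return the table entry equal to name[0..len), or NULL if there is none. */
static const char *find_encoding(const char *name, size_t len) {
    size_t n = sizeof merged_encodings / sizeof merged_encodings[0];
    for (size_t i = 0; i < n; i++)
        if (strlen(merged_encodings[i]) == len &&
            memcmp(merged_encodings[i], name, len) == 0)
            return merged_encodings[i];
    return NULL;
}

/* Hypothetical helper: map an old-style literal prefix such as
 * "fixed8:iso-8859-1" or "utf16:unicode" to a merged encoding name.
 * Returns NULL when neither part names a merged encoding. */
static const char *resolve_literal_prefix(const char *prefix) {
    const char *colon = strchr(prefix, ':');
    const char *found;

    if (colon == NULL)                       /* single name, e.g. "ascii" */
        return find_encoding(prefix, strlen(prefix));

    /* Old charset part (after the colon) first ... */
    found = find_encoding(colon + 1, strlen(colon + 1));
    if (found != NULL)
        return found;

    /* ... then the old encoding part (before the colon). */
    return find_encoding(prefix, (size_t)(colon - prefix));
}
```

With this order, "fixed8:ascii" resolves to ascii and "utf8:unicode" resolves to utf8. A prefix like "fixed8:unicode", where neither part is one of the seven names, yields NULL and would have to be reported as an error by the caller.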

Nick