* Tobias Kremer <[EMAIL PROTECTED]> [2007-08-10 12:41]:
> Quoting Tatsuhiko Miyagawa <[EMAIL PROTECTED]>:
> > Concatenating utf-8 flagged variables with utf-8 encoded byte
> > string causes automatic SV upgrade, which causes double utf-8
> > encoded string.
>
> Hmmm. So my templates are utf8 _ENCODED_ and the strings coming
> in from other perl modules are just utf8 _FLAGGED_. When TT
> concats them together during process() the result is wrecked
> because of the automatic upgrade. Correct?

Forget the fact that they are UTF-8 flagged. Think of it this
way: Perl has two kinds of strings, byte strings and character
strings.

Byte strings consist of, well, bytes; they might be text, or
maybe they’re not. If they are, they are _encoded_; to understand
the text you have to _decode_ the byte sequence to characters.
This notion may seem weird if you haven’t dealt with Unicode in
depth, because most legacy character sets have at most 256
characters, which they just represent using a single byte each.
But if you have more than 256 characters (and Unicode has a lot
more), then suddenly you have to pick some way to represent the
character codes. A sequence of bytes alone is meaningless as text
until you know what encoding it’s in.

Character strings, OTOH, consist of Unicode characters; pure,
ideal, atomic characters that have no particular representation.
Of course the interpreter has to store these ideal characters
somehow, so it uses UTF-8 internally; but that could equally well
be UTF-16 or UCS-4 or for that matter ASCII plus XML entities.

For deeper exposition of the concepts (what is an ideal character
and how does it relate to encodings), read Joel Spolsky’s classic
article:

    The Absolute Minimum Every Software Developer Absolutely,
    Positively Must Know About Unicode and Character Sets
    (No Excuses!)
    http://www.joelonsoftware.com/articles/Unicode.html

Anyway, the problem you are seeing is that things only work as
long as you stay in one realm. F.ex., if you mix byte strings,
and the bytes represent text encoded with the same encoding in
both strings, you can mix them just fine. Note though that with
multibyte or variable-width encodings (eg. UCS-2 and UTF-8
respectively), you will have to be careful to take the encoding
into account in every string mutation. F.ex. if you truncate post
titles for display in a sidebar, you will have to manually take
care not to cut the string off in the middle of a multi-byte
character.

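To make the two kinds of strings and the truncation hazard
concrete, here is a minimal sketch, assuming the core Encode
module. The title string, the variable names and the hex_dump
helper are made up for illustration; they do not come from the
code under discussion.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: render each element of a string as two hex digits.
sub hex_dump { join ' ', map { sprintf '%02X', ord } split //, shift }

# A character string: 7 characters, \x{fc} being U+00FC (u with diaeresis).
my $chars = "M\x{fc}nchen";

# The same text as a byte string: 8 bytes, because U+00FC encodes to C3 BC.
my $bytes = encode('UTF-8', $chars);

# Truncating the byte string after 2 bytes cuts U+00FC in half.
my $cut_bytes = substr($bytes, 0, 2);

# Truncating the character string is safe; encode only on the way out.
my $cut_chars = encode('UTF-8', substr($chars, 0, 2));

# The damaged byte string can no longer be decoded as strict UTF-8.
my $still_valid = eval {
    decode('UTF-8', $cut_bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    1;
} ? 1 : 0;

printf "characters: %d, bytes: %d\n", length($chars), length($bytes);
print  "cut as bytes:      ", hex_dump($cut_bytes), "\n";   # 4D C3
print  "cut as characters: ", hex_dump($cut_chars), "\n";   # 4D C3 BC
print  "cut bytes still valid UTF-8? ", ($still_valid ? "yes" : "no"), "\n";
```

The same substr() call is harmless on the character string and
destructive on the byte string, which is the reason to decode
input as early as possible and encode output as late as possible.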
Likewise, if the strings are both character strings, then you can
mix them no problem. And because they consist of pure ideal
characters, any operations on them treat characters as atomic.
You do not need to care whether a character is one, two, three or
however many bytes in the internal representation used by Perl;
you can just truncate strings or run substitutions on them etc
without worrying.

But if you mix byte strings and character strings, there is
trouble. Perl must find out what characters are in the byte
string, so it must decode it. By default it does so by assuming
that byte strings are text encoded in ISO-8859-1. If this is the
wrong encoding, because, say, your data was actually
UTF-8-encoded, well, oops: now you have UTF-8 that was decoded as
ISO-8859-1, which leads to the well-known artifacts (f.ex. “Ã¼”
showing up where “ü” should be).

Note, however, that you can change the default using the
`encoding` pragma. See `perldoc encoding`. If the program code
itself is in UTF-8, you may want to declare that also: see
`perldoc utf8`.

And finally, see `perldoc perlunicode`.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>

_______________________________________________
List: Catalyst@lists.rawmode.org
Listinfo: http://lists.rawmode.org/mailman/listinfo/catalyst
Searchable archive: http://www.mail-archive.com/catalyst@lists.rawmode.org/
Dev site: http://dev.catalyst.perl.org/