Hi,

> the html might bring also bad encoding in the head, but I
> guess it is also due to the fact that source-highlight reads
> two bytes, which in unicode represent a single character,
> and interprets them as two characters instead of one.
> This is unicode, am I right? Sorry for my ignorance,
> but with unicode in a text file every character is
> represented by two bytes, right?

Nope. There is not one standard Unicode encoding, but several. The most
common one is UTF-8, a variable-length encoding where each Unicode
character takes from 1 to 4 bytes (originally it was up to 6, but
sequences longer than 4 bytes are no longer valid). Another
variable-length encoding is UTF-16, where each character occupies
either 2 or 4 bytes. The only fixed-length encoding is UTF-32 (UCS-4),
where each character requires 4 bytes.

> I'd like to try with wstring and see whether this solves
> something.

I haven't used C++ in a long time, but isn't wstring based on wchar_t,
which is 2 bytes long on some platforms? (Its size is
implementation-defined: 2 bytes on Windows, typically 4 bytes on
Linux.) Where it is 2 bytes, it won't solve anything. There is no
Unicode encoding that covers every character with a fixed length of
2 bytes! The obsolete UCS-2 was fixed at 2 bytes, but it can only
represent the Basic Multilingual Plane.

Lorenzo, I think we can give you a hand in implementing this. However,
if you read through this entire thread you will notice that the best
course of action depends on a crucial piece of information which you
are the most qualified person to provide: we need a list of the
manipulations that Source-highlight applies to strings.

Hope that helps!
Best regards,
Dario Teixeira
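
P.S. To make the byte counts above concrete, here is a minimal Python
sketch. It is only an illustration and has nothing to do with how
Source-highlight is implemented; the helper name encoded_lengths and
the sample characters are my own assumptions, picked for the demo. It
encodes one character at a time and reports how many bytes it takes in
each encoding:

```python
def encoded_lengths(ch: str) -> dict:
    """Return the number of bytes one character occupies in each encoding.

    Hypothetical helper for illustration only.
    """
    # The "-le" codec variants are used so that Python does not prepend
    # a byte order mark, which would inflate the counts.
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


if __name__ == "__main__":
    # U+0041 (A), U+00E9 (e with acute), U+20AC (euro sign),
    # U+1D11E (musical G clef, outside the Basic Multilingual Plane)
    samples = ["A", "\u00e9", "\u20ac", "\U0001d11e"]
    for ch in samples:
        print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

The last sample takes 4 bytes in UTF-16 as well as in UTF-8, which is
why a 2-byte wchar_t cannot hold every character in a single unit. The
second sample takes 2 bytes in UTF-8, so code that walks a UTF-8 string
byte by byte will see two "characters" where there is only one.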
