If I understand (info "(elisp)Converting Representations") correctly, Emacs converts unibyte text to multibyte when it is inserted into a multibyte buffer. On Windows, however, I observed that text copied from the character table and pasted into Emacs, guillemets in particular, remains in its unibyte representation. Typing `C-u C-x =' on a « character gives the following result with a CVS Emacs checked out and compiled a few days ago:

,----
| character: « (0253, 171, 0xab)
| charset: eight-bit-graphic (8-bit graphic char (0xA0..0xFF))
| code point: 171
| syntax: which means: whitespace
| buffer code: 0xAB
| file code: 0xAB (encoded by coding system raw-text-dos)
| display: by display table entry [?«] (see below)
|
| The display table entry is displayed by these fonts (glyph codes):
| «: -raster-Courier-normal-r-normal-normal-20-120-120-120-c-120-iso8859-1 (0xAB)
|
| There are text properties here:
| fontified t
`----

I would have expected to see the multibyte representation:

,----
| character: « (04253, 2219, 0x8ab, U+00AB)
| charset: latin-iso8859-1
| (Right-Hand Part of Latin Alphabet 1 (ISO/IEC 8859-1): ISO-IR-100.)
| code point: 43
| syntax: . which means: punctuation
| category: l:Latin
| buffer code: 0x81 0xAB
| file code: 0xC2 0xAB (encoded by coding system mule-utf-8-dos)
| display: by this font (glyph code)
| -raster-Courier-normal-r-normal-normal-20-120-120-120-c-120-iso8859-1 (0xAB)
|
| There are text properties here:
| face [font-latex-string-face]
| fontified t
`----

(This was the result of pasting into a UTF-8 buffer.)

I am not sure if this is a bug, a user mistake, or something else. On GNU/Linux I can simulate the problem by typing `M-: (insert 171) RET' in a Latin-1 buffer.

My problem is that I have to compare the guillemet found in the buffer with another one in Lisp code in order to find the matching closing one for font locking. `re-search-forward' obviously finds the opening guillemet in its unibyte form, but comparing it with a multibyte guillemet then fails. (What happens is probably something like `(string= (string 171) (string 2219))'.)

So I am wondering whether the unibyte strings should not be present in the buffer in the first place[1], or whether I have to explicitly convert the unibyte strings to multibyte (e.g. with `string-make-multibyte').
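
To make the second option concrete, here is a minimal sketch of an explicit conversion at the string level. It is only an illustration under assumptions: it treats the stray byte as Latin-1 and uses `decode-coding-string' instead of `string-make-multibyte', the helper name `my-guillemet-to-multibyte' is made up, and it only does anything for strings that `multibyte-string-p' reports as unibyte. Whether text taken from the buffer actually falls into that case is part of what I am unsure about.

```elisp
;; Hypothetical helper (not part of Emacs): return STR as a multibyte
;; string, decoding unibyte input on the assumption that it is Latin-1.
(defun my-guillemet-to-multibyte (str)
  (if (multibyte-string-p str)
      str
    (decode-coding-string str 'latin-1)))

;; "\253" is a unibyte string holding the single byte 0xAB (171).
(let ((opening (my-guillemet-to-multibyte "\253")))
  ;; After conversion the string should be multibyte, so comparing it
  ;; with a multibyte guillemet no longer mixes representations.
  (multibyte-string-p opening))
```

The idea would be to run both the string found by `re-search-forward' and the one from the Lisp code through the same helper before calling `string='.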

Footnotes:
[1] Such strings are, BTW, a nice way to shoot yourself in the foot:

(progn
  (find-file "foo.txt")
  (insert 171 "foo" 187 "\n")
  (set-buffer-file-coding-system 'mule-utf-8)
  (save-buffer)
  (kill-buffer (current-buffer))
  (find-file "foo.txt")
  (insert 171 "bar" 187 "\n")
  (set-buffer-file-coding-system 'mule-utf-8)
  (save-buffer)
  (kill-buffer (current-buffer))
  (find-file "foo.txt"))

-- 
Ralf

_______________________________________________
Help-gnu-emacs mailing list
Help-gnu-emacs@gnu.org
http://lists.gnu.org/mailman/listinfo/help-gnu-emacs