Re: Emacs puts binary junk into the clipboard, marking it as text
Kenichi Handa wrote:
> This will be useful for checking UTF-8 validity.
>
> (define-ccl-program ccl-check-utf-8
>   '(0
>     ((r0 = 1)
>      (loop
>       (read-if (r1 < #x80) (repeat)
>         ((r0 = 0)
>          (if (r1 < #xC2) (end))
>          (read r2)
>          (if ((r2 & #xC0) != #x80) (end))
>          (if (r1 < #xE0) ((r0 = 1) (repeat)))
>          (read r2)
>          (if ((r2 & #xC0) != #x80) (end))
>          (if (r1 < #xF0) ((r0 = 1) (repeat)))
>          (read r2)
>          (if ((r2 & #xC0) != #x80) (end))
>          (if (r1 < #xF8) ((r0 = 1) (repeat)))
>          (read r2)
>          (if ((r2 & #xC0) != #x80) (end))
>          (if (r1 == #xF8) ((r0 = 1) (repeat)))
>          (end))))))
>   "Check if the input unibyte string is a valid UTF-8 sequence or not.
> If it is valid, set the register `r0' to 1, else set it to 0.")
>
> (defun string-utf-8-p (string)
>   "Return non-nil iff STRING is a unibyte string of valid UTF-8 sequence."
>   (if (or (not (stringp string))
>           (multibyte-string-p string))
>       (error "Not a unibyte string: %s" string))
>   (let ((status (make-vector 9 0)))
>     (ccl-execute-on-string ccl-check-utf-8 status string)
>     (= (aref status 0) 1)))

Thanks. I used them to check for UTF-8. We now decline selection requests for UTF8_STRING if the data is not in UTF-8.

    Jan D.

___
emacs-pretest-bug mailing list
emacs-pretest-bug@gnu.org
http://lists.gnu.org/mailman/listinfo/emacs-pretest-bug
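The same kind of check, and the "decline if not UTF-8" behaviour described above, can be sketched in plain Emacs Lisp without CCL. This is an illustration only, not the code that was checked in. The names `my-utf-8-structure-p` and `my-convert-to-utf8-string` are hypothetical. The check looks only at lead and continuation bytes for sequences of up to four bytes, so it does not reject overlong three- or four-byte forms or surrogates. Returning nil to decline the request is an assumption about how a selection converter signals refusal.

```elisp
;; Hypothetical helper: structural UTF-8 check on a unibyte string.
(defun my-utf-8-structure-p (bytes)
  "Return non-nil if unibyte string BYTES has well-formed lead/continuation bytes."
  (let ((i 0)
        (n (length bytes))
        (ok t))
    (while (and ok (< i n))
      (let* ((lead (aref bytes i))
             ;; Number of continuation bytes implied by the lead byte,
             ;; or nil when the byte cannot start a sequence.
             (extra (cond ((< lead #x80) 0)
                          ((< lead #xC2) nil)
                          ((< lead #xE0) 1)
                          ((< lead #xF0) 2)
                          ((< lead #xF8) 3)
                          (t nil))))
        (setq i (1+ i))
        (if (or (null extra) (> (+ i extra) n))
            ;; Invalid lead byte, or the sequence is cut off by end of input.
            (setq ok nil)
          (dotimes (_ extra)
            ;; Every continuation byte must look like 10xxxxxx.
            (unless (= (logand (aref bytes i) #xC0) #x80)
              (setq ok nil))
            (setq i (1+ i))))))
    ok))

;; Hypothetical converter: answer a UTF8_STRING request or decline it.
(defun my-convert-to-utf8-string (bytes)
  "Return (UTF8_STRING . BYTES) if BYTES passes the check, else nil."
  (and (my-utf-8-structure-p bytes)
       (cons 'UTF8_STRING bytes)))
```

With this sketch, the one-byte string from the original report, "\300", is declined, while "\303\200" (the UTF-8 form of U+00C0) is passed through unchanged.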
Re: Emacs puts binary junk into the clipboard, marking it as text
Kenichi Handa wrote:
> In article [EMAIL PROTECTED], Jan D. [EMAIL PROTECTED] writes:
>> I've checked in a fix that changes UTF8_STRING to STRING if the data doesn't look like UTF8. However, this might give errors too. The only way to be sure to copy raw binary data correctly is by adding a new type (like application-specific/octet-stream). But if we do that, nobody will be able to get data from Emacs, as such a type is not standard and unsupported. Copy-paste with raw binary data is just something most apps don't do.
>
> AFAIK, only when TEXT is requested, a selection owner can choose the returning type from STRING, COMPOUND_TEXT, or UTF8_STRING. When UTF8_STRING is requested, we should return it or return nothing.
>
> And, if Emacs owns a unibyte string, perhaps the right thing is to make it multibyte according to the current lang. env. (by string-make-multibyte) at first, then encode it by utf-8.

What would that do to illegal UTF-8 sequences in the original unibyte string? I.e. will this procedure always produce valid UTF-8 data?

    Jan D.
Re: Emacs puts binary junk into the clipboard, marking it as text
In article [EMAIL PROTECTED], Jan Djärv [EMAIL PROTECTED] writes:
>> AFAIK, only when TEXT is requested, a selection owner can choose the returning type from STRING, COMPOUND_TEXT, or UTF8_STRING. When UTF8_STRING is requested, we should return it or return nothing.
>>
>> And, if Emacs owns a unibyte string, perhaps the right thing is to make it multibyte according to the current lang. env. (by string-make-multibyte) at first, then encode it by utf-8.
>
> What would that do to illegal UTF-8 sequences in the original unibyte string?

The original unibyte string won't be in UTF-8 format. But string-make-multibyte will convert it to a correct multibyte string, thus encoding that multibyte string by UTF-8 will produce a correct UTF-8 string ... usually.

> I.e. will this procedure always produce valid UTF-8 data?

No. If a byte in the original unibyte string is not a valid code point of the primary charset of the current lang. env., string-make-multibyte will produce a multibyte string that contains an eight-bit-control or eight-bit-graphic character. Then, encoding it by UTF-8 will result in an incorrect UTF-8 sequence. So, for safety, we must delete such eight-bit characters or replace them with U+FFFD (REPLACEMENT CHARACTER) before encoding by UTF-8. Or, in such a case, don't return anything (which means Emacs doesn't hold the requested data).

---
Kenichi Handa
[EMAIL PROTECTED]
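The "replace them with U+FFFD before encoding" option could look roughly like the sketch below. It is a variation on the proposal rather than the same procedure: it decodes the unibyte string with the utf-8 coding system instead of going through string-make-multibyte and the language environment, and it assumes a recent Emacs in which undecodable bytes become characters of the `eight-bit' charset (in the Emacs 22 of this thread they were eight-bit-control and eight-bit-graphic). The function name `my-encode-utf-8-replacing-raw-bytes` is hypothetical.

```elisp
;; Hypothetical helper: always produce valid UTF-8, losing raw bytes.
(defun my-encode-utf-8-replacing-raw-bytes (string)
  "Return STRING encoded as UTF-8, with each raw byte replaced by U+FFFD."
  (let ((text (if (multibyte-string-p string)
                  string
                ;; Assumption: treat unibyte input as (possibly broken) UTF-8.
                (decode-coding-string string 'utf-8))))
    (encode-coding-string
     (mapconcat (lambda (ch)
                  ;; Raw bytes belong to the `eight-bit' charset.
                  (string (if (eq (char-charset ch) 'eight-bit)
                              #xFFFD
                            ch)))
                text "")
     'utf-8)))
```

The trade-off is the one debated in this thread: the result is always acceptable as UTF8_STRING, but the original bytes can no longer be recovered by the requestor.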
Re: Emacs puts binary junk into the clipboard, marking it as text
>> I've checked in a fix that changes UTF8_STRING to STRING if the data doesn't look like UTF8. However, this might give errors too. The only way to be sure to copy raw binary data correctly is by adding a new type (like application-specific/octet-stream). But if we do that, nobody will be able to get data from Emacs, as such a type is not standard and unsupported. Copy-paste with raw binary data is just something most apps don't do.
>
> AFAIK, only when TEXT is requested, a selection owner can choose the returning type from STRING, COMPOUND_TEXT, or UTF8_STRING. When UTF8_STRING is requested, we should return it or return nothing.

Also IIRC a perfectly valid utf-8 buffer may contain eight-bit-* chars, used to keep track of valid unicode chars that have no corresponding character in emacs-mule. So the presence of eight-bit-* chars does not imply that the utf-8 encoded form of the text will contain an invalid utf-8 byte sequence.

> And, if Emacs owns a unibyte string, perhaps the right thing is to make it multibyte according to the current lang. env. (by string-make-multibyte) at first, then encode it by utf-8.

That sounds terribly fragile/buggy.

    Stefan
Re: Emacs puts binary junk into the clipboard, marking it as text
In article [EMAIL PROTECTED], Stefan Monnier [EMAIL PROTECTED] writes:
>> AFAIK, only when TEXT is requested, a selection owner can choose the returning type from STRING, COMPOUND_TEXT, or UTF8_STRING. When UTF8_STRING is requested, we should return it or return nothing.
>
> Also IIRC a perfectly valid utf-8 buffer may contain eight-bit-* chars, used to keep track of valid unicode chars that have no corresponding character in emacs-mule. So the presence of eight-bit-* chars does not imply that the utf-8 encoded form of the text will contain an invalid utf-8 byte sequence.

Yes, but such eight-bit-* chars can be detected by checking the `untranslated-utf-8' property.

>> And, if Emacs owns a unibyte string, perhaps the right thing is to make it multibyte according to the current lang. env. (by string-make-multibyte) at first, then encode it by utf-8.
>
> That sounds terribly fragile/buggy.

Then, what do you think Emacs should do in such a case?

---
Kenichi Handa
[EMAIL PROTECTED]
Re: Emacs puts binary junk into the clipboard, marking it as text
>> Also IIRC a perfectly valid utf-8 buffer may contain eight-bit-* chars, used to keep track of valid unicode chars that have no corresponding character in emacs-mule. So the presence of eight-bit-* chars does not imply that the utf-8 encoded form of the text will contain an invalid utf-8 byte sequence.
>
> Yes, but such eight-bit-* chars can be detected by checking the `untranslated-utf-8' property.

Sure, but the current code doesn't do that.

>>> And, if Emacs owns a unibyte string, perhaps the right thing is to make it multibyte according to the current lang. env. (by string-make-multibyte) at first, then encode it by utf-8.
>>
>> That sounds terribly fragile/buggy.
>
> Then, what do you think Emacs should do in such a case?

I think we can't know what should be done, so we should strive for simplicity and try to avoid losing information. I.e. just return the unibyte string as-is.

    Stefan
Re: Emacs puts binary junk into the clipboard, marking it as text
Stefan Monnier wrote:
>>> Also IIRC a perfectly valid utf-8 buffer may contain eight-bit-* chars, used to keep track of valid unicode chars that have no corresponding character in emacs-mule. So the presence of eight-bit-* chars does not imply that the utf-8 encoded form of the text will contain an invalid utf-8 byte sequence.
>>
>> Yes, but such eight-bit-* chars can be detected by checking the `untranslated-utf-8' property.
>
> Sure, but the current code doesn't do that.
>
>>>> And, if Emacs owns a unibyte string, perhaps the right thing is to make it multibyte according to the current lang. env. (by string-make-multibyte) at first, then encode it by utf-8.
>>>
>>> That sounds terribly fragile/buggy.
>>
>> Then, what do you think Emacs should do in such a case?
>
> I think we can't know what should be done, so we should strive for simplicity and try to avoid losing information. I.e. just return the unibyte string as-is.

That was the problem the original report was about. Gtk+ applications print big warnings. And there is no agreed-upon selection type that represents just bytes.

W.r.t. the standards, Emacs has two choices: return a valid UTF8_STRING or don't return anything at all. I'm beginning to think the second option is the best.

    Jan D.
Re: Emacs puts binary junk into the clipboard, marking it as text
In article [EMAIL PROTECTED], Stefan Monnier [EMAIL PROTECTED] writes:
> I think we can't know what should be done, so we should strive for simplicity and try to avoid losing information. I.e. just return the unibyte string as-is.

Even if it doesn't conform to ICCCM? I'll attach the relevant part of that document.

Jan D. [EMAIL PROTECTED] writes:
> W.r.t the standards, Emacs has two choices, return a valid UTF8-string or don't return anything at all. I'm beginning to think the second option is the best.

This will be useful for checking UTF-8 validity.

(define-ccl-program ccl-check-utf-8
  '(0
    ((r0 = 1)
     (loop
      (read-if (r1 < #x80) (repeat)
        ((r0 = 0)
         (if (r1 < #xC2) (end))
         (read r2)
         (if ((r2 & #xC0) != #x80) (end))
         (if (r1 < #xE0) ((r0 = 1) (repeat)))
         (read r2)
         (if ((r2 & #xC0) != #x80) (end))
         (if (r1 < #xF0) ((r0 = 1) (repeat)))
         (read r2)
         (if ((r2 & #xC0) != #x80) (end))
         (if (r1 < #xF8) ((r0 = 1) (repeat)))
         (read r2)
         (if ((r2 & #xC0) != #x80) (end))
         (if (r1 == #xF8) ((r0 = 1) (repeat)))
         (end))))))
  "Check if the input unibyte string is a valid UTF-8 sequence or not.
If it is valid, set the register `r0' to 1, else set it to 0.")

(defun string-utf-8-p (string)
  "Return non-nil iff STRING is a unibyte string of valid UTF-8 sequence."
  (if (or (not (stringp string))
          (multibyte-string-p string))
      (error "Not a unibyte string: %s" string))
  (let ((status (make-vector 9 0)))
    (ccl-execute-on-string ccl-check-utf-8 status string)
    (= (aref status 0) 1)))

---
Kenichi Handa
[EMAIL PROTECTED]

Inter-Client Communication Conventions Manual
Version 2.0.xf86.1

[...]

2.7. Use of Selection Properties

The names of the properties used in selection data transfer are chosen by the requestor. The use of None property fields in ConvertSelection requests (which request the selection owner to choose a name) is not permitted by these conventions.

The selection owner always chooses the type of the property in the selection data transfer.
Some types have special semantics assigned by convention, and these are reviewed in the following sections.

In all cases, a request for conversion to a target should return either a property of one of the types listed in the previous table for that target or a property of type INCR and then a property of one of the listed types.

Certain selection properties may contain resource IDs. The selection owner should ensure that the resource is not destroyed and that its contents are not changed until after the selection transfer is complete. Requestors that rely on the existence or on the proper contents of a resource must operate on the resource (for example, by copying the contents of a pixmap) before deleting the selection property.

The selection owner will return a list of zero or more items of the type indicated by the property type. In general, the number of items in the list will correspond to the number of disjoint parts of the selection. Some targets (for example, side-effect targets) will be of length zero irrespective of the number of disjoint selection parts. In the case of fixed-size items, the requestor may determine the number of items by the property size. Selection property types are listed in the table below. For variable-length items such as text, the separators are also listed.

Type Atom        Format   Separator
-------------------------------------
APPLE_PICT       8        Self-sizing
ATOM             32       Fixed-size
ATOM_PAIR        32       Fixed-size
BITMAP           32       Fixed-size
C_STRING         8        Zero
COLORMAP         32       Fixed-size
COMPOUND_TEXT    8        Zero
DRAWABLE         32       Fixed-size
INCR             32       Fixed-size
INTEGER          32       Fixed-size
PIXEL            32       Fixed-size
PIXMAP           32       Fixed-size
SPAN             32       Fixed-size
STRING           8        Zero
UTF8_STRING      8        Zero
WINDOW           32       Fixed-size
-------------------------------------

It is expected that this table will grow over time.

2.7.1. TEXT Properties

In general, the encoding for the characters in a text string property is specified by its type.
It is highly desirable for there to be a simple, invertible mapping between string property types and any character set names embedded within font names in any font naming standard adopted by the Consortium.

The atom TEXT is a polymorphic target. Requesting conversion into TEXT will convert into whatever encoding is convenient for the owner. The encoding chosen will be indicated by the type of the property returned. TEXT is not defined as a type; it will never be the returned type from a selection conversion request.

If the requestor wants the owner to return the contents of the selection in a specific encoding, it
Re: Emacs puts binary junk into the clipboard, marking it as text
In article [EMAIL PROTECTED], Jan D. [EMAIL PROTECTED] writes:
> I've checked in a fix that changes UTF8_STRING to STRING if the data doesn't look like UTF8. However, this might give errors too. The only way to be sure to copy raw binary data correctly is by adding a new type (like application-specific/octet-stream). But if we do that, nobody will be able to get data from Emacs, as such a type is not standard and unsupported. Copy-paste with raw binary data is just something most apps don't do.

AFAIK, only when TEXT is requested, a selection owner can choose the returning type from STRING, COMPOUND_TEXT, or UTF8_STRING. When UTF8_STRING is requested, we should return it or return nothing.

And, if Emacs owns a unibyte string, perhaps the right thing is to make it multibyte according to the current lang. env. (by string-make-multibyte) at first, then encode it by utf-8.

---
Kenichi Handa
[EMAIL PROTECTED]
Re: Emacs puts binary junk into the clipboard, marking it as text
Kevin Rodgers wrote:
> Jan Djärv wrote:
>> Chris Moore wrote:
>>> A very simple case which reproduces the bug:
>>>
>>> I made a 1-byte file containing just character 0300 (octal), copied that using Emacs, and clipman started printing its error message over and over again.
>>>
>>> I reported this bug firstly to the Xfce BTS:
>>>   http://bugzilla.xfce.org/show_bug.cgi?id=1945
>>> but they told me it was a gtk bug, so I raised the same bug in the GNOME tracker:
>>>   http://bugzilla.gnome.org/show_bug.cgi?id=349856
>>> and they tell me it's an Emacs bug, saying:
>>>
>>>   "Well, if emacs puts binary junk into a text property it is not gtk's fault. Look at gtk_selection_data_get_text(): We only try to convert something to utf8 if the sender claims that it is text..."
>>>
>>> So I'm raising it here too!
>>
>> Isn't 0300 a valid unicode character?
>
> Yes, but it is not encoded as a single byte in UTF-8, it would be 2 bytes: o303 o200 (xC3 x80).

But that is as it should be: UTF8_STRING says the data is in UTF-8, so Emacs sends o303 o200. gtk_selection_data_get_text does not complain about that. Anyway, xfce should not loop like that. gtk_selection_data_get_text does not loop, it just prints one error message and returns.

>> Anyway, when Emacs gets a selection request for the clipboard with type UTF8_STRING, it eventually ends up in xselect-convert-to-string. This function does:
>>
>>   ((eq type 'UTF8_STRING)
>>    (setq str (encode-coding-string str 'utf-8)))
>>
>> As far as I can tell, it does not check if str is all text, it seems to return non-text unconverted. Should we check str first? And if it does contain non-text, what should Emacs send back as type? STRING, TEXT?
>
> Doesn't that all depend on buffer-file-coding-system and selection-coding-system being set correctly?

Yes, but I kind of assumed that was the case. Anyway, I will fix this somehow; we should not be sending non-UTF8 as a UTF8_STRING.

    Jan D.
Re: Emacs puts binary junk into the clipboard, marking it as text
Jan D. wrote:
> Kevin Rodgers wrote:
>> Jan Djärv wrote:
>>> Chris Moore wrote:
>>>> A very simple case which reproduces the bug:
>>>>
>>>> I made a 1-byte file containing just character 0300 (octal), copied that using Emacs, and clipman started printing its error message over and over again.
>
> Anyway, I will fix this somehow, we should not be sending non-UTF8 as a UTF8_STRING.

I've checked in a fix that changes UTF8_STRING to STRING if the data doesn't look like UTF8. However, this might give errors too. The only way to be sure to copy raw binary data correctly is by adding a new type (like application-specific/octet-stream). But if we do that, nobody will be able to get data from Emacs, as such a type is not standard and unsupported. Copy-paste with raw binary data is just something most apps don't do.

Please try this, it is hopefully better.

    Jan D.
Re: Emacs puts binary junk into the clipboard, marking it as text
Chris Moore wrote:
> Please describe exactly what actions triggered the bug and the precise symptoms of the bug:
>
> I run the Xfce 4 desktop environment, along with the xfce4-clipman-plugin applet which collects clipboard entries and allows me to choose between them from a menu.
>
> I have x-select-enable-clipboard set to t in Emacs, so whenever I 'kill' regions of the buffer, they get sent to the clipboard.
>
> Occasionally the clipman applet will start consuming all available CPU. This happens when I copy certain binary characters. Seems the clipman gets stuck in a loop trying to convert an illegal UTF8 string.
>
> A very simple case which reproduces the bug:
>
> I made a 1-byte file containing just character 0300 (octal), copied that using Emacs, and clipman started printing its error message over and over again.
>
> I reported this bug firstly to the Xfce BTS:
>   http://bugzilla.xfce.org/show_bug.cgi?id=1945
> but they told me it was a gtk bug, so I raised the same bug in the GNOME tracker:
>   http://bugzilla.gnome.org/show_bug.cgi?id=349856
> and they tell me it's an Emacs bug, saying:
>
>   "Well, if emacs puts binary junk into a text property it is not gtk's fault. Look at gtk_selection_data_get_text(): We only try to convert something to utf8 if the sender claims that it is text..."
>
> So I'm raising it here too!

Isn't 0300 a valid unicode character?

Anyway, when Emacs gets a selection request for the clipboard with type UTF8_STRING, it eventually ends up in xselect-convert-to-string. This function does:

  ((eq type 'UTF8_STRING)
   (setq str (encode-coding-string str 'utf-8)))

As far as I can tell, it does not check if str is all text; it seems to return non-text unconverted. Should we check str first? And if it does contain non-text, what should Emacs send back as type? STRING, TEXT?

    Jan D.
Re: Emacs puts binary junk into the clipboard, marking it as text
Jan Djärv wrote:
> Chris Moore wrote:
>> Please describe exactly what actions triggered the bug and the precise symptoms of the bug:
>>
>> I run the Xfce 4 desktop environment, along with the xfce4-clipman-plugin applet which collects clipboard entries and allows me to choose between them from a menu.
>>
>> I have x-select-enable-clipboard set to t in Emacs, so whenever I 'kill' regions of the buffer, they get sent to the clipboard.
>>
>> Occasionally the clipman applet will start consuming all available CPU. This happens when I copy certain binary characters. Seems the clipman gets stuck in a loop trying to convert an illegal UTF8 string.
>>
>> A very simple case which reproduces the bug:
>>
>> I made a 1-byte file containing just character 0300 (octal), copied that using Emacs, and clipman started printing its error message over and over again.
>>
>> I reported this bug firstly to the Xfce BTS:
>>   http://bugzilla.xfce.org/show_bug.cgi?id=1945
>> but they told me it was a gtk bug, so I raised the same bug in the GNOME tracker:
>>   http://bugzilla.gnome.org/show_bug.cgi?id=349856
>> and they tell me it's an Emacs bug, saying:
>>
>>   "Well, if emacs puts binary junk into a text property it is not gtk's fault. Look at gtk_selection_data_get_text(): We only try to convert something to utf8 if the sender claims that it is text..."
>>
>> So I'm raising it here too!
>
> Isn't 0300 a valid unicode character?

Yes, but it is not encoded as a single byte in UTF-8, it would be 2 bytes: o303 o200 (xC3 x80).

> Anyway, when Emacs gets a selection request for the clipboard with type UTF8_STRING, it eventually ends up in xselect-convert-to-string. This function does:
>
>   ((eq type 'UTF8_STRING)
>    (setq str (encode-coding-string str 'utf-8)))
>
> As far as I can tell, it does not check if str is all text, it seems to return non-text unconverted. Should we check str first? And if it does contain non-text, what should Emacs send back as type? STRING, TEXT?

Doesn't that all depend on buffer-file-coding-system and selection-coding-system being set correctly?

--
Kevin
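The difference between the character U+00C0 and the lone byte 0300 can be checked directly. The sketch below assumes a recent Emacs, where decoding an invalid byte as utf-8 yields a raw-byte character (the Emacs 22 of this thread represented such bytes with the eight-bit-* charsets instead). The helper name `my-utf-8-bytes-of` is hypothetical.

```elisp
;; Hypothetical helper: list the bytes produced by UTF-8 encoding.
(defun my-utf-8-bytes-of (string)
  "Return the bytes of STRING encoded as UTF-8, as a list of integers."
  (append (encode-coding-string string 'utf-8) nil))

;; The character U+00C0 becomes the two bytes #xC3 #x80 (o303 o200).
(my-utf-8-bytes-of (string #xC0))

;; A lone byte 0300 read from a binary file is not that character.
;; Decoded as utf-8 it stays a raw byte, and encoding hands the same
;; single byte back, which is not valid UTF-8 on its own.
(my-utf-8-bytes-of (decode-coding-string "\300" 'utf-8))
```

The second form is the situation in the report: encode-coding-string passes the non-text byte through unconverted, so the data sent as UTF8_STRING is not UTF-8.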
Emacs puts binary junk into the clipboard, marking it as text
Please describe exactly what actions triggered the bug and the precise symptoms of the bug:

I run the Xfce 4 desktop environment, along with the xfce4-clipman-plugin applet which collects clipboard entries and allows me to choose between them from a menu.

I have x-select-enable-clipboard set to t in Emacs, so whenever I 'kill' regions of the buffer, they get sent to the clipboard.

Occasionally the clipman applet will start consuming all available CPU. This happens when I copy certain binary characters. Seems the clipman gets stuck in a loop trying to convert an illegal UTF8 string.

A very simple case which reproduces the bug:

I made a 1-byte file containing just character 0300 (octal), copied that using Emacs, and clipman started printing its error message over and over again.

I reported this bug firstly to the Xfce BTS:
  http://bugzilla.xfce.org/show_bug.cgi?id=1945
but they told me it was a gtk bug, so I raised the same bug in the GNOME tracker:
  http://bugzilla.gnome.org/show_bug.cgi?id=349856
and they tell me it's an Emacs bug, saying:

  "Well, if emacs puts binary junk into a text property it is not gtk's fault. Look at gtk_selection_data_get_text(): We only try to convert something to utf8 if the sender claims that it is text..."

So I'm raising it here too!

If emacs crashed, and you have the emacs process in the gdb debugger, please include the output from the following gdb commands: `bt full' and `xbacktrace'. If you would like to further debug the crash, please read the file /usr/local/share/emacs/22.0.50/etc/DEBUG for instructions.
In GNU Emacs 22.0.50.141 (i686-pc-linux-gnu, GTK+ Version 2.8.18)
 of 2006-08-07 on chrislap
X server distributor `The X.Org Foundation', version 11.0.7000
configured using `configure '--with-gtk' '--with-xpm' '--with-jpeg' '--with-png' '--with-gif''

Important settings:
  value of $LC_ALL: nil
  value of $LC_COLLATE: nil
  value of $LC_CTYPE: nil
  value of $LC_MESSAGES: nil
  value of $LC_MONETARY: nil
  value of $LC_NUMERIC: nil
  value of $LC_TIME: nil
  value of $LANG: en_GB.UTF-8
  locale-coding-system: utf-8
  default-enable-multibyte-characters: t

Major mode: J-Shell

Minor modes in effect:
  show-paren-mode: t
  display-time-mode: t
  iswitchb-mode: t
  dynamic-completion-mode: t
  shell-dirtrack-mode: t
  tooltip-mode: t
  mouse-wheel-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  unify-8859-on-encoding-mode: t
  utf-translate-cjk-mode: t
  auto-compression-mode: t
  column-number-mode: t
  line-number-mode: t
  transient-mark-mode: t

Recent input:
T h e SPC f u n c t i o n s SPC w h i c h SPC a l l o w SPC y o u SPC t o SPC v i e w SPC r e c e n t SPC k e y s t r o k e s SPC h a v e SPC b e e n C-j h i d d e n SPC b y SPC j - s h e l l , SPC t o SPC p r o t e c t SPC p a s s w o r d s SPC e n t e r e d SPC i n SPC s h e l l SPC b u f f e r s .

Recent messages:
Writing upload.php...done
Wrote upload.php
(No changes need to be saved)
Mark set
Defining kbd macro...
Mark activated
Keyboard macro defined
(Type e to repeat macro) [118 times]
Quit
Loading emacsbug...done