Re: [Orgmode] Re: org-protocol: non-ASCII characters
Jan Böcker jan.boec...@jboecker.de writes:

> On 12.02.2010 23:23, dmg wrote:
>> For evince, I think I have found a problem in the parsing of the
>> link. Evince already encodes the URL, but it does not encode the
>> '/', hence you will get a link like this:
>>
>> emacsclient 'org-protocol://remember://docview/tmp/00%C3%A1%C3%A9%C3%AD%C3%B3%C3%BA.pdf::1'
>>
>> The filename is /tmp/00áéíóú.pdf, but Emacs incorrectly stops
>> parsing the link after tmp/
>
> I think I have found the proper way to handle this in evince. Check
> out the attached patch or pull from:
> git://github.com/jboecker/evince.git
>
> This code first retrieves the non-URI-encoded UTF-8 filename and
> passes that to encode_uri. Should g_file_get_path return NULL, we
> abort, because the URI specifies something in GNOME's VFS layer that
> has no local path, so the link would not work anyway.
>
>> By the way, xournal now supports store-link
>
> Works as advertised, thanks!
>
> The only problem I have left now is a cosmetic one: when I store a
> link to, say, /tmp/test.xoj, in Org it becomes file://tmp/test.xoj
> instead of file:/tmp/test.xoj. (I have patched xournal and evince to
> generate file: instead of docview: links.)
>
> This is because org-protocol-sanitize-uri is called after decoding
> the string, allegedly because emacsclient compresses multiple slashes
> in a row to one. However, it seems that this function should be
> applied /before/ the string is URL-decoded. Is this a bug?

Hm - yes and no :)

I did not want to expose too much of the encoding and decoding problem
to the users. It's already complicated enough to add a bookmarklet.
`org-protocol-sanitize-uri' just works for the usual bookmarking and
remembering stuff we used it for - and everyone used it for `http:'
and similar protocols.

How about extending `org-protocol-sanitize-uri' to detect certain
protocols like `file:' and drop the extra slash for those? Or, better,
add an extra slash, which would be the correct way to express an
absolute path (though most apps on Linux these days take
`file:/one/slash/only').

org-protocol could be used for other purposes, too. Shouldn't Org-mode
follow links like [[file:///absolute/path]] and
[[file://absolute/path]] as we would expect? (OK, I know, emacsclient
should be fixed...)

This is what my browsers here do. Neither of them cares about the
number of slashes. Opera 10 changes a correct URI to its own special
URI (note the `localhost'):

  file://localhost/home

Firefox takes the `correct' URI:

  file:///home/sebastian

Here is a patch that would fix it. We could add more exceptions to the
if-statement as needed.

diff --git a/lisp/org-protocol.el b/lisp/org-protocol.el
index 9881e9f..b80131c 100644
--- a/lisp/org-protocol.el
+++ b/lisp/org-protocol.el
@@ -267,8 +267,11 @@ Here is an example:
   "emacsclient compresses double and tripple slashes.
 Slashes are sanitized to double slashes here."
   (when (string-match "^\\([a-z]+\\):/" uri)
-    (let* ((splitparts (split-string uri "/+")))
-      (setq uri (concat (car splitparts) "//" (mapconcat 'identity (cdr splitparts) "/")))))
+    (let* ((splitparts (split-string uri "/+"))
+           (extraslash "//"))
+      (if (string= "file:" (car splitparts))
+          (setq extraslash "/"))
+      (setq uri (concat (car splitparts) extraslash (mapconcat 'identity (cdr splitparts) "/")))))
   uri)

Best wishes,

  Sebastian

___
Emacs-orgmode mailing list
Please use `Reply All' to send replies to the list.
Emacs-orgmode@gnu.org
http://lists.gnu.org/mailman/listinfo/emacs-orgmode
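For readers who do not read Emacs Lisp, the slash handling that the
patch above proposes can be sketched in C. This is only an
illustration under stated assumptions: the function name
`sanitize_uri' and its buffer interface are hypothetical and are not
part of org-protocol. The behaviour it mimics is the one in the diff:
runs of slashes are collapsed, the scheme is followed by "//", and
`file:' is followed by a single "/".

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical C rendering of the patched sanitizing logic.
 * Writes the result to dst and returns 0, or returns -1 if dst is too
 * small. The result is at most strlen(uri) + 1 characters long, so
 * dstsize must be at least strlen(uri) + 2. */
int sanitize_uri(char *dst, size_t dstsize, const char *uri)
{
    size_t len = strlen(uri);
    size_t n = 0;   /* length of the scheme name */
    size_t k = 0;   /* write position in dst */
    const char *p;

    if (dstsize < len + 2)
        return -1;

    /* Scheme: one or more lowercase letters followed by ":/". */
    while (uri[n] >= 'a' && uri[n] <= 'z')
        n++;
    if (n == 0 || uri[n] != ':' || uri[n + 1] != '/') {
        memcpy(dst, uri, len + 1);   /* no match: copy unchanged */
        return 0;
    }

    memcpy(dst, uri, n + 1);         /* copy "scheme:" */
    k = n + 1;
    dst[k++] = '/';
    if (!(n == 4 && memcmp(uri, "file", 4) == 0))
        dst[k++] = '/';              /* every scheme except file: gets "//" */

    /* Skip the slashes after the colon, then collapse later runs. */
    p = uri + n + 1;
    while (*p == '/')
        p++;
    for (; *p != '\0'; p++) {
        if (*p == '/' && p[1] == '/')
            continue;                /* keep only the last slash of a run */
        dst[k++] = *p;
    }
    dst[k] = '\0';
    return 0;
}
```

With this sketch, "file:///tmp//test.xoj" and "file:/tmp/test.xoj" both
come out as "file:/tmp/test.xoj", while "http:/orgmode.org/x" comes
out as "http://orgmode.org/x".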
Re: [Orgmode] Re: org-protocol: non-ASCII characters
On 12.02.2010 23:23, dmg wrote:
> For evince, I think I have found a problem in the parsing of the
> link. Evince already encodes the URL, but it does not encode the
> '/', hence you will get a link like this:
>
> emacsclient 'org-protocol://remember://docview/tmp/00%C3%A1%C3%A9%C3%AD%C3%B3%C3%BA.pdf::1'
>
> The filename is /tmp/00áéíóú.pdf, but Emacs incorrectly stops
> parsing the link after tmp/

I think I have found the proper way to handle this in evince. Check
out the attached patch or pull from:
git://github.com/jboecker/evince.git

This code first retrieves the non-URI-encoded UTF-8 filename and
passes that to encode_uri. Should g_file_get_path return NULL, we
abort, because the URI specifies something in GNOME's VFS layer that
has no local path, so the link would not work anyway.

> By the way, xournal now supports store-link

Works as advertised, thanks!

The only problem I have left now is a cosmetic one: when I store a
link to, say, /tmp/test.xoj, in Org it becomes file://tmp/test.xoj
instead of file:/tmp/test.xoj. (I have patched xournal and evince to
generate file: instead of docview: links.)

This is because org-protocol-sanitize-uri is called after decoding the
string, allegedly because emacsclient compresses multiple slashes in a
row to one. However, it seems that this function should be applied
/before/ the string is URL-decoded. Is this a bug?

From f777bca64fd23066f626bc55cee6a81d6e03dac5 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Jan=20B=C3=B6cker?= jan.boec...@jboecker.de
Date: Sat, 13 Feb 2010 12:38:39 +0100
Subject: [PATCH 1/2] bugfix in encode_uri: cast to unsigned char to get the correct byte value

---
 libview/ev-view.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/libview/ev-view.c b/libview/ev-view.c
index c334fdc..1130d39 100644
--- a/libview/ev-view.c
+++ b/libview/ev-view.c
@@ -5775,8 +5775,8 @@ static void encode_uri(gchar *encoded_uri, gint bufsize, const gchar *uri)
       if (k + 4 >= bufsize) break;
       encoded_uri[k++] = '%';
-      encoded_uri[k++] = hexa[uri[i] / 16];
-      encoded_uri[k++] = hexa[uri[i] % 16];
+      encoded_uri[k++] = hexa[(unsigned char)uri[i] / 16];
+      encoded_uri[k++] = hexa[(unsigned char)uri[i] % 16];
     }
   }
   encoded_uri[k] = 0;
-- 
1.6.6.1

From 1003e7809fbf2823e23b8dc8c7e3b46dfad0bcd5 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Jan=20B=C3=B6cker?= jan.boec...@jboecker.de
Date: Sat, 13 Feb 2010 12:37:31 +0100
Subject: [PATCH 2/2] URI-encode the utf-8 filename instead of a partially URI-encoded gnome vfs uri

---
 libview/ev-view.c |   28 ++++++++++++++++------------
 1 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/libview/ev-view.c b/libview/ev-view.c
index 1130d39..4fda860 100644
--- a/libview/ev-view.c
+++ b/libview/ev-view.c
@@ -5800,9 +5800,18 @@ ev_view_annotate (EvView *ev_view, gchar *uri, int page)
 
   EvDocumentInfo *p = ev_document_get_info(ev_view->document);
 
+  // get the real file path from evince
+  GFile *gfile = g_file_new_for_uri(uri);
+  char *filePath = g_file_get_path(gfile);
+  g_object_unref (gfile);
+  if (!filePath) {
+    printf("invalid file path");
+    return;
+  }
+
   tempSel = g_malloc(ANN_MAX_BUFFER_LEN);
   tempFileName = g_malloc(strlen(uri) * 4);
-
+
   if (!EV_IS_SELECTION (ev_view->document)) {
     strcmp(tempSel, "");
     text = "";
@@ -5811,20 +5820,13 @@ ev_view_annotate (EvView *ev_view, gchar *uri, int page)
     text = get_selected_text (ev_view);
     encode_uri(tempSel, ANN_MAX_BUFFER_LEN, text);
   }
-  /// encode filename
-#define ANN_FILE_PREFIX "file://"
-  if (strncmp(uri,ANN_FILE_PREFIX, strlen(ANN_FILE_PREFIX) ) == 0) {
-    // skip the prefix
-    encode_uri(tempFileName,
-               ANN_MAX_BUFFER_LEN, uri+strlen(ANN_FILE_PREFIX));
-  } else {
-    encode_uri(tempFileName, ANN_MAX_BUFFER_LEN, uri);
-  }
-
+
+  encode_uri(tempFileName, ANN_MAX_BUFFER_LEN, filePath);
+
   tempCommandLine = g_malloc(strlen(tempSel) + strlen(tempFileName) + 200);
 
-  printf("remember%s%s%s%d\n", p->title, uri, text, page);
   sprintf(tempCommandLine, "emacsclient 'org-protocol://remember://docview:%s::%d'", tempFileName, page+1);
+  printf("remember%s%s%s%d\n", p->title, filePath, text, page);
   printf("temp: [%s]\n", tempCommandLine);
 
   if (!g_spawn_command_line_async (tempCommandLine, error)) {
@@ -5836,6 +5838,8 @@ ev_view_annotate (EvView *ev_view, gchar *uri, int page)
   g_free (tempSel);
   g_free (tempCommandLine);
   g_free (tempFileName);
+  g_free (filePath);
+
 
 #ifdef fork
-- 
1.6.6.1
Re: [Orgmode] Re: org-protocol: non-ASCII characters
> Basically, it is OK to url-encode each character whose binary
> representation starts with 1 (i.e., the value of the character is
> higher than 127). The text to be url-encoded should be UTF-8 ideally.
> If you use glib::ustring, it's easy to transform any iso-8859 string
> to utf-8. Each character whose binary representation starts with a 1
> has to be url-encoded, as well as the `%' character [1], but you
> could as well url-encode the entire utf-8 string.

Ok, I think I understand the problem now. I have updated xournal to
convert the filename from its encoding to UTF-8. That seems to work.
See http://github.com/dmgerman/xournal

For evince, I think I have found a problem in the parsing of the link.
Evince already encodes the URL, but it does not encode the '/', hence
you will get a link like this:

emacsclient 'org-protocol://remember://docview/tmp/00%C3%A1%C3%A9%C3%AD%C3%B3%C3%BA.pdf::1'

The filename is /tmp/00áéíóú.pdf, but Emacs incorrectly stops parsing
the link after tmp/

By the way, xournal now supports store-link

--dmg

> The function that does the decoding is `org-protocol-unhex-string'
> which in turn uses `org-protocol-unhex-compound'. `man utf-8` shows
> how org-protocol tries to decode characters.
>
> The JavaScript function `encodeURIComponent()' returns exactly what
> we need. It recodes a string to utf-8 and then encodes all
> characters, except digits, ASCII letters and these punctuation
> characters: -_.!~*'()
>
> See ECMA-262 Standard, Section 15.1.3
> (http://bclary.com/2004/11/07/ecma-262.html#a-15.1.3 [2]):
>
>   The character is first transformed into a sequence of octets using
>   the UTF-8 transformation...
>
> Again, note that the decoding mechanism relies on the fact that the
> sequence to decode is url-encoded UTF-8.
>
> Example:
>
> The url-encoded UTF-8 representation of the German umlaut `ö' is
> `%C3%B6'. Thus
>
>   (org-protocol-unhex-string "%C3%B6")
>
> gives you "ö".
>
> In iso-8859-1, the url-encoded representation of the same character
> `ö' was `%F6'. But
>
>   (org-protocol-unhex-string "%F6")
>
> gives you "" - the empty string. There is no utf-8 character with
> this single-byte representation, since every byte starting with a 1
> (i.e. bigger than 127) is part of a multibyte sequence (2 or more
> bytes).
>
> But:
>
>   (org-protocol-unhex-string "%2F%3C")
>
> gives you, as expected, "/<", which shows that you could safely
> url-encode each and every character of a utf-8 encoded string.
>
> == Footnotes:
>
> [1] The percent character `%' has to be encoded if followed by
>     [0-9A-Fa-f]{2}, because org-protocol will assume that a sequence
>     matching \\(%[0-9a-f][0-9a-f]\\)+ is an encoded character. That
>     said, a `%' has to be url-encoded, since one will hardly ever
>     know for sure that a `%' is never followed by [0-9a-f][0-9a-f].
>
> [2] Get a PDF version of ECMA-262 third edition here:
>     http://www.ecma-international.org/publications/standards/Ecma-262.htm

-- 
--dmg

---
Daniel M. German
http://turingmachine.org
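The decoding behaviour discussed in this message can be illustrated
with a small standalone C sketch. It is not the org-protocol
implementation (that is Emacs Lisp and is not reproduced in this
thread); `percent_decode' and `is_utf8' are hypothetical names, and
the UTF-8 check is deliberately simplified. It only shows why
"%C3%B6" decodes to a usable character while "%F6" decodes to a byte
that is not valid UTF-8 on its own.

```c
#include <stddef.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: decode %XX escapes from src into dst, which must
 * hold at least strlen(src) + 1 bytes. A '%' that is not followed by
 * two hex digits is copied through unchanged. Returns the decoded
 * length in bytes. */
size_t percent_decode(char *dst, const char *src)
{
    size_t k = 0;

    while (*src != '\0') {
        int hi, lo;
        if (src[0] == '%' && (hi = hex_value(src[1])) >= 0 &&
            (lo = hex_value(src[2])) >= 0) {
            dst[k++] = (char)(hi * 16 + lo);
            src += 3;
        } else {
            dst[k++] = *src++;
        }
    }
    dst[k] = '\0';
    return k;
}

/* Returns 1 if the n bytes at s are structurally valid UTF-8: each lead
 * byte is followed by the right number of continuation bytes.
 * Simplified: not every overlong or surrogate form is rejected. */
int is_utf8(const char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char b = (unsigned char)s[i];
        size_t extra, j;

        if (b < 0x80) extra = 0;
        else if (b >= 0xC2 && b <= 0xDF) extra = 1;
        else if (b >= 0xE0 && b <= 0xEF) extra = 2;
        else if (b >= 0xF0 && b <= 0xF4) extra = 3;
        else return 0;                  /* stray continuation or bad lead */

        if (extra > n - i - 1)
            return 0;                   /* sequence is cut off */
        for (j = 1; j <= extra; j++)
            if (((unsigned char)s[i + j] & 0xC0) != 0x80)
                return 0;
        i += extra + 1;
    }
    return 1;
}
```

Decoding "%C3%B6" yields the two bytes C3 B6, which pass the check;
decoding "%F6" yields the single byte F6, which does not.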
Re: [Orgmode] Re: org-protocol: non-ASCII characters
Jan Böcker jan.boec...@jboecker.de writes:

> On 06.02.2010 14:50, Jan Böcker wrote:
>> AFAIK, your current approach is correct.
>
> I was wrong. The attached patch fixes a bug in the encode_uri
> function. That fixes the non-ASCII characters problem in xournal for
> me.
>
> The gchar type is just typedef'd to char, which is signed on this
> platform. To get the byte value, it must be cast to unsigned char
> first.
>
> - Jan

Hi Jan and Daniel!

Sorry for answering with that long delay. I read Daniel's mail last
week, but I had to think about the answer.

I'll just describe what the `org-protocol-unhex-string' functions do
here, and what they expect as arguments.

Basically, it is OK to url-encode each character whose binary
representation starts with 1 (i.e., the value of the character is
higher than 127). The text to be url-encoded should be UTF-8 ideally.
If you use glib::ustring, it's easy to transform any iso-8859 string
to utf-8. Each character whose binary representation starts with a 1
has to be url-encoded, as well as the `%' character [1], but you could
as well url-encode the entire utf-8 string.

The function that does the decoding is `org-protocol-unhex-string'
which in turn uses `org-protocol-unhex-compound'. `man utf-8` shows
how org-protocol tries to decode characters.

The JavaScript function `encodeURIComponent()' returns exactly what we
need. It recodes a string to utf-8 and then encodes all characters,
except digits, ASCII letters and these punctuation characters:
-_.!~*'()

See ECMA-262 Standard, Section 15.1.3
(http://bclary.com/2004/11/07/ecma-262.html#a-15.1.3 [2]):

  The character is first transformed into a sequence of octets using
  the UTF-8 transformation...

Again, note that the decoding mechanism relies on the fact that the
sequence to decode is url-encoded UTF-8.

Example:

The url-encoded UTF-8 representation of the German umlaut `ö' is
`%C3%B6'. Thus

  (org-protocol-unhex-string "%C3%B6")

gives you "ö".

In iso-8859-1, the url-encoded representation of the same character
`ö' was `%F6'. But

  (org-protocol-unhex-string "%F6")

gives you "" - the empty string. There is no utf-8 character with this
single-byte representation, since every byte starting with a 1 (i.e.
bigger than 127) is part of a multibyte sequence (2 or more bytes).

But:

  (org-protocol-unhex-string "%2F%3C")

gives you, as expected, "/<", which shows that you could safely
url-encode each and every character of a utf-8 encoded string.

== Footnotes:

[1] The percent character `%' has to be encoded if followed by
    [0-9A-Fa-f]{2}, because org-protocol will assume that a sequence
    matching \\(%[0-9a-f][0-9a-f]\\)+ is an encoded character. That
    said, a `%' has to be url-encoded, since one will hardly ever know
    for sure that a `%' is never followed by [0-9a-f][0-9a-f].

[2] Get a PDF version of ECMA-262 third edition here:
    http://www.ecma-international.org/publications/standards/Ecma-262.htm
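The encoding rule described above (keep digits, ASCII letters and
-_.!~*'() and percent-encode every other byte of the UTF-8 string,
including `%' itself) can be sketched in C as follows. This is a
minimal illustration, not the code used by xournal or evince; the
names `is_unreserved' and `percent_encode' are hypothetical, and the
input is assumed to be UTF-8 already.

```c
#include <stddef.h>
#include <string.h>

/* Returns nonzero if c may stay unencoded: ASCII letters, digits and
 * the punctuation characters -_.!~*'() */
static int is_unreserved(unsigned char c)
{
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
        (c >= '0' && c <= '9'))
        return 1;
    return c != '\0' && strchr("-_.!~*'()", c) != NULL;
}

/* Hypothetical helper: percent-encode the UTF-8 string src into dst.
 * dstsize is the size of dst in bytes, including room for the
 * terminating NUL. Returns the encoded length, or (size_t)-1 if dst is
 * too small. */
size_t percent_encode(char *dst, size_t dstsize, const char *src)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t k = 0;

    for (; *src != '\0'; src++) {
        /* Work on the unsigned byte value so indices stay in 0..15. */
        unsigned char b = (unsigned char)*src;
        size_t need = is_unreserved(b) ? 1 : 3;

        if (k + need + 1 > dstsize)
            return (size_t)-1;
        if (need == 1) {
            dst[k++] = (char)b;
        } else {
            dst[k++] = '%';
            dst[k++] = hex[b >> 4];
            dst[k++] = hex[b & 0x0F];
        }
    }
    if (k + 1 > dstsize)
        return (size_t)-1;
    dst[k] = '\0';
    return k;
}
```

Because every byte above 127 and the `%' character are always encoded,
the output contains only ASCII and can be decoded unambiguously, which
is the property the decoder relies on.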
Re: [Orgmode] Re: org-protocol: non-ASCII characters
On 06.02.2010 14:50, Jan Böcker wrote:
> AFAIK, your current approach is correct.

I was wrong. The attached patch fixes a bug in the encode_uri
function. That fixes the non-ASCII characters problem in xournal for
me.

The gchar type is just typedef'd to char, which is signed on this
platform. To get the byte value, it must be cast to unsigned char
first.

- Jan

diff --git a/src/xo-misc.c b/src/xo-misc.c
index 6f0528c..c2582c7 100644
--- a/src/xo-misc.c
+++ b/src/xo-misc.c
@@ -2441,8 +2441,8 @@ void encode_uri(gchar *encoded_uri, gint bufsize, const gchar *uri)
       if (k + 4 >= bufsize) break;
       encoded_uri[k++] = '%';
-      encoded_uri[k++] = hexa[uri[i] / 16];
-      encoded_uri[k++] = hexa[uri[i] % 16];
+      encoded_uri[k++] = hexa[(unsigned char)uri[i] / 16];
+      encoded_uri[k++] = hexa[(unsigned char)uri[i] % 16];
     }
   }
   encoded_uri[k] = 0;
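To make the effect of the cast concrete, here is a minimal standalone
sketch in C. It is not taken from xournal: the `hexa' table is assumed
to be a string of 16 uppercase hex digits (its real definition is not
shown in the patch), and `encode_byte' and `unpatched_high_index' are
hypothetical helpers used only for illustration.

```c
#include <stddef.h>

/* Assumed lookup table: 16 uppercase hex digits, standing in for the
 * `hexa' array that the patch indexes. */
static const char hexa[] = "0123456789ABCDEF";

/* Hypothetical helper: write "%XX" for one byte into out[0..2] and a
 * terminating NUL into out[3]. The cast to unsigned char maps the byte
 * to 0..255 before dividing, so both table indices stay in 0..15. */
void encode_byte(char out[4], char c)
{
    unsigned char b = (unsigned char)c;

    out[0] = '%';
    out[1] = hexa[b / 16];
    out[2] = hexa[b % 16];
    out[3] = '\0';
}

/* What the unpatched expression computes where plain char is signed:
 * the byte 0xC3 is read as -61, and -61 / 16 truncates to -3, which
 * would index in front of the table. */
int unpatched_high_index(signed char c)
{
    return c / 16;
}
```

For the first byte of `ö' in UTF-8 (0xC3), the patched form yields
"%C3", while the unpatched index is negative, which explains the
garbage output for non-ASCII filenames.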