Re: [Orgmode] Re: org-protocol: non-ASCII characters

2010-02-15 Thread Sebastian Rose
Jan Böcker <jan.boec...@jboecker.de> writes:

> On 12.02.2010 23:23, dmg wrote:
>
>> For evince, I think I have found a problem in the parsing of the link.
>> Evince already encodes the URL, but it does not encode the '/', hence
>> you will get a link like this:
>>
>> emacsclient
>> 'org-protocol://remember://docview/tmp/00%C3%A1%C3%A9%C3%AD%C3%B3%C3%BA.pdf::1'
>>
>> the filename is /tmp/00áéíóú.pdf
>>
>> But emacs incorrectly stops parsing the link after tmp/
>
> I think I have found the proper way to handle this in evince.
> Check out the attached patch or pull from:
>
> git://github.com/jboecker/evince.git
>
> This code first retrieves the non-URI-encoded UTF-8 filename and passes
> that to encode_uri. Should g_file_get_path return NULL, we abort,
> because the URI specifies something in GNOME's VFS layer that has no
> local path, so the link would not work anyway.
>
>> By the way, xournal now supports store-link
>
> Works as advertised, thanks!
>
> The only problem I have left now is a cosmetic one: when I store a link
> to, say, /tmp/test.xoj, in Org it becomes file://tmp/test.xoj instead of
> file:/tmp/test.xoj. (I have patched xournal and evince to generate file:
> instead of docview: links.)
>
> This is because org-protocol-sanitize-uri is called after decoding the
> string, allegedly because emacsclient compresses multiple slashes in a
> row to one. However, it seems that this function should be applied
> /before/ the string is URL-decoded. Is this a bug?


Hm - yes and no :)

I did not want to expose too much of the encoding and decoding problem to
the users. It's already complicated enough to add a bookmarklet.

`org-protocol-sanitize-uri' just works for the usual bookmarking and
remembering stuff we used it for - and everyone used it for `http:'
and similar protocols.



How about extending `org-protocol-sanitize-uri' to detect certain
protocols like `file:' and drop the extra slash for those? Or, better,
add an extra slash, which would be the correct way to express an
absolute path (though most apps on Linux these days take
`file:/one/slash/only').

org-protocol could be used for other purposes, too. Shouldn't Org-mode
follow links like [[file:///absolute/path]] and [[file://absolute/path]]
as we would expect? (OK, I know, emacsclient should be fixed...)





This is what my browsers here do. Neither of them cares about the number
of slashes.

Opera 10 changes a correct URI to its own special URI (note the
`localhost'):

 file://localhost/home


Firefox takes the `correct' URI:

  file:///home/sebastian



Here is a patch that would fix it. We could add more exceptions to the
if-statement as needed.



diff --git a/lisp/org-protocol.el b/lisp/org-protocol.el
index 9881e9f..b80131c 100644
--- a/lisp/org-protocol.el
+++ b/lisp/org-protocol.el
@@ -267,8 +267,11 @@ Here is an example:
   "emacsclient compresses double and tripple slashes.
 Slashes are sanitized to double slashes here."
   (when (string-match "^\\([a-z]+\\):/" uri)
-    (let* ((splitparts (split-string uri "/+")))
-      (setq uri (concat (car splitparts) "//" (mapconcat 'identity (cdr splitparts) "/")))))
+    (let* ((splitparts (split-string uri "/+"))
+           (extraslash "//"))
+      (if (string= "file:" (car splitparts))
+          (setq extraslash "/"))
+      (setq uri (concat (car splitparts) extraslash (mapconcat 'identity (cdr splitparts) "/")))))
   uri)
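
For readers following the evince/xournal side of this thread, the slash handling in the patch above can be sketched in C roughly as follows. This is only an illustration under assumptions: the function name `sanitize_uri_sketch` and its signature are made up here and are not part of org-protocol, evince or xournal. It keeps `file:` URIs at one slash and gives every other scheme two.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical C rendering of the slash handling in the Lisp patch;
   name and signature are assumptions made for illustration.
   Writes at most outsize-1 characters plus a NUL to `out`. */
static void sanitize_uri_sketch(const char *uri, char *out, size_t outsize)
{
    size_t i = 0, k = 0;

    if (outsize == 0)
        return;

    /* Match ^[a-z]+:/ as the Lisp regexp does. */
    while (uri[i] >= 'a' && uri[i] <= 'z')
        i++;
    if (i == 0 || uri[i] != ':' || uri[i + 1] != '/') {
        /* No scheme found: copy unchanged (truncating if necessary). */
        strncpy(out, uri, outsize - 1);
        out[outsize - 1] = '\0';
        return;
    }
    i++;                                    /* keep the ':' with the scheme */

    /* `file:' gets one slash, every other scheme gets two. */
    int is_file = (i == 5 && strncmp(uri, "file:", 5) == 0);
    const char *extraslash = is_file ? "/" : "//";

    /* Copy the scheme, e.g. "file:" or "http:". */
    for (size_t j = 0; j < i && k + 1 < outsize; j++)
        out[k++] = uri[j];

    /* Replace whatever run of slashes follows the scheme. */
    while (uri[i] == '/')
        i++;
    for (size_t j = 0; extraslash[j] != '\0' && k + 1 < outsize; j++)
        out[k++] = extraslash[j];

    /* Copy the rest, collapsing each later run of slashes to one. */
    while (uri[i] != '\0' && k + 1 < outsize) {
        if (uri[i] == '/' && out[k - 1] == '/') {
            i++;
            continue;
        }
        out[k++] = uri[i++];
    }
    out[k] = '\0';
}
```

With this sketch, `file:///home/sebastian` and `file:/home/sebastian` both come out with a single slash, while `http:/orgmode.org//worg` comes out as `http://orgmode.org/worg`.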



Best wishes,


   Sebastian


___
Emacs-orgmode mailing list
Please use `Reply All' to send replies to the list.
Emacs-orgmode@gnu.org
http://lists.gnu.org/mailman/listinfo/emacs-orgmode


Re: [Orgmode] Re: org-protocol: non-ASCII characters

2010-02-13 Thread Jan Böcker
On 12.02.2010 23:23, dmg wrote:

> For evince, I think I have found a problem in the parsing of the link.
> Evince already encodes the URL, but it does not encode the '/', hence
> you will get a link like this:
>
> emacsclient
> 'org-protocol://remember://docview/tmp/00%C3%A1%C3%A9%C3%AD%C3%B3%C3%BA.pdf::1'
>
> the filename is /tmp/00áéíóú.pdf
>
> But emacs incorrectly stops parsing the link after tmp/

I think I have found the proper way to handle this in evince.
Check out the attached patch or pull from:

git://github.com/jboecker/evince.git

This code first retrieves the non-URI-encoded UTF-8 filename and passes
that to encode_uri. Should g_file_get_path return NULL, we abort,
because the URI specifies something in GNOME's VFS layer that has no
local path, so the link would not work anyway.

> By the way, xournal now supports store-link

Works as advertised, thanks!

The only problem I have left now is a cosmetic one: when I store a link
to, say, /tmp/test.xoj, in Org it becomes file://tmp/test.xoj instead of
file:/tmp/test.xoj. (I have patched xournal and evince to generate file:
instead of docview: links.)

This is because org-protocol-sanitize-uri is called after decoding the
string, allegedly because emacsclient compresses multiple slashes in a
row to one. However, it seems that this function should be applied
/before/ the string is URL-decoded. Is this a bug?

From f777bca64fd23066f626bc55cee6a81d6e03dac5 Mon Sep 17 00:00:00 2001
From: Jan Böcker <jan.boec...@jboecker.de>
Date: Sat, 13 Feb 2010 12:38:39 +0100
Subject: [PATCH 1/2] bugfix in encode_uri: cast to unsigned char to get the correct byte value

---
 libview/ev-view.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/libview/ev-view.c b/libview/ev-view.c
index c334fdc..1130d39 100644
--- a/libview/ev-view.c
+++ b/libview/ev-view.c
@@ -5775,8 +5775,8 @@ static void encode_uri(gchar *encoded_uri, gint bufsize, const gchar *uri)
   if (k + 4 >= bufsize)
     break;
   encoded_uri[k++] = '%';
-  encoded_uri[k++] = hexa[uri[i] / 16];
-  encoded_uri[k++] = hexa[uri[i] % 16];
+  encoded_uri[k++] = hexa[(unsigned char)uri[i] / 16];
+  encoded_uri[k++] = hexa[(unsigned char)uri[i] % 16];
 }
   }
   encoded_uri[k] = 0;
-- 
1.6.6.1

From 1003e7809fbf2823e23b8dc8c7e3b46dfad0bcd5 Mon Sep 17 00:00:00 2001
From: Jan Böcker <jan.boec...@jboecker.de>
Date: Sat, 13 Feb 2010 12:37:31 +0100
Subject: [PATCH 2/2] URI-encode the utf-8 filename instead of a partially URI-encoded gnome vfs uri

---
 libview/ev-view.c |   28 ++++++++++++++++------------
 1 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/libview/ev-view.c b/libview/ev-view.c
index 1130d39..4fda860 100644
--- a/libview/ev-view.c
+++ b/libview/ev-view.c
@@ -5800,9 +5800,18 @@ ev_view_annotate (EvView *ev_view, gchar *uri, int page)
 
     EvDocumentInfo  *p = ev_document_get_info(ev_view->document);
 
+    // get the real file path from evince
+    GFile *gfile = g_file_new_for_uri(uri);
+    char *filePath = g_file_get_path(gfile);
+    g_object_unref (gfile);
+    if (!filePath) {
+        printf("invalid file path");
+        return;
+    }
+
     tempSel = g_malloc(ANN_MAX_BUFFER_LEN);
     tempFileName = g_malloc(strlen(uri) * 4);
-
+
     if (!EV_IS_SELECTION (ev_view->document))  {
         strcmp(tempSel, "");
         text = "";
@@ -5811,20 +5820,13 @@ ev_view_annotate (EvView *ev_view, gchar *uri, int page)
         text = get_selected_text (ev_view);
         encode_uri(tempSel, ANN_MAX_BUFFER_LEN, text);
     }
-    /// encode filename
-#define ANN_FILE_PREFIX "file://"
-    if (strncmp(uri,ANN_FILE_PREFIX, strlen(ANN_FILE_PREFIX) ) == 0) {
-        // skip the prefix
-        encode_uri(tempFileName,
-                   ANN_MAX_BUFFER_LEN, uri+strlen(ANN_FILE_PREFIX));
-    } else {
-        encode_uri(tempFileName, ANN_MAX_BUFFER_LEN, uri);
-    }
-
+
+    encode_uri(tempFileName, ANN_MAX_BUFFER_LEN, filePath);
+
     tempCommandLine = g_malloc(strlen(tempSel) + strlen(tempFileName) + 200);
 
-    printf("remember%s%s%s%d\n", p->title, uri, text, page);
     sprintf(tempCommandLine, "emacsclient 'org-protocol://remember://docview:%s::%d'", tempFileName, page+1);
+    printf("remember%s%s%s%d\n", p->title, filePath, text, page);
     printf("temp: [%s]\n", tempCommandLine);
 
     if (!g_spawn_command_line_async (tempCommandLine, error)) {
@@ -5836,6 +5838,8 @@ ev_view_annotate (EvView *ev_view, gchar *uri, int page)
     g_free (tempSel);
     g_free (tempCommandLine);
     g_free (tempFileName);
+    g_free (filePath);
+
 
 
 #ifdef fork
-- 
1.6.6.1


Re: [Orgmode] Re: org-protocol: non-ASCII characters

2010-02-12 Thread dmg
> Basically, it is OK to url-encode each character whose binary
> representation starts with 1 (i.e., the value of the character is higher
> than 127). The text to be url-encoded should ideally be UTF-8.
>
> If you use glib::ustring, it's easy to transform any iso-8859 string to
> utf-8. Each character whose binary representation starts with a 1 has to
> be url-encoded, as well as the `%' character [1], but you could as well
> url-encode the entire utf-8 string.



OK, I think I understand the problem now. I have updated xournal to convert
the filename from its encoding to UTF-8. That seems to work. See

http://github.com/dmgerman/xournal

For evince, I think I have found a problem in the parsing of the link.
Evince already encodes the URL, but it does not encode the '/', hence
you will get a link like this:

emacsclient 
'org-protocol://remember://docview/tmp/00%C3%A1%C3%A9%C3%AD%C3%B3%C3%BA.pdf::1'

the filename is  /tmp/00áéíóú.pdf

But emacs incorrectly stops parsing the link after tmp/

By the way, xournal now supports store-link


--dmg





> The function that does the decoding is `org-protocol-unhex-string', which
> in turn uses `org-protocol-unhex-compound'.
>
> `man utf-8` shows how org-protocol tries to decode characters.
>
> The JavaScript function `encodeURIComponent()' returns exactly what we
> need. It recodes a string to utf-8 and then encodes all characters
> except digits, ASCII letters and these punctuation characters: -_.!~*'()
>
> See ECMA-262 Standard, Section 15.1.3
> (http://bclary.com/2004/11/07/ecma-262.html#a-15.1.3 [2]):
>
>    The character is first transformed into a sequence of octets using
>    the UTF-8 transformation...
>
> Again, note that the decoding mechanism relies on the fact that the
> sequence to decode is url-encoded UTF-8.
>
> Example:
>
>   The url-encoded unicode representation of the German umlaut `ö' is
>   `%C3%B6'. Thus
>
>      (org-protocol-unhex-string "%C3%B6")
>
>   gives you "ö".
>
>   In iso-8859-1, the url-encoded representation of the same character `ö' is
>   `%F6'. But
>
>      (org-protocol-unhex-string "%F6")
>
>   gives you "" - the empty string. There is no utf-8 character with this
>   single-byte representation, since every byte starting with a 1 (i.e.
>   bigger than 127) is part of a multibyte sequence (2 or more bytes).
>
>   But:
>
>      (org-protocol-unhex-string "%2F%3C")
>
>   gives you, as expected, "/<", which shows that you could safely
>   url-encode each and every character of a utf-8 encoded string.
>
> ==  Footnotes:
>
> [1] The percent character `%' has to be encoded if followed by
>     [0-9A-Fa-f]{2}, because org-protocol will assume that a sequence
>     matching \\(%[0-9a-f][0-9a-f]\\)+ is an encoded character. That
>     said, a `%' should always be url-encoded, since one will hardly ever
>     know for sure that a `%' is never followed by [0-9a-f][0-9a-f].
>
> [2] Get a PDF version of ECMA-262 third edition here:
>     http://www.ecma-international.org/publications/standards/Ecma-262.htm





-- 
--dmg

---
Daniel M. German
http://turingmachine.org




Re: [Orgmode] Re: org-protocol: non-ASCII characters

2010-02-08 Thread Sebastian Rose
Jan Böcker <jan.boec...@jboecker.de> writes:
> On 06.02.2010 14:50, Jan Böcker wrote:
>> AFAIK, your current approach is correct.
>
> I was wrong. The attached patch fixes a bug in the encode_uri function.
> That fixes the non-ASCII characters problem in xournal for me.
>
> The gchar type is just typedef'd to char, which means it is signed on
> common platforms such as x86. To get the byte value, it must be cast to
> unsigned char first.
>
> - Jan


Hi Jan and Daniel!



Sorry for the long delay in answering. I read Daniel's mail last
week, but I had to think about the answer.


I'll just describe what the `org-protocol-unhex-string' functions do
here, and what they expect as arguments.




Basically, it is OK to url-encode each character whose binary
representation starts with 1 (i.e., the value of the character is higher
than 127). The text to be url-encoded should ideally be UTF-8.

If you use glib::ustring, it's easy to transform any iso-8859 string to
utf-8. Each character whose binary representation starts with a 1 has to
be url-encoded, as well as the `%' character [1], but you could as well
url-encode the entire utf-8 string.
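
As a rough illustration of that minimal rule, here is a C sketch that encodes only `%' and bytes above 127 and leaves everything else alone. The name `encode_minimal`, its signature and its error convention are assumptions made for this example; it is not the encode_uri function from xournal or evince, and it assumes the input is already UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper (name and signature are assumptions): percent-encode
   only '%' and bytes above 127. `in` is assumed to be UTF-8 already.
   Returns the number of characters written (excluding the NUL), or
   (size_t)-1 if `out` is too small. */
static size_t encode_minimal(const char *in, char *out, size_t outsize)
{
    static const char hexa[] = "0123456789ABCDEF";
    size_t k = 0;

    for (size_t i = 0; in[i] != '\0'; i++) {
        unsigned char c = (unsigned char)in[i];   /* byte value 0..255 */
        if (c > 127 || c == '%') {
            if (k + 4 > outsize)                  /* "%XX" plus the NUL */
                return (size_t)-1;
            out[k++] = '%';
            out[k++] = hexa[c / 16];
            out[k++] = hexa[c % 16];
        } else {
            if (k + 2 > outsize)                  /* one char plus the NUL */
                return (size_t)-1;
            out[k++] = (char)c;
        }
    }
    if (k + 1 > outsize)                          /* empty input, no room */
        return (size_t)-1;
    out[k] = '\0';
    return k;
}
```

In practice a caller would probably encode more than this (spaces and quotes, for instance, since the link ends up on an emacsclient command line), which is why encoding the entire string is the safer choice.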






The function that does the decoding is `org-protocol-unhex-string', which
in turn uses `org-protocol-unhex-compound'.


`man utf-8` shows how org-protocol tries to decode characters.


The JavaScript function `encodeURIComponent()' returns exactly what we
need. It recodes a string to utf-8 and then encodes all characters
except digits, ASCII letters and these punctuation characters: -_.!~*'()

See ECMA-262 Standard, Section 15.1.3
(http://bclary.com/2004/11/07/ecma-262.html#a-15.1.3 [2]):

   The character is first transformed into a sequence of octets using
   the UTF-8 transformation...


Again, note that the decoding mechanism relies on the fact that the
sequence to decode is url-encoded UTF-8.





Example:

  The url-encoded unicode representation of the German umlaut `ö' is
  `%C3%B6'. Thus

     (org-protocol-unhex-string "%C3%B6")

  gives you "ö".

  In iso-8859-1, the url-encoded representation of the same character `ö' is
  `%F6'. But

     (org-protocol-unhex-string "%F6")

  gives you "" - the empty string. There is no utf-8 character with this
  single-byte representation, since every byte starting with a 1 (i.e.
  bigger than 127) is part of a multibyte sequence (2 or more bytes).

  But:

     (org-protocol-unhex-string "%2F%3C")

  gives you, as expected, "/<", which shows that you could safely
  url-encode each and every character of a utf-8 encoded string.
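
For those working on the C side, the three cases in the example can be reproduced with a small sketch. It is not a port of `org-protocol-unhex-string': the helper names `hex_value`, `percent_decode` and `looks_like_utf8` are made up here, and the UTF-8 check is deliberately simplified (it follows the byte patterns from `man utf-8` but does not reject overlong forms or surrogates).

```c
#include <stddef.h>

/* Value of one hex digit, or -1 if `c` is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: turn every %XX into the byte it names and copy
   everything else unchanged. `out` must hold strlen(in) + 1 bytes.
   Returns the number of bytes written (excluding the NUL). */
static size_t percent_decode(const char *in, char *out)
{
    size_t k = 0;
    for (size_t i = 0; in[i] != '\0'; i++) {
        int hi, lo;
        if (in[i] == '%' && (hi = hex_value(in[i + 1])) >= 0
                         && (lo = hex_value(in[i + 2])) >= 0) {
            out[k++] = (char)(unsigned char)(hi * 16 + lo);
            i += 2;                       /* skip the two hex digits */
        } else {
            out[k++] = in[i];
        }
    }
    out[k] = '\0';
    return k;
}

/* Simplified structural UTF-8 check: the lead byte announces the length,
   continuation bytes must look like 10xxxxxx. Returns 1 if the byte
   sequence has that shape, 0 otherwise. */
static int looks_like_utf8(const char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = (unsigned char)s[i];
        size_t extra;
        if (c < 0x80)                extra = 0;   /* 0xxxxxxx */
        else if ((c & 0xE0) == 0xC0) extra = 1;   /* 110xxxxx */
        else if ((c & 0xF0) == 0xE0) extra = 2;   /* 1110xxxx */
        else if ((c & 0xF8) == 0xF0) extra = 3;   /* 11110xxx */
        else return 0;            /* stray continuation or invalid lead */
        if (extra > len - i - 1)
            return 0;             /* sequence is cut off */
        for (size_t j = 1; j <= extra; j++)
            if (((unsigned char)s[i + j] & 0xC0) != 0x80)
                return 0;
        i += extra + 1;
    }
    return 1;
}
```

Decoding `%C3%B6' yields two bytes that pass the check, decoding `%F6' yields a single lead byte with nothing following it (so the check fails), and `%2F%3C' decodes to plain ASCII.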


==  Footnotes:

[1] The percent character `%' has to be encoded if followed by
[0-9A-Fa-f]{2}, because org-protocol will assume that a sequence
matching \\(%[0-9a-f][0-9a-f]\\)+ is an encoded character. That
said, a `%' should always be url-encoded, since one will hardly ever
know for sure that a `%' is never followed by [0-9a-f][0-9a-f].

[2] Get a PDF version of ECMA-262 third edition here:
http://www.ecma-international.org/publications/standards/Ecma-262.htm




Re: [Orgmode] Re: org-protocol: non-ASCII characters

2010-02-06 Thread Jan Böcker
On 06.02.2010 14:50, Jan Böcker wrote:
> AFAIK, your current approach is correct.

I was wrong. The attached patch fixes a bug in the encode_uri function.
That fixes the non-ASCII characters problem in xournal for me.

The gchar type is just typedef'd to char, which means it is signed on
common platforms such as x86. To get the byte value, it must be cast to
unsigned char first.

- Jan
diff --git a/src/xo-misc.c b/src/xo-misc.c
index 6f0528c..c2582c7 100644
--- a/src/xo-misc.c
+++ b/src/xo-misc.c
@@ -2441,8 +2441,8 @@ void encode_uri(gchar *encoded_uri, gint bufsize, const gchar *uri)
   if (k + 4 >= bufsize)
     break;
   encoded_uri[k++] = '%';
-  encoded_uri[k++] = hexa[uri[i] / 16];
-  encoded_uri[k++] = hexa[uri[i] % 16];
+  encoded_uri[k++] = hexa[(unsigned char)uri[i] / 16];
+  encoded_uri[k++] = hexa[(unsigned char)uri[i] % 16];
 }
   }
   encoded_uri[k] = 0;
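
To see why the cast matters, consider the byte 0xC3 (the first byte of most accented Latin letters in UTF-8). The following is a small sketch with made-up helper names, not code from xournal; the contents of the `hexa` table are assumed to be the sixteen uppercase hex digits. `signed char` is used explicitly so the behaviour does not depend on whether plain char is signed on a given platform.

```c
/* Hypothetical helpers for illustration; not the xournal/evince code. */

/* Table index computed as in the old code, where the byte is signed. */
static int high_index_signed(signed char c) { return c / 16; }
static int low_index_signed(signed char c)  { return c % 16; }

/* Table index computed as in the patched code. */
static int high_index_unsigned(signed char c) { return (unsigned char)c / 16; }
static int low_index_unsigned(signed char c)  { return (unsigned char)c % 16; }

/* Write "%XX" for one byte, using the cast from the patch.
   The contents of `hexa` are an assumption. */
static void encode_byte(char c, char out[4])
{
    static const char hexa[] = "0123456789ABCDEF";
    out[0] = '%';
    out[1] = hexa[(unsigned char)c / 16];
    out[2] = hexa[(unsigned char)c % 16];
    out[3] = '\0';
}
```

Stored in a signed char, 0xC3 has the value -61, so the old code would index the table with -3 and -13 and read outside it. After the cast the value is 195, giving the indices 12 and 3, i.e. "C3".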