Re: [RCD] URLs with 8bit chars?

Rimas Kudelis Sat, 22 Feb 2014 06:49:21 -0800

2014.02.22 14:35, Thomas Bruederli rašė:

On Mon, Feb 17, 2014 at 11:54 PM, Reindl Harald <[email protected]> wrote:

Roundcube does not fully recognize URLs with 8bit chars, they are being
truncated upon the first occurrence of any such 8 bit char

where does roundcube need to recognize any URL?
in which context should it recognize what URL and why?

The context where Roundcube should (and does) try to recognize URLs is
when displaying a plain text message. For convenience reasons we want
to make detected URLs clickable and not leave the user to copy & paste
it. This is done using regular expressions and we hereby stick to the
RFC specification of allowed chars in URLs which doesn't include any
8bit characters. Indeed, it's stupid for mail senders to not properly
encode their URLs and unfortunately there's little we can or want to do
about this. It's already hard enough to reliably detect URLs in a
plain text string, especially finding the end of it. If 8bit
characters should be taken into account as well, we'll likely add more
characters from the surrounding text to the URL which may lead to
false detections even for correctly encoded URLs.


Thus, I'm sorry but this is strictly a sender issue and in this case
you'd need to manually copy the URL and paste it to your browser's
location bar. You might argue that FF supports these URLs and you're
right. But unlike Roundcube, FF understands the entire string to be a
URL and doesn't need to "find" it within a random text. Therefore FF
can accept any string of characters. But also FF first converts it
into proper URL encoded characters before it actually sends the URL to
the server.


Hi Thomas,

let me disagree here. While it's sort of true that a *real* URL may only
contain a limited subset of ASCII characters, there's also such a thing
as *visible* URLs, which should be taken into account. As an extreme
example, Russia has had the .рф (Cyrillic) top-level domain [1] for
quite some time now. Most, if not all, subdomains of that domain are
written in Cyrillic characters. And surely, the web servers serving
these domains might contain pages with Cyrillic names as well.
Technically, URLs of these pages would be a mix of Punycode and
URL-escaped entities (%xx%yy%zz...). However, from a user's point of
view, such a low-level representation is absolutely unfriendly and
looks like a bunch of random symbols. I think most users would favor
writing URLs like these in their native alphabet instead of their
low-level ASCII representation.
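
To make the difference between the two forms concrete, here is a minimal
Python sketch (not Roundcube code) that turns a visible URL into its
low-level ASCII form: Punycode for the host, percent-escapes for the
path. The function name to_ascii_url is hypothetical, and the sketch
assumes a simple URL without port, user info, or non-ASCII query string.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(visible_url):
    """Convert a human-readable URL into its ASCII-only representation.

    Hypothetical helper for illustration only. Assumes no port or
    user info in the URL and leaves query and fragment untouched.
    """
    parts = urlsplit(visible_url)
    # Host labels are converted to Punycode via Python's built-in IDNA codec.
    host = parts.hostname.encode("idna").decode("ascii")
    # Path bytes outside the unreserved set become %xx escapes (UTF-8).
    # "%" is kept as-is so an already escaped path is not escaped twice.
    path = quote(parts.path, safe="/%")
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


print(to_ascii_url("http://en.wikipedia.org/wiki/.рф"))
```

For the Wikipedia address in footnote [1], this should yield the escaped
form quoted there, which is what a browser sends to the server anyway.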

Regarding the difficulty of detection, I would dare to disagree with you
as well. Since PHP 5.1, PCRE has had support for Unicode character
properties, so I'm pretty sure it must be possible to add all
alphanumeric characters to your regex easily.
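
As a rough illustration of the idea, here is a minimal sketch in Python
rather than PHP. It is not Roundcube's actual pattern: URL_RE and
find_urls are hypothetical names, and the pattern is deliberately
simplified (http/https only, no query strings). In Python 3, \w on text
patterns matches Unicode letters and digits (plus "_"), which plays a
role similar to PCRE's \p{L} and \p{N} under the /u modifier.

```python
import re

# Simplified, assumed pattern: scheme, host, optional path.
# \w is Unicode-aware here, so Cyrillic letters are accepted, while
# spaces and most punctuation still terminate the match.
URL_RE = re.compile(r"https?://[\w.-]+(?:/[\w./%~+-]*)?")


def find_urls(text):
    """Return URL-like substrings found in plain text."""
    urls = []
    for match in URL_RE.finditer(text):
        # A trailing "." is far more likely sentence punctuation
        # than part of the URL, so strip it.
        urls.append(match.group(0).rstrip("."))
    return urls


print(find_urls("See http://en.wikipedia.org/wiki/.рф for details."))
```

Because whitespace, commas, and brackets stay outside the character
classes, the end of the URL is still found the same way as for a pure
ASCII URL; only the set of accepted letters grows.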


Regards,
Rimas

[1] http://en.wikipedia.org/wiki/.%D1%80%D1%84 . Note how this looks
hardly readable compared to http://en.wikipedia.org/wiki/.рф .


_______________________________________________
Roundcube Development discussion mailing list
[email protected]
http://lists.roundcube.net/mailman/listinfo/dev
