[Libreoffice-bugs] [Bug 76080] FILESAVE: URLs encoded into UTF-8 after saving HTML

bugzilla-daemon Mon, 12 Jan 2015 01:13:38 -0800

https://bugs.freedesktop.org/show_bug.cgi?id=76080


--- Comment #16 from Stephan Bergmann <sberg...@redhat.com> ---
Attachment 100862 is both broken and has "problematic" content:

For one, as comment 14 notes, the HTML file is labelled as "charset=utf-8", but
contains the raw bytes E8 9A F8 9E EC that do not constitute valid UTF-8.  How
was this broken file generated?
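
As a minimal sketch of the first point, the following Python snippet checks
whether the five bytes quoted above decode as strict UTF-8.  The helper name
is_valid_utf8 is made up for illustration and is not part of LibreOffice or
of the attachment.

```python
# The five raw bytes quoted above, taken from the HTML attachment.
raw = bytes.fromhex("E8 9A F8 9E EC")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# 0xE8 announces a three-byte sequence, but 0xF8 is not a continuation
# byte (and never occurs in UTF-8 at all), so strict decoding fails.
print(is_valid_utf8(raw))
print(is_valid_utf8("abc".encode("utf-8")))
```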

For another, the file URL contained in the <a> link is problematic:

First, that file URL, as written in the HTML file, contains raw non-ASCII bytes
(see above).  How they should be interpreted when "extracting" the URL from the
HTML file depends on the HTML file's encoding (UTF-8), but as noted above the
file is broken and those bytes cannot be interpreted meaningfully.  Different
software in different scenarios (OS's locale settings, etc.) will likely
respond in different ways when confronted with such broken input.

Second, even if the URL could meaningfully be "extracted" from the HTML file,
it would contain non-ASCII bytes.  URLs are written in a subset of ASCII.  If a
URL's "payload" (which is, roughly, a sequence of arbitrary byte values) is to
contain values that are outside ASCII, they need to be escaped as %XX
sequences.  Again, different software in different scenarios (OS's locale
settings, etc.) will likely respond in different ways when confronted with such
broken input.
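
A minimal sketch of the %XX escaping described above, using Python's standard
urllib.parse module.  The byte values are the ones from the attachment; which
bytes a correct producer should have written in the first place is a separate
question that this sketch does not answer.

```python
from urllib.parse import quote, unquote_to_bytes

# Arbitrary payload bytes (here, the five bytes from the attachment).
raw = bytes.fromhex("E8 9A F8 9E EC")

# quote() accepts bytes and escapes every byte outside its safe ASCII set
# as a %XX sequence, so the result is pure ASCII.
escaped = quote(raw)
print(escaped)

# The escaping is lossless: unescaping yields the original bytes again.
print(unquote_to_bytes(escaped) == raw)
```

Note that the escaping only makes the URL syntactically well-formed.  It does
not say which character encoding the escaped bytes are meant to be in.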

Third, even if the file URL's "payload" (i.e., a representation of a Windows
pathname) could meaningfully be "extracted," as it contains non-ASCII bytes, it
would be unclear how to interpret it as an actual Windows pathname.  Windows
pathnames are basically sequences of (16-bit) UTF-16 code units.  An
alternative way to access pathnames is via the OS's selected 8-bit character
set (like windows-1250 etc.), where Windows internally translates between that
8-bit character set and UTF-16.  Some valid UTF-16 pathnames cannot be
represented in certain 8-bit character sets, and the same 8-bit input sequence
can denote different UTF-16 pathnames depending on which 8-bit character set
the OS has selected.  It is unspecified how (encodings of) non-ASCII bytes in a
file URL's "payload" are to be interpreted on Windows, but general consensus
appears to be to interpret them according to the OS's selected 8-bit character
set (all the shortcomings of that approach notwithstanding).  That, again,
means that software in different scenarios (i.e., OS's locale settings) will
likely respond in different ways when confronted with such "problematic" input.
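
The dependence on the selected 8-bit character set can be sketched in Python
using its cp1250 and cp1252 codecs.  These two code pages are picked purely as
an illustration (windows-1250 is the example named above, and windows-1252 is
an assumption added for contrast), and the helper name representable is made
up.  The snippet only mimics the translation that Windows performs internally.

```python
# The same 8-bit sequence, as it might appear in a file URL's payload.
raw = bytes.fromhex("E8 9A F8 9E EC")

# Interpreting the bytes under two different 8-bit character sets yields
# two different strings, i.e. two different UTF-16 pathnames.
as_1250 = raw.decode("cp1250")
as_1252 = raw.decode("cp1252")
print(as_1250 == as_1252)

# The resulting UTF-16 code units differ accordingly.
print(as_1250.encode("utf-16-le").hex())
print(as_1252.encode("utf-16-le").hex())


def representable(name: str, charset: str) -> bool:
    """Return True if name can be encoded in charset (hypothetical helper)."""
    try:
        name.encode(charset)
    except UnicodeEncodeError:
        return False
    return True


# Conversely, a name that is valid in UTF-16 may have no representation
# at all in a given 8-bit character set.
print(representable("\u0159", "cp1250"))  # U+0159, r with caron
print(representable("\u0159", "cp1252"))
```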

-- 
You are receiving this mail because:
You are the assignee for the bug.

_______________________________________________
Libreoffice-bugs mailing list
Libreoffice-bugs@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice-bugs
