https://bugs.freedesktop.org/show_bug.cgi?id=76080
--- Comment #16 from Stephan Bergmann <sberg...@redhat.com> ---
Attachment 100862 is both broken and has "problematic" content:

For one, as comment 14 notes, the HTML file is labelled as "charset=utf-8", but contains the raw bytes E8 9A F8 9E EC, which do not constitute valid UTF-8. How has this broken file been generated?

For another, the file URL contained in the <a> link is problematic:

First, that file URL, as written in the HTML file, contains raw non-ASCII bytes (see above). How they should be interpreted when "extracting" the URL from the HTML file depends on the HTML file's encoding (UTF-8), but as noted above the file is broken and those bytes cannot be interpreted meaningfully. Different software in different scenarios (OS's locale settings, etc.) will likely respond in different ways when confronted with such broken input.

Second, even if the URL could meaningfully be "extracted" from the HTML file, it would contain non-ASCII bytes. URLs are written in a subset of ASCII. If a URL's "payload" (which is, roughly, a sequence of arbitrary byte values) is to contain values outside ASCII, those values need to be escaped as %XX sequences. Again, different software in different scenarios (OS's locale settings, etc.) will likely respond in different ways when confronted with such broken input.

Third, even if the file URL's "payload" (i.e., a representation of a Windows pathname) could meaningfully be "extracted," it would be unclear how to interpret it as an actual Windows pathname, because it contains non-ASCII bytes. Windows pathnames are basically sequences of (16-bit) UTF-16 code units.
An alternative way to access pathnames is via the OS's selected 8-bit character set (like windows-1250 etc.), where Windows internally translates between that 8-bit character set and UTF-16. Some valid UTF-16 pathnames cannot be represented in certain 8-bit character sets, and the same 8-bit input sequence can denote different UTF-16 pathnames depending on which 8-bit character set the OS has actually selected. It is unspecified how (encodings of) non-ASCII bytes in a file URL's "payload" are to be interpreted on Windows, but the general consensus appears to be to interpret them according to the OS's selected 8-bit character set (all the shortcomings of that approach notwithstanding). That, again, means that software in different scenarios (i.e., OS's locale settings) will likely respond in different ways when confronted with such "problematic" input.
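
For illustration only, the three points above can be reproduced with a small Python sketch. It takes the five bytes E8 9A F8 9E EC quoted in this comment and checks whether they are valid UTF-8, shows what their %XX-escaped form would look like, and decodes them under two 8-bit character sets. The choice of cp1250 and cp1252 as the "OS's selected" character sets and all function names are assumptions made for this example; nothing here is taken from the attachment itself or from LibreOffice code.

```python
from urllib.parse import quote, unquote_to_bytes

# The five raw bytes quoted in the comment as appearing in the attachment.
RAW = bytes.fromhex("E89AF89EEC")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def percent_escape(data: bytes) -> str:
    """Escape every byte outside the unreserved ASCII set as %XX."""
    return quote(data, safe="")


def decode_in_code_page(data: bytes, code_page: str) -> str:
    """Interpret data in one 8-bit character set (which one is an assumption)."""
    return data.decode(code_page)


def representable(name: str, code_page: str) -> bool:
    """Return True if name can be encoded in the given 8-bit character set."""
    try:
        name.encode(code_page)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    # Point 1: the bytes are not UTF-8, so "charset=utf-8" cannot be honoured.
    print("valid UTF-8:", is_valid_utf8(RAW))

    # Point 2: what a well-formed URL payload would carry instead of raw bytes.
    escaped = percent_escape(RAW)
    print("escaped:", escaped)
    assert unquote_to_bytes(escaped) == RAW  # escaping is lossless

    # Point 3: the same bytes name different UTF-16 pathnames per code page.
    for code_page in ("cp1250", "cp1252"):
        text = decode_in_code_page(RAW, code_page)
        units = len(text.encode("utf-16-le")) // 2
        print(code_page, ascii(text), "UTF-16 code units:", units)
```

Under these assumptions the two code pages yield different strings for the same input, and a character such as U+0159 that cp1250 can express has no representation in cp1252 at all, which is the ambiguity described above.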