From: "D. Starner" <[EMAIL PROTECTED]>
Some won't convert any and will just start using UTF-8
for new ones. And this should be allowed.

Why should it be allowed? You can't mix items with different unlabeled encodings willy-nilly. All you're going to get, all you can expect to get is a mess.

When you say "you can't", that is excessive when speaking about filesystems, which DO NOT label their encoding, and which allow multiple users, each working in a different locale with a different encoding, to create and use files on shared filesystems.


So it does happen that the same filesystem stores filenames in multiple encodings. It also happens that systems allow mounting remote filesystems shared by hosts that use distinct system encodings (so even if each filesystem is internally consistent, its filenames appear with various encodings on the client, and the situation gets more complex still when they are cross-linked with soft links or URLs).
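
Here is a minimal sketch of that situation in Python (the name "café" and the two charsets are just illustrative; a POSIX filesystem stores only the raw bytes shown):

    # Two users create a file named "café", one under a UTF-8 locale,
    # the other under a Latin-1 (ISO-8859-1) locale. The filesystem
    # records the raw bytes and no encoding for either name.
    utf8_name = "café".encode("utf-8")      # b'caf\xc3\xa9'
    latin1_name = "café".encode("latin-1")  # b'caf\xe9'

    for raw in (utf8_name, latin1_name):
        for charset in ("utf-8", "latin-1"):
            try:
                print(charset, "->", raw.decode(charset))
            except UnicodeDecodeError:
                print(charset, "-> undecodable:", raw)

Note that the Latin-1 decode never fails (every byte value is "valid" in it), so the UTF-8 name silently displays as the mojibake "cafÃ©" in a Latin-1 locale, while the Latin-1 name is simply undecodable as UTF-8.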

Think about the web: it is a filesystem in itself, whose names (URLs) include inconsistent encodings. Although there is a recommendation to use UTF-8 in URLs, it is not mandatory, and there are lots of hosts that use URLs created with some ISO-8859 charset, or even Windows or Macintosh codepages.

To mitigate some of these problems, the HTML specifications allow additional (but out-of-band) attributes that declare the encoding used for resource contents, but this has no impact on the URLs themselves.

The current solution is to use "URL-encoding" (percent-encoding) and to treat URLs as binary sequences drawn from a restricted set of byte values, but this means transforming what was initially plain text into an opaque binary moniker.
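
A short Python illustration of that transformation (the text and charsets are again illustrative):

    from urllib.parse import quote, unquote

    # The same plain-text word percent-encoded from two charsets
    # produces two different, unlabeled URLs:
    print(quote("café", encoding="utf-8"))    # caf%C3%A9
    print(quote("café", encoding="latin-1"))  # caf%E9

    # Decoding requires guessing the original charset; guess wrong
    # and you silently get mojibake:
    print(unquote("caf%C3%A9", encoding="latin-1"))  # cafÃ©

Nothing in "caf%E9" or "caf%C3%A9" records which charset produced it; at that point the URL is effectively a binary identifier.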

Unfortunately, many web search engines use the URLs themselves to rank the relevance of search keywords, instead of treating them only as blind monikers.

Much has been done to internationalize domain names for use in IRIs, but URLs remain a mess, a mixture of various charsets, and IRIs are still rarely supported in browsers.
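
For the domain-name part at least, the mechanism is well defined; Python's built-in "idna" codec (implementing the older IDNA 2003 rules) shows the idea:

    # The Unicode label is converted to an ASCII "xn--" Punycode form,
    # so the DNS itself never has to carry raw UTF-8:
    print("bücher.example".encode("idna"))          # b'xn--bcher-kva.example'
    print(b"xn--bcher-kva.example".decode("idna"))  # bücher.example

But this covers only the host name; the path and query parts of the URL get no such normalization.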

The problem with URLs is that they must be allowed to contain any valid plain text, notably for form data submitted with the GET method, because that plain-text data becomes part of a query string, itself part of the URL. HTML does allow specifying in the form which encoding should be used for the form data, because servers won't always expect a single, consistent encoding. When this specification is absent, browsers often interpret it as meaning that the form data must be encoded with the same charset as the HTML form itself, but not all browsers observe this rule. In addition, many web pages are incorrectly labelled, simply because of incorrect or limited HTTP server configurations, and the standards specify that the charset given in the HTTP headers has priority over the charset specified in the encoded documents themselves. This was a poor decision, inconsistent with the use of the same HTML documents on filesystems that do not store the charset of the file content...
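
To make the GET case concrete, here is a hedged Python sketch (the field name and value are invented) of how the same form data yields different query strings depending on which charset the browser settled on:

    from urllib.parse import urlencode

    form = {"q": "café"}

    # The receiving server sees only percent-encoded bytes and has no
    # in-band way to tell which charset was used to produce them:
    print(urlencode(form, encoding="utf-8"))    # q=caf%C3%A9
    print(urlencode(form, encoding="latin-1"))  # q=caf%E9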

So don't think that this is simple. It is legitimate to be able to refer to documents which we know are plain text but which have unknown or ambiguous encodings (there is much work on the automated identification of the language/charset pairs used in documents; none of these methods is 100% free of false guesses).
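
As a sketch of such identification (using the third-party "chardet" package, assumed installed; any similar heuristic library would do):

    import chardet

    sample = "Les cafés proposés près des théâtres".encode("latin-1")
    print(chardet.detect(sample))
    # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}

The result is only a guess with a confidence score, never a certainty; short or atypical inputs are routinely misidentified.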

For clients trying to use these resources with ambiguous or unknown encodings, but which DO know that the resource is effectively plain text (such as a filename), the solution of eliminating (ignoring, hiding, discarding...) all filenames or documents that look incorrectly encoded may be the worst one: it gives the user no indication that these documents are missing, and it does not even let the user determine (from the incorrectly displayed characters) which alternate encoding to try. It is legitimate to think about solutions allowing at least a partial representation of these texts, so that the user can look at how the text is effectively encoded and get hints about which charset to select. Also, very lossy conversions (to U+FFFD) are not satisfying enough.
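
As a sketch of such a partial representation in Python: the "surrogateescape" error handler (one possible approach, not the only one) preserves the undecodable bytes so they can be inspected and converted back exactly, unlike a lossy U+FFFD replacement:

    raw = b"caf\xe9"  # e.g. a Latin-1 filename read as raw bytes

    # Lossy: the original byte is gone for good.
    print(raw.decode("utf-8", errors="replace"))  # caf\ufffd

    # Lossless but displayable: each undecodable byte becomes a lone
    # surrogate that round-trips back to the exact original byte.
    text = raw.decode("utf-8", errors="surrogateescape")
    print(text.encode("utf-8", errors="surrogateescape") == raw)  # True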



