On Fri, 19 Dec 2003, Jan Willem Stumpel wrote:

> Jungshik Shin wrote:
>
> > It's impossible to infer the document encoding from 'lang' tag.
>
> Indeed, yes, I presented the URL inserted by jmaiorana to the W3C
> HTML validator and it could not make any sense out of it. Still,
> when I set Mozilla to 'autodetect Japanese' it correctly found it
> to be shift-jis. So it is possible "in a way"; after all, there
> are many text utilities (for Japanese only) that can guess (or
> autodetect) encodings.

Sure, if you restrict the set of possible encodings to Shift_JIS,
ISO-2022-JP, and EUC-JP (the same is true of Korean, Simplified
Chinese, and Traditional Chinese encodings, etc), it's usually
possible to detect the encoding correctly. Some commercial 'encoding
detectors' (such as that of BasisTech) reportedly do even better (a
detection rate of 95% or higher). Still, that's just a hint in case
you want to guess the language in which a document is written (the
opposite of what we've been discussing), because in HTML/XHTML any
encoding can be used to represent any characters. Of course, after
guessing the encoding, one can do some linguistic/statistical
analysis to 'determine' the language.
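For what it's worth, the simplest form of such a restricted-set
detector is just trial decoding plus a byte-pattern check or two.
A rough sketch in Python (a toy of my own; the candidate list, its
order, and the ESC-sequence shortcut are my assumptions, not what
Mozilla's or BasisTech's detectors actually do):

  def guess_japanese_encoding(raw):
      # ISO-2022-JP announces itself with ESC sequences and stays
      # 7-bit, so check for those first.
      if b'\x1b$' in raw or b'\x1b(' in raw:
          return 'iso-2022-jp'
      # Shift_JIS and EUC-JP byte sequences overlap, so strict trial
      # decoding is only a guess; real detectors add character- and
      # byte-frequency statistics on top of this.
      for name in ('euc-jp', 'shift_jis'):
          try:
              raw.decode(name)
              return name
          except UnicodeDecodeError:
              continue
      return None            # undecided

Running it over the raw bytes of a page (e.g.
guess_japanese_encoding(open('page.html', 'rb').read()), with a
hypothetical file name) should come back with 'shift_jis', assuming
the bytes really are Shift_JIS.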
> Aahh.. something now dawns on me: perhaps charset applies to the
> WHOLE document and must be determined before any processing is
> done, while lang can apply to individual sections? That is why
> Mozilla does not 'trust' lang for determining/autodetecting the
> encoding?

Actually, you raised an interesting possibility. There's an _HTTP_
header, 'Content-Language', and Mozilla might be able to take
advantage of it. It would have to be an optional feature, but with
the option on, Mozilla could turn to a charset detector corresponding
to the value of 'Content-Language'. Then again, it wouldn't be very
useful: if an HTTP server is configured (or a server-side script is
written) to emit the 'Content-Language' header, it's very likely that
it also emits the 'Content-Type' header with a 'charset' parameter,
so there would be no need for charset detection in the first place.
Another possibility is to make the universal charset detector take
the 'accept-language' list into account (see
Edit|Preferences|Navigator|Languages).

> It will (and can) autodetect, but only when told to do so by the
> user, not by the document. So probably jmaiorana (who said the
> page displayed correctly) had autodetect Japanese ON.

Alternatively, the 'universal detector' may have been turned on and
succeeded in detecting the document as Shift_JIS. Or the default
charset was set to Shift_JIS, although that's not so likely given
that jmaiorana doesn't seem to be Japanese.

> > The value of 'lang' plays a role ONLY after the identity of
> > characters in documents are determined. See below.
>
> Right. Yes, this is quite clear to me now (finally!). The Mozilla
> algorithm is:
>
> 1. determine the encoding (for the whole document) from the
> 'charset' attribute, or by auto-detection as specified by the
> user.

There are several other hints/clues/factors that go in here, but
basically, you're right.

> 2. determine the font (for the section concerned, which may be the
> whole "body") from the 'lang' attribute.

What's missing in your scenario is author-specified fonts. They're
given more weight than (and combined with) 'lang' if 'allow documents
to use other fonts' is checked. I think I should file a bug to
replace 'allow ... other fonts' with something clearer (e.g. 'honor
author-specified fonts' or 'ignore fonts specified by authors / in
documents'), because the current wording is confusing, as Edward's
confusion demonstrated.

> If the attributes are missing, there are several fallback options
> and defaults, but this is the rule in principle. One default seems
> to be 'the language group is Western'. I can put two fragments of

Actually, no. I think I already explained this, so I'd rather not
repeat it here. Instead, you can refer to my bug report at
http://bugzilla.mozilla.org/show_bug.cgi?id=208479.

You can also do the following experiment:

  $ env LC_ALL=ru_RU mozilla
  $ env LC_ALL=hi_IN mozilla
  $ env LC_ALL=ja_JP mozilla

> I must still do a few more experiments to find out what the rule
> is when no lang is specified but the UTF-8 character does not
> occur in the Western font. (and also what the rules are which are
> used by Xprint..)

If you can decipher it (I don't fully understand it myself :-), you
may want to take a look at
http://lxr.mozilla.org/seamonkey/find?string=nsFontMetricsGTK.cpp
(especially FindFont and LocateFont) for the font selection mechanism
'shared' by GTK, Xlib, and Xprint. If you compare that with
nsFontMetricsWin.cpp and nsFontMetricsXft.cpp, you'll realize why I
don't like the XLFD-based font selection.

> > BTW, as you know, GB18030 is another UTF so that even without
> > resorting to NCRs (&#xhhhh(hh); or &#dddd..;) it can cover the
> > full range of Unicode.
>
> No, I did not know this; I had assumed it was one of those Chinese
> legacy things like eten or big5. Now I Googled a bit and found
> that it is a Chinese government Unicode standard. What was wrong
> with UTF-8 one wonders (rhetorical question, don't really want to
> know the answer because it is probably very complicated).

Nothing is wrong with UTF-8. The PRC government wanted to preserve
backward compatibility with GB2312 (strictly speaking, the encoding
should be called EUC-CN, but the name GB2312 is so widely used that
it's too late to rectify) and with GBK (which is upward-compatible
with GB2312). So, in the one- and two-byte ranges, GB18030 is
identical to GB2312 and GBK except for a small set of code points.
In the extended, four-byte range of GB18030, all the Unicode
characters not covered by GBK are assigned.
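A quick way to see both properties is to poke at Python's standard
codecs (my own illustration, nothing Mozilla-specific; it only
assumes a Python build that ships the gb2312 and gb18030 codecs):

  samples = [u'A', u'\u6c49\u5b57', u'\U0001F600']  # ASCII, hanzi, an emoji
  for s in samples:
      b = s.encode('gb18030')
      assert b.decode('gb18030') == s     # any Unicode text round-trips
      print(repr(s), '->', len(b), 'bytes')

  # Backward compatibility: a character already in GB2312/GBK keeps
  # its old two-byte code in GB18030 (0xBABA for U+6C49).
  assert u'\u6c49'.encode('gb2312') == u'\u6c49'.encode('gb18030')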
Jungshik

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/