On Fri, 19 Dec 2003, Jan Willem Stumpel wrote:

> Jungshik Shin wrote:
>
> > It's impossible to infer the document encoding from 'lang' tag.
>
> Indeed, yes, I presented the URL inserted by jmaiorana to the W3C
> HTML validator and it could not make any sense out of it. Still,
> when I set Mozilla to 'autodetect Japanese' it correctly found it
> to be shift-jis. So it is possible "in a way"; after all, there
> are many text utilities (for Japanese only) that can guess (or
> autodetect) encodings.

  Sure, if you restrict the set of possible encodings to Shift_JIS,
ISO-2022-JP, and EUC-JP (the same is true of Korean encodings, Simplified
Chinese encodings, Traditional Chinese encodings, etc.), it's usually
possible to detect the encoding correctly. Some commercial 'encoding
detectors' (such as BasisTech's) reportedly do even better (a detection
rate of 95% or higher). Still, that's only a hint if what you want is the
language a document is written in (the opposite direction from what we've
been discussing), because in html/xhtml any encoding can be used to
represent any characters. Of course, after guessing the encoding, one can
do some linguistic/statistical analysis to 'determine' the language.
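
  Just to make the 'restricted set' idea concrete, here is a toy sketch
in Python (nothing to do with Mozilla's actual detector, which uses
statistical models rather than naive trial decoding):

  # Guess among a closed set of Japanese encodings by trial decoding.
  # Order matters: ISO-2022-JP is 7-bit, so it is tried first.
  CANDIDATES = ['iso-2022-jp', 'shift_jis', 'euc-jp']

  def guess_japanese_encoding(raw):
      for enc in CANDIDATES:
          try:
              raw.decode(enc)
              return enc        # first candidate that decodes cleanly
          except UnicodeDecodeError:
              continue
      return None

  sample = 'テスト'.encode('shift_jis')
  print(guess_japanese_encoding(sample))   # -> 'shift_jis'

This obviously breaks down as soon as the candidate set is open-ended,
which is exactly the point above.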

> Aahh.. something now dawns on me: perhaps charset applies to the
> WHOLE document and must be determined before any processing is
> done, while lang can apply to individual sections? That is why
> Mozilla does not 'trust' lang for determining/autodetecting the
> encoding?

  Actually, you raised an interesting possibility. There's an
_HTTP_ header, 'Content-Language', that Mozilla might be able to take
advantage of. It should be an optional feature, but with the option on,
Mozilla could turn to the charset detector corresponding to the value
of 'Content-Language'. It wouldn't be very useful, though: if an http
server is configured (or a server-side script is written) to emit the
'Content-Language' header, it's very likely to emit the 'Content-Type'
header with a 'charset' parameter as well, so there'd be no need for
charset detection. Another possibility is to make the universal charset
detector take into account the 'accept-language' list (see
Edit|Preferences|Navigator|Languages).
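
  In rough Python, the priority I have in mind would look something
like this (the function and the candidate table are made up, just to
illustrate the 'explicit charset first, Content-Language only as a
detection hint' order):

  from email.message import Message

  def pick_charset(content_type, content_language, body):
      msg = Message()
      msg['Content-Type'] = content_type
      charset = msg.get_content_charset()    # charset= parameter, if any
      if charset:
          return charset                      # no detection needed
      # No explicit charset: let Content-Language (or the user's
      # accept-language list) narrow the detector's candidate set.
      lang = (content_language or '').split(',')[0].strip().lower()
      candidates = {
          'ja': ['iso-2022-jp', 'shift_jis', 'euc-jp'],
          'ko': ['iso-2022-kr', 'euc-kr'],
          'ru': ['koi8-r', 'windows-1251', 'iso-8859-5'],
      }.get(lang[:2], ['utf-8', 'iso-8859-1'])
      for enc in candidates:
          try:
              body.decode(enc)
              return enc
          except UnicodeDecodeError:
              continue
      return None

  print(pick_charset('text/html', 'ja', 'テスト'.encode('shift_jis')))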

> It will (and can) autodetect, but only when told to do
> so by the user, not by the document. So probably jmaiorana (who
> said the page displayed correctly) had autodetect Japanese ON.

 Alternatively, the 'universal detector' may have been turned on and
succeeded in detecting the document as Shift_JIS. Or the default
charset was set to Shift_JIS, although that's not very likely given
that jmaiorana doesn't seem to be Japanese.


> > The value of 'lang' plays a role ONLY after the identity of
> > characters in documents are determined. See below.
>
> Right. Yes, this is quite clear to me now (finally!). The Mozilla
> algorithm is:
>
> 1. determine the encoding (for the whole document) from the
>    'charset' attribute, or by auto-detection as specified by the
>    user.

 There are several other hints/clues/factors that come into play here,
but basically you're right.

> 2. determine the font (for the section concerned, which may be the
>    whole "body") from the 'lang' attribute.

What's missing from your scenario is author-specified fonts. They're given
more weight than (and combined with) 'lang' if 'allow documents to use
other fonts' is checked.  I think I should file a bug to replace 'allow
... other fonts' with something clearer (e.g. 'honor author-specified
fonts' or 'ignore fonts specified by authors / in documents'), because
it's confusing, as Edward's confusion demonstrated.
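
  So the overall order, in caricature, is something like the sketch
below (the names and the font table are purely illustrative, not
Mozilla's actual data structures):

  FALLBACK_FONT_BY_LANG = {         # made-up lang -> font examples
      'ja': 'Kochi Gothic',
      'ko': 'Baekmuk Gulim',
  }

  def render(raw_bytes, charset, lang, author_fonts=None,
             allow_document_fonts=True):
      # Step 1: the charset applies to the whole document and must be
      # known before any byte is interpreted as a character.
      text = raw_bytes.decode(charset)
      # Step 2: 'lang' (per element) only influences font selection,
      # once the characters are known; author-specified fonts come
      # first when 'allow documents to use other fonts' is checked.
      fonts = []
      if allow_document_fonts and author_fonts:
          fonts.extend(author_fonts)
      fonts.append(FALLBACK_FONT_BY_LANG.get(lang, 'default'))
      return text, fonts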


> If the attributes are missing, there are several fallback options
> and defaults,

> but this is the rule in principle. One default seems
> to be 'the language group is Western'. I can put two fragments of

  Actually, no. I think I already explained this, so I'd rather not
repeat it here. Instead, you can refer to my bug report at
http://bugzilla.mozilla.org/show_bug.cgi?id=208479. You can
also do the following experiment:

  $ env LC_ALL=ru_RU mozilla
  $ env LC_ALL=hi_IN mozilla
  $ env LC_ALL=ja_JP mozilla


> I must still do a few more experiments to find out what the rule
> is when no lang is specified but the UTF-8 character does not
> occur in the Western font. (and also what the rules are which are
> used by Xprint..)

  If you can decipher it (I don't understand it fully) :-), you may want
to take a look at
http://lxr.mozilla.org/seamonkey/find?string=nsFontMetricsGTK.cpp
(especially FindFont and LocateFont) for
the font selection mechanism 'shared' by GTK, Xlib, and Xprint.
If you compare that with nsFontMetricsWin.cpp and nsFontMetricsXft.cpp,
you'll see why I don't like XLFD-based font selection.


> > BTW, as you know, GB18030 is another UTF  so that even without
> > resorting to NCRs (&#xhhhh(hh); or &#dddd..;) it can cover the
> > full range of Unicode.
>
> No, I did not know this; I had assumed it was one of those Chinese
> legacy things like eten or big5. Now I Googled a bit and found
> that it is a Chinese government Unicode standard. What was wrong
> with UTF-8 one wonders (rhetorical question, don't really want to
> know the answer because it is probably very complicated).

 Nothing is wrong with UTF-8. The PRC government wanted to preserve
backward compatibility with GB2312 (which should really be called EUC-CN,
but the name GB2312 is so widely used that it's too late to rectify) and
with GBK (which is upward-compatible with GB2312). So, in the one- and
two-byte ranges, GB18030 is identical to GB2312 and GBK except for a
small set of code points. In the extended (4-byte) range of GB18030,
all the Unicode characters not covered by GBK are assigned.
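
  If you have a Python with the GBK/GB18030 codecs around, it's easy to
check both claims (just an illustration, nothing Mozilla-specific):

  han = '汉字'                       # characters already covered by GBK
  # identical one/two-byte codes in GBK and GB18030:
  assert han.encode('gbk') == han.encode('gb18030')

  clef = '\U0001D11E'                # MUSICAL SYMBOL G CLEF, not in GBK
  print(clef.encode('gb18030'))      # a 4-byte extended-range sequence
  print(len(clef.encode('gb18030'))) # -> 4, so no NCRs are needed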

  Jungshik
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
