On Wed, 17 Dec 2003, Jan Willem Stumpel wrote:

> [EMAIL PROTECTED] wrote:
>
> > <http://ken2403king.kir.jp/form.htm>
>
> That's a funny one, indeed. When I opened it in Mozilla it was
> displayed as åæååååäåæååå. For a moment I thought it
> was Chinese (which I do not know) but it is gibberish. Mozilla
> thought it was Chinese Simplified GB 18030. The source says <html
> LANG="ja">. It is Japanese with shift-jis encoding, in reality it
> says ãåãåããããçãèã. (Isn't Unicode fun, allowing to put
> both variants in a mail message, just by copying from the Mozilla
> screen like this..)
>
> So, isn't the LANG attribute *more* irrelevant, because it did not
> help Mozilla (1.5a) to display the text correctly?

  It's impossible to infer the document encoding from the 'lang' attribute.
With NCRs, a document in any encoding can represent any Unicode
character. Even if that were not the case, how could you know whether it's
Shift_JIS, EUC-JP, ISO-2022-JP or EUC-JP (with JIS X 0213) _purely_
based on the value of 'lang' (suppose we don't have UTF-8, UTF-16 or UTF-32,
for the sake of argument)?  The value of 'lang' plays a role ONLY after
the identity of the characters in the document has been determined. See below.
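
To illustrate the NCR point, here is a minimal Python sketch. The sample string is my own choice, not text from the page in question. It shows that a document made only of ASCII bytes plus NCRs yields the same characters whichever ASCII-compatible encoding the browser assumes.

```python
import html

text = "日本語"  # sample string chosen for illustration

# Encode to pure ASCII, replacing every non-ASCII character with a decimal NCR.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")
print(ascii_bytes)  # b'&#26085;&#26412;&#35486;'

# Any ASCII-compatible encoding decodes those bytes identically, and
# resolving the NCRs restores the original characters.
for encoding in ("shift_jis", "euc_jp", "gb18030", "iso8859_1"):
    assert html.unescape(ascii_bytes.decode(encoding)) == text
```

So the bytes alone, let alone 'lang', say nothing about which encoding the author had in mind.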

> A META tag
> attribute "charset=shift-jis" added to (a copy of) the page did.
> Doesn't that mean that "encoding" is more relevant than "language"?

 Internally, Mozilla works in terms of Unicode. That is,
it has to determine the document encoding correctly (to convert the
'byte stream' of the document to a Unicode character 'stream')
before doing any font selection.  If it mistakes Shift_JIS for GB18030,
what the character drawing routine receives doesn't make sense. In addition,
the 'langGroup' inferred from the document encoding is "in conflict with"
the language specified in the document (or a part thereof). With NCRs able
to represent any Unicode character, whether or not it's covered by the
current document encoding, such a conflict could happen all the time. Which
one is given a higher priority? IIRC, it's the latter. So Mozilla tries to
render what it regards as 'a document in GB18030' (which is actually in
Shift_JIS) with Japanese fonts if possible.
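
The effect of a wrong encoding guess is easy to reproduce outside the browser. The following is only a sketch using Python's codecs, not Mozilla's converters, and the sample text is hypothetical:

```python
original = "日本語のページ"  # hypothetical sample text

raw = original.encode("shift_jis")  # the bytes the server actually sends

# Wrong guess: the same bytes read as GB18030. errors="replace" keeps the
# demo from raising on byte sequences that are invalid in GB18030.
misread = raw.decode("gb18030", errors="replace")

# Right guess: the bytes read as Shift_JIS.
correct = raw.decode("shift_jis")

print(misread)  # not the original text
print(correct)  # the original text
```

No choice of font can repair 'misread': the characters themselves are already wrong by the time font selection starts.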

BTW, as you know, GB18030 is another UTF, so even without resorting
to NCRs (&#xhhhh(hh); or &#dddd..;) it can cover the full range of Unicode.
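
A quick way to check that claim, assuming Python's 'gb18030' codec as a stand-in and a handful of sample characters of my own choosing (Latin, Japanese, Hangul, and two from outside the BMP):

```python
samples = ["A", "é", "日", "한", "\U00020000", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("gb18030")
    assert encoded.decode("gb18030") == ch  # lossless round trip
    print(f"U+{ord(ch):04X} -> {encoded.hex()} ({len(encoded)} bytes)")

# A legacy encoding such as Shift_JIS cannot do this without NCRs.
try:
    "한".encode("shift_jis")
    hangul_fits_shift_jis = True
except UnicodeEncodeError:
    hangul_fits_shift_jis = False
print(hangul_fits_shift_jis)
```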

  Another BTW: which character encoding Mozilla comes up with for
unlabelled documents depends on your setting in
View | Character coding | Autodetect.  If it's set to Chinese,
it'll come up with one of the Chinese encodings for a Shift_JIS document.
Therefore, properly labelling html/xhtml/css documents is very important. (Try
the document in question with the html/xhtml validator at
http://validator.w3.org and see what it says.)
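
Why the label matters can be sketched as follows. This is not how Mozilla implements it: 'decode_page', 'META_CHARSET' and the 'fallback' parameter are hypothetical names, with 'fallback' standing in for the reader's autodetect setting.

```python
import re

# Looks for charset=... in the first bytes of the document, e.g. in
# <meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">.
META_CHARSET = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE)


def decode_page(raw: bytes, fallback: str) -> str:
    """Decode with the declared charset if present, else with `fallback`."""
    match = META_CHARSET.search(raw[:1024])
    encoding = match.group(1).decode("ascii") if match else fallback
    return raw.decode(encoding, errors="replace")


body = "<p>日本語</p>"
unlabelled = body.encode("shift_jis")
labelled = (
    '<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">' + body
).encode("shift_jis")

# The label wins, whatever the reader's fallback setting is.
assert decode_page(labelled, fallback="gb18030").endswith(body)
# Without a label the result depends entirely on the fallback.
assert decode_page(unlabelled, fallback="shift_jis") == body
assert decode_page(unlabelled, fallback="gb18030") != body
```

With the label in place, every reader gets the same text. Without it, each reader gets whatever their own setting happens to produce.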

  Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
