Re: [ilugd] Publishing UTF-8 encoded multilingual XHTML documents on web

Mithun Bhattacharya Fri, 28 Apr 2006 07:00:12 -0700


--- "आशीष शुक्ला \"Wah Java !!\""
<[EMAIL PROTECTED]> wrote:


> Hi Gora G,
> 

> First of all, sorry, the ISO-8859-1'ed doc's URL is:
> http://unixclan.no-ip.org/~21287/index.html
> 
> And now, BOM'd UTF-8 document's URL is:
> http://unixclan.no-ip.org/~21287/index-bom.html
> 
> Well, your suggestion works in Konqueror 3.5.2 (which I was not expecting
> to work, because Konqueror has to interpret the BOM characters based on
> the current encoding, which is ISO-8859-1; therefore Konqueror should
> ignore it, but it uses the BOM to set the encoding, which is not
> acceptable according to the HTML specification), but not in Mozilla
> Firefox 1.5, which displays the BOM characters as they are.
> 
> I think this is a problem with the HTML specification, which says that
> the HTTP header emitted by the server should be given priority in
> deciding the content-type. But in my view, only a document knows what
> encoding it is in, therefore the document's own encoding declaration
> should be given priority.

I am not sure which HTML specification you are looking at, but the W3C
pages say quite the opposite of what you are claiming:

http://www.w3.org/TR/html4/charset.html
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html

Basically, a sample interaction between a browser and an HTTP server
goes like this in terms of document encoding:

1. The browser sends a request to the web server with an Accept-Charset
header, e.g. Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7. The value
can be a list of charsets, each with an optional q value between 0 and
1 that gives its relative preference (no q value means q=1). In this
case ISO-8859-1 is preferred, while utf-8 or any other charset is
acceptable at a lower preference of 0.7.
2. The server responds with the charset as part of the Content-Type
header, e.g. Content-Type: text/html; charset=ISO-LATIN-7. If none of
the charsets acceptable to the browser is available on the server
side, then the server should send a 406 (Not Acceptable) response.
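
The two steps above can be sketched in Python. This is only an
illustration under assumptions: the function names parse_accept_charset
and choose_charset are made up for this mail, and the sketch skips some
RFC 2616 details (for example the special default for ISO-8859-1 when it
is not listed at all).

```python
def parse_accept_charset(header):
    """Parse an Accept-Charset value into a list of (charset, q) pairs."""
    prefs = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, params = item.partition(";")
        q = 1.0  # a missing q parameter means q=1
        for param in params.split(";"):
            key, _, value = param.strip().partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable q as "not acceptable"
        prefs.append((name.strip().lower(), q))
    return prefs


def choose_charset(header, available):
    """Return the best charset the server can offer, or None.

    None stands for the case where the server would answer with 406.
    """
    prefs = dict(parse_accept_charset(header))
    wildcard_q = prefs.get("*", 0.0)
    best, best_q = None, 0.0
    for charset in available:
        q = prefs.get(charset.lower(), wildcard_q)
        if q > best_q:
            best, best_q = charset, q
    return best


header = "ISO-8859-1,utf-8;q=0.7,*;q=0.7"
print(choose_charset(header, ["utf-8", "iso-8859-1"]))
print(choose_charset("utf-8;q=0.5", ["euc-jp"]))
```

With the header from step 1, the first call picks iso-8859-1 because it
carries the implicit q=1. The second call returns None because the
client accepts neither euc-jp nor a wildcard.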

Most of the problems start now. The standards say that the
content-type specified by the server is a recommendation or a guideline
and not an overriding instruction. The browser is supposed to accept
the data in good faith but is supposed to use its own judgement in
handling the data. This is the reason why all browsers give you an
option to change the charset being used to render the current page.

The next problem is the UTF-8 encoding itself. It was designed to be
backward compatible with US-ASCII, which ISO-8859-1 also extends. Do
note that most browsers and HTTP servers will default to ISO-8859-1 if
a specific character set is not defined. The first 128 characters (0 to
127) are therefore exactly the same in UTF-8 and ISO-8859-1. Any
attempt at autodetecting the character encoding cannot be fully
reliable, since there is no sure way to tell a multi-byte UTF-8
character from a run of two or more ISO-8859-1 characters. That's the
reason why you see funny characters on your screen if there is a
mismatch between the server response and the page encoding.
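
A small Python sketch of that mismatch, in case it helps. The helper
name misread_as_latin1 is made up for this mail; it only simulates a
browser that decodes UTF-8 bytes as ISO-8859-1.

```python
def misread_as_latin1(text):
    """Encode text as UTF-8, then decode the bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


# Characters 0-127 are byte-for-byte identical in both encodings.
assert "plain ASCII".encode("utf-8") == "plain ASCII".encode("iso-8859-1")

# U+00E9 (e with acute accent) is one byte in ISO-8859-1 but two in UTF-8.
print(len("\u00e9".encode("iso-8859-1")))  # 1
print(len("\u00e9".encode("utf-8")))       # 2

# Misreading the two UTF-8 bytes yields two ISO-8859-1 characters,
# U+00C3 and U+00A9, which is the familiar garbage on screen.
print(misread_as_latin1("\u00e9") == "\u00c3\u00a9")  # True
```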

There is a way around it too, as mentioned here:
http://www.w3.org/TR/html4/charset.html#encodings Basically, if you
follow the standards there is no scope for a default value for the
document character encoding. You are supposed to state explicitly,
inside the document itself, which character set to use to render it:
<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">
This should be at the beginning of the document and written in ASCII
characters (ISO-8859-1 has some unfair advantage since it was the first
encoding used on the web). In theory every document should have it and
every renderer should strictly adhere to it, irrespective of what
anything else might suggest, unless of course the user wants to
override it. In practice we all default to ISO-8859-1 and follow the
server-side recommendation or the document's own declaration if
present; otherwise most of the internet wouldn't render at all.
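
To show why the META line has to be early and in plain ASCII, here is a
rough Python sketch of how a renderer could look for it before it knows
the encoding. The name sniff_meta_charset, the regular expression and
the 1024-byte limit are my own assumptions, not something taken from the
HTML specification, and real browsers are more careful than this.

```python
import re

# Matches charset=... inside a <meta ...> tag, working on raw bytes so
# that no encoding has to be known in advance.
META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE
)


def sniff_meta_charset(body, limit=1024):
    """Return the charset named in an early META declaration, or None."""
    match = META_CHARSET.search(body[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


page = (
    b"<html><head>"
    b'<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">'
    b"</head><body>...</body></html>"
)
print(sniff_meta_charset(page))
print(sniff_meta_charset(b"<html><body>no declaration</body></html>"))
```

The first call finds euc-jp and the second returns None, which is the
case where a browser falls back to the server's charset or to its own
default.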

Coming to the BOM, I refer to http://www.w3.org/TR/html4/charset.html -
read the section "Notes on specific encodings", which seems to say that
a BOM should be used only if UTF-16 data is present. Also, it should be
the first thing transmitted to the user-agent - I am not sure whether
that implies it should come before the HTTP headers or after them.
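
For what it is worth, here is a Python sketch of BOM handling. It
assumes the BOM belongs to the document itself, i.e. it is the first
thing in the response body and follows the HTTP headers. The function
name split_bom is made up for this mail, and UTF-32 BOMs are
deliberately left out.

```python
import codecs


def split_bom(body):
    """Return (encoding_hint, body_without_bom) for a response body."""
    boms = [
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
    ]
    for bom, name in boms:
        if body.startswith(bom):
            return name, body[len(bom):]
    return None, body


body = codecs.BOM_UTF8 + b"<html>...</html>"
print(split_bom(body))

# A browser that ignores the BOM and decodes it as ISO-8859-1 shows
# three stray characters (U+00EF, U+00BB, U+00BF) at the top of the page.
print(codecs.BOM_UTF8.decode("iso-8859-1") == "\u00ef\u00bb\u00bf")  # True
```

The second print is probably what Firefox is showing for
index-bom.html when it sticks to ISO-8859-1.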

I guess that is all I can think of at the moment. Hopefully it has been
of some use.



Mithun

