--- "आशीष शुक्ला \"Wah Java !!\"" <[EMAIL PROTECTED]> wrote:
> Hi Gora G,
>
> First of all, sorry, the ISO-8859-1'ed doc's URL is:
> http://unixclan.no-ip.org/~21287/index.html
>
> And now, BOM'd UTF-8 document's URL is:
> http://unixclan.no-ip.org/~21287/index-bom.html
>
> Well, your suggestion works in Konqueror 3.5.2 (which I'm not expecting it to
> work, because Konqueror has to interpret BOM characters based on current
> encoding which is ISO-8859-1, therefore Konqueror should ignore it, but it uses
> BOM to set encoding, which is not acceptable according to HTML specification),
> but not in Mozilla Firefox 1.5 which displays BOM characters as it is.
>
> I think this is the problem with HTML specification which says, HTTP header
> emitted by server should be given priority in deciding content-type. But
> according to me, only a document knows in what encoding it is encoded, therefore
> document's encoding should be given priority.

I am not sure which HTML specification you are looking at, but the W3 pages say quite the opposite of what you are claiming:

http://www.w3.org/TR/html4/charset.html
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html

Basically, a sample interaction between a browser and an HTTP server goes like this in terms of document encoding:

1. The browser sends a request to the web server with an Accept-Charset header, e.g.
   Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
   The value can be a list. Each entry may carry a q-value between 0 and 1 giving its relative preference, and an entry without one counts as q=1. Here ISO-8859-1 is preferred, while utf-8 or any other charset is acceptable at the lower preference of 0.7.

2. The server responds with the charset as part of the Content-Type header, e.g.
   Content-Type: text/html; charset=ISO-LATIN-7
   If none of the charsets the browser accepts is available on the server side, the server should send a 406 (Not Acceptable) response.

Most of the problems start after this exchange.
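To make the q-values in step 1 concrete, here is a rough Python sketch of how a server might rank the charsets in an Accept-Charset header and pick one. The function names parse_accept_charset and choose_charset are my own, not from any library, and the wildcard handling is deliberately simplified, so treat it as an illustration rather than a complete implementation of RFC 2616.

```python
def parse_accept_charset(header):
    """Return (charset, q) pairs sorted by descending q; ties keep header order."""
    prefs = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, params = item.partition(";")
        q = 1.0  # a missing q parameter means q=1
        for param in params.split(";"):
            key, _, value = param.strip().partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # assumption: treat a malformed q as "not acceptable"
        prefs.append((name.strip().lower(), q))
    # sorted() is stable, so entries with equal q stay in header order
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


def choose_charset(header, available):
    """Pick the most preferred charset the server can supply, or None (send 406)."""
    available = [c.lower() for c in available]
    for name, q in parse_accept_charset(header):
        if q <= 0:
            continue  # q=0 means the charset is explicitly not acceptable
        if name == "*":
            # simplification: the wildcard accepts whatever the server lists first
            return available[0] if available else None
        if name in available:
            return name
    return None


if __name__ == "__main__":
    header = "ISO-8859-1,utf-8;q=0.7,*;q=0.7"
    print(parse_accept_charset(header))
    print(choose_charset(header, ["utf-8"]))
```

A return value of None from choose_charset corresponds to the 406 case in step 2.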
As I read them, the standards say that the content-type specified by the server is a recommendation or a guideline and not an overriding instruction. The browser is supposed to accept the data in good faith but use its own judgement in handling it. This is the reason why all browsers give you an option to change the charset being used to render the current page.

The next problem is the UTF-8 encoding itself. It was designed to be backward compatible with ASCII, which is also the lower half of ISO-8859-1. Do note that most browsers and HTTP servers will default to ISO-8859-1 if a specific character set is not defined. The first 128 characters (0 to 127) are therefore exactly the same in UTF-8 and ISO-8859-1. Above that range, no attempt at autodetecting the character encoding can be fully reliable, since from the bytes alone there is no sure way to tell one UTF-8 encoded character from two ISO-8859-1 encoded characters. That's the reason why you see funny characters on your screen if there is a mismatch between the server response and the page encoding.

There is a way around it too, as mentioned here:
http://www.w3.org/TR/html4/charset.html#encodings

Basically, if you follow the standards there is no scope for a default document character encoding. You are supposed to state explicitly, inside the document itself, which character set to use to render it:

<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">

This should be at the beginning of the document, and everything before it should be plain ASCII, i.e. English (ISO-8859-1 has some unfair advantage, since it was the first encoding used on the web). In theory every document should have it and every renderer should strictly adhere to it, irrespective of what anything else might suggest, unless of course the user wants to override it. In practice we all default to ISO-8859-1 and follow the server-side recommendation or the document's own declared encoding if present; otherwise most of the internet wouldn't render at all.
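To illustrate the "funny characters" from an encoding mismatch, here is a small Python sketch. It uses the single character U+00E9 (e with acute accent) as sample text, and looks_like_utf8 is a helper name I made up for the example. It is only a heuristic, which is exactly the limitation described above.

```python
text = "\u00e9"  # one character: e with acute accent

utf8_bytes = text.encode("utf-8")          # two bytes: 0xC3 0xA9
latin1_bytes = text.encode("iso-8859-1")   # one byte: 0xE9

# Reading UTF-8 bytes as ISO-8859-1 never raises an error: every byte is a
# valid ISO-8859-1 character, so one character silently becomes two.
misread = utf8_bytes.decode("iso-8859-1")

# ASCII-only text is byte-for-byte identical in both encodings.
ascii_same = "charset".encode("utf-8") == "charset".encode("iso-8859-1")


def looks_like_utf8(data):
    """Heuristic only: True if the bytes happen to form valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print(utf8_bytes, latin1_bytes)
    print(repr(misread))
    print(ascii_same)
    # The two-character ISO-8859-1 string from the misreading is also
    # perfectly valid UTF-8, so the check cannot tell which one was intended.
    print(looks_like_utf8(misread.encode("iso-8859-1")))
    print(looks_like_utf8(latin1_bytes))
```

The check can rule UTF-8 out when the bytes are invalid, but it can never prove that bytes which happen to be valid UTF-8 were meant as UTF-8, which is why an explicit charset declaration is still needed.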
Coming to the BOM, I refer to http://www.w3.org/TR/html4/charset.html. Read the section "Notes on specific encodings", which seems to say that the BOM should be used only if UTF-16 data is present. Also, it should be the very first thing transmitted to the user agent. I am not sure whether that implies it should come before the HTTP headers or after them.

I guess that is all I can think of at the moment. Hopefully it has been of some use.

Mithun

_______________________________________________
ilugd mailinglist -- ilugd@lists.linux-delhi.org
http://frodo.hserus.net/mailman/listinfo/ilugd
Archives at: http://news.gmane.org/gmane.user-groups.linux.delhi http://www.mail-archive.com/ilugd@lists.linux-delhi.org/