Displaying Chinese on HTML
Hello friends,
I am using Unicode (UTF-8) to store data in an Oracle database (the database character set is UTF-8). In one of the columns of a table, I have stored Chinese data in UTF-8 format. Now I want to retrieve it and show it on an HTML page through a Java servlet using JDBC. I have used the HTML tag meta content=text/html charset=UTF-8. I use Netscape Navigator 4.6 and have set the character set to UTF-8 in the browser. I also have a lot of Chinese fonts in the browser, but I don't know which one to choose. Anyway, I tried various combinations, but the Chinese never got displayed! The HTML page shows inverted question marks as output!
Can somebody help in displaying the desired language? Also, when accepting Chinese or any other language from HTML, does my servlet code need to change? Or will it automatically convert the HTML's UTF-8 input to UCS-2 and transparently insert the string into the Oracle database as UTF-8 using JDBC?
King regards,
Parvinder
RE: Furigana codes?
Daniel Biddle wrote: On Wed, 5 Jul 2000, Rick McGowan wrote: iRck I thought this was a typo until I saw your address. U263A It's not a typo: Rick's signature has passed through an Indic renderer, so the "i" was reordered. U+FF1AU+FF0DU+FF09 _ Maco`
Re: Any other Italians on Unicode List? (was RE: French annotated Cha
Such lists of translations for the glossary terms in Unicode would be quite useful. If these are produced, be sure to request their addition to Useful Resources on the Unicode site.
Mark
Antoine Leca wrote:
Patrick Andries wrote:
- Original Message - From: "Marco Piovanelli" [EMAIL PROTECTED]
Marco Cimarosti ([EMAIL PROTECTED]) wrote:
Patrick Andries' lexicon made me wonder what some of these terms might look like in Italian!? kerning: kerning (?)
I think I've seen this translated as "crenatura".
I think so too; it seems the origin is the same as that of the French equivalent (crénage). http://www.pcpratico.futura-ge.com/servizi/glossario/vocaboli/Voc_crenatura.asp, http://www.giofuga.com/lettering/lettcre.htm .
This is also what http://www.irisa.fr/faqtypo/dico.html gives. By the way, there are a lot of typography-related terms there, but this is an ongoing effort, so if fluent French or Italian (or Spanish or German) speakers can lend a hand, I am certain Jacques André will be delighted. (In fact, he already asked me, but the limits of my knowledge are far below the current content ;-)).
Antoine
Re: Planes 1 and 2
At 9:11 PM -0800 7/5/00, Doug Ewell wrote: Kenneth Whistler [EMAIL PROTECTED] wrote: If you want the general planned layout for Planes 1 or 2, the best source is Michael Everson's graphical roadmaps, located at: http://www.egt.ie/standards/iso10646/ucs-roadmap.html The link to Plane 2 was broken the last time I checked. Well, but the roadmap for Plane 2 is *really* simple. You just fill it with 40,000+ ideographs from the bottom up. -- = John H. Jenkins [EMAIL PROTECTED] [EMAIL PROTECTED] http://www.blueneptune.com/~tseng
Re: Planes 1 and 2
At 07:17 -0800 2000-07-06, John H. Jenkins wrote:
The link to Plane 2 was broken the last time I checked.
'Tis fixed.
Well, but the roadmap for Plane 2 is *really* simple. You just fill it with 40,000+ ideographs from the bottom up.
You missed the WG2 meeting where we argued whether to fill it from the bottom up or the top down ;-)
Michael Everson ** Everson Gunn Teoranta ** http://www.egt.ie
15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland
Vox +353 1 478 2597 ** Fax +353 1 478 2597 ** Mob +353 86 807 9169
27 Páirc an Fhéithlinn; Baile an Bhóthair; Co. Átha Cliath; Éire
RE: Displaying Chinese on HTML
Please be kind and make sure not to send messages in HTML format. Thank you. Comments below.
From: Parvinder Singh(EHPT) [mailto:[EMAIL PROTECTED]]
Sent: Thursday, July 06, 2000 12:23 AM
Hello friends, I am using Unicode (UTF-8) to store data in an Oracle database (the database character set is UTF-8). In one of the columns of a table, I have stored Chinese data in UTF-8 format. Now I want to retrieve it and show it on an HTML page through a Java servlet using JDBC. I have used the HTML tag meta content=text/html charset=UTF-8. I use Netscape Navigator 4.6 and have set the character set to UTF-8 in the browser. I also have a lot of Chinese fonts in the browser, but I don't know which one to choose. Anyway, I tried various combinations, but the Chinese never got displayed!
I pointed my browser to http://www.trigeminal.com/samples/provincial.html, which gives samples of text in a whole bunch of languages, courtesy of Michael Kaplan. I found that Netscape (I have 4.7) fails to display many of the lines, while Explorer (5.00) shows every one for which I have a font (I am doing this from NT4SP5). I have not yet investigated why Netscape isn't rendering them (hints, anyone?). I suggest you try reading your pages with Explorer first. This may help you confirm that your pages are valid. Use the above link to "calibrate" your browsers first!
The HTML page shows inverted question marks as output!
This is expected behavior, I think. I get hollow boxes for the unrenderable characters, but I've heard of other people getting question marks. I do not know whether the question marks and hollow boxes each mean something different, though.
Can somebody help in displaying the desired language? Also, when accepting Chinese or any other language from HTML, does my servlet code need to change? Or will it automatically convert the HTML's UTF-8 input to UCS-2 and transparently insert the string into the Oracle database as UTF-8 using JDBC?
If you're reading it in as UTF-8 and storing it as UTF-8, why do you want to convert it? Am I missing something?
King regards,
Is King your dog? ;-)
HTH,
/|/|ike
Parvinder
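The conversion question in this thread can be illustrated with a small sketch. It assumes a modern Java runtime (java.nio.charset.StandardCharsets did not exist in 2000), and the class and method names are hypothetical. It shows that decoding UTF-8 bytes into a Java String and encoding them back is lossless, while encoding the same text into a charset that lacks the characters, here ISO-8859-1, silently substitutes question marks. Whether that is what happened in the servlet above is an assumption; the replacement character actually seen depends on which layer performs the lossy conversion.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf8RoundTrip {
    /** Decodes UTF-8 bytes into a Java String (held as UTF-16 internally). */
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Encodes text in the given charset; unmappable characters are replaced, not reported. */
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        String chinese = "\u4E2D\u6587"; // two Chinese characters, written as escapes

        // UTF-8 can represent the text: three bytes per character here.
        byte[] utf8 = encode(chinese, StandardCharsets.UTF_8);
        System.out.println(utf8.length);                       // 6
        System.out.println(decodeUtf8(utf8).equals(chinese));  // true

        // ISO-8859-1 cannot: each character becomes the replacement byte '?'.
        byte[] latin1 = encode(chinese, StandardCharsets.ISO_8859_1);
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1)); // ??
    }
}
```

If a servlet writes its response through a writer whose encoding is not UTF-8, the same substitution can occur even though the database and the META tag both say UTF-8.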
Re: Control characters
On Wed, 5 Jul 2000, john wrote:
IIRC, the Model 37 Teletype interpreted 0A as a newline function,
Also models 33 and 38, which also interpreted x0D as carriage return.
Definitely not true of the model 33; it interpreted 0A as a line-feed, and if it received one not preceded by 0D it would do this. (Hopefully, you are all reading this email with a fixed-width font as God intended.)
so ASCII allowed 0A to be interpreted as either LF or NL.
That's non sequitur, but folks are like that.
How so? The LF behavior is different from the NL behavior.
DEC OSes notoriously distorted or misused the control characters, thus ^U = NAK was used to kill an input line instead of ^X = cancel. Since some of these editing commands were actually merely echoed back from the main processor to the comm control unit through which the terminal was connected,
Definitely not true of any DEC OS; control characters were echoed as ^A, ^B, etc.
there was some fogging over of the concepts of source and destination. The comm controller would buffer up what was typed until it got a CR (0x0D) and so these editing controls were actually commands to that comm controller to clear its buffer.
Again, not true of any DEC OS; characters were interpreted one by one and selectively echoed by the CPU only. There were no buffering serial-line controllers for the PDP-8, and they weren't introduced for the PDP-11 until later -- and even then, the typical mode was to stop buffering on *any* control character.
--
John Cowan [EMAIL PROTECTED]
"You need a change: try Canada" "You need a change: try China"
--fortune cookies opened by a couple that I know
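The difference between the LF and NL readings of 0A discussed above can be sketched in code. This is only a toy model with hypothetical names, not a description of any real Teletype: CR sets the column to zero, LF advances one row, and a flag chooses whether LF also returns to column zero (the NL reading). Overprinting is simplified to overwriting.

```java
import java.util.ArrayList;
import java.util.List;

class TeletypeSketch {
    /**
     * Lays out text on a page of rows. CR (0x0D) returns to column 0.
     * LF (0x0A) moves down one row; if lfIsNewline is true it also returns
     * to column 0, otherwise the column is kept.
     */
    static List<String> render(String input, boolean lfIsNewline) {
        List<StringBuilder> rows = new ArrayList<>();
        rows.add(new StringBuilder());
        int row = 0;
        int col = 0;
        for (char c : input.toCharArray()) {
            if (c == '\r') {
                col = 0;
            } else if (c == '\n') {
                row++;
                if (rows.size() <= row) {
                    rows.add(new StringBuilder());
                }
                if (lfIsNewline) {
                    col = 0;
                }
            } else {
                StringBuilder line = rows.get(row);
                while (line.length() < col) {
                    line.append(' ');          // pad up to the print position
                }
                if (col < line.length()) {
                    line.setCharAt(col, c);    // simplification: overwrite, no overstrike
                } else {
                    line.append(c);
                }
                col++;
            }
        }
        List<String> out = new ArrayList<>();
        for (StringBuilder sb : rows) {
            out.add(sb.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // With a bare LF treated as line feed only, the second row starts
        // under the end of the first; with the NL reading it starts at column 0.
        for (String line : render("it would do\nthis", false)) {
            System.out.println(line);
        }
        for (String line : render("it would do\nthis", true)) {
            System.out.println(line);
        }
    }
}
```

In this model, sending CR followed by LF gives the same layout under either reading, which is why the two interpretations only diverge for a bare LF.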
Re: Planes 1 and 2
Michael Everson wrote:
Well, but the roadmap for Plane 2 is *really* simple. You just fill it with 40,000+ ideographs from the bottom up.
You missed the WG2 meeting where we argued whether to fill it from the bottom up or the top down ;-)
I must certainly have missed something, but it was my understanding that with CJK, when filling from top to bottom, we should fill from right to left, i.e. from U+2xxxF to U+2xxx0. But WG2 is *not* doing so, are you? ;-)
Antoine
RE: How-To handle i18n when you don't know charset?
If I'm dealing with e-mail (POP3 and SMTP), do I necessarily want to respond to the user in the same charset as their original message to me? So far, I've convinced myself that the answer is no.
Leon
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Thursday, July 06, 2000 12:12 PM
To: Unicode List
Cc: Unicode List
Subject: Re: How-To handle i18n when you don't know charset?
Here are some general guidelines you might like to consider:
o Have the UI layer pass a tag identifying the character encoding, unless the UI layer maps the data to one of the Unicode representations (UTF-8, UTF-16) before passing the data on.
o Have the UI layer pass a tag identifying the locale (language + region). This is independent of the encoding issue, and you'll need it if your back end does any locale-sensitive operations such as sorting.
o Have all generated pages include a META charset tag in the HTML header. This will ensure that the browser(s) submit form post data in the same encoding as the original HTML page. Its absence may be the source of your original problem.
Jim
Mike Brown wrote:
What is the best way to handle i18n when you are passed a string and you don't know the charset? I assume iso-8859-1 when I don't know the charset, BUT in some Spanish environments my data is coming out as garbage. It seems some of the characters are coming from iso-8859-2 (at least that's my first look). My component that handles processing of the string data is separate from the GUI where the user enters the data. The UI doesn't pass me any charset information.
Is the GUI collecting data through an HTML form? Browsers are intentionally disregarding the recommendations and sending form data without charset information "to keep old scripts from breaking". That's the argument I heard, anyway. I'm fighting this battle myself. What is the receiving end to do? I can tell you what I came up with.
If the browser is IE4 or IE5, there is an undocumented MS DHTML property, document.charset, which will tell you what charset the browser used to interpret the bytes of the HTML document, and in IE4/IE5's case, this will also be the charset used in the form submission. Here's an HTML snippet I use to report what the browser is assuming the HTML document's charset is:
<script type="text/javascript"><!--
  // Only IE 4.0 or higher exposes document.charset and document.defaultCharset.
  if ( navigator.userAgent.toLowerCase().indexOf("msie") != -1 && (parseInt(navigator.appVersion) >= 4) ) {
    document.write( "<p><tt>Since you are using IE 4.0 or higher and do not have scripting disabled, I can tell that this generated HTML document is being interpreted by the browser as <u>" + document.charset + "</u> and that the browser's default encoding happens to be <u>" + document.defaultCharset + "</u>.</tt></p>" );
  }
//--></script>
In theory you could pass this as a hidden parameter in the form dataset, and then the receiving application can know to look for it. However, this will require being able to re-scan the bytes in the form data part of the HTTP message so that they can be properly interpreted, so a typical one-pass HTTP servlet will not suffice. I'm not sure how it works in IE3, although I read that the charset for form data submissions will be determined by the OS's locale in that browser. Netscape Navigator 4.x is no better. Haven't tested Mozilla.
Regardless of the browser, you could also examine the Accept-Language HTTP header, whose highest-priority value you can take and map to a *likely* charset by relying on your environment's Locale resource bundles (Java Servlet Programming, pages 380-394) and a table of fallback mappings. However, this approach makes some really bad assumptions and is at best a stab in the dark.
Some applications just outright put a select box in the form and rely on the user to pick the language they're using. This still makes some assumptions, though, because as you pointed out with Spanish, there's not always a single charset for each language.
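The Accept-Language idea above might look roughly like the following sketch. Everything in it is an assumption made for illustration: the class name, the method names, and above all the fallback table, which is a hypothetical handful of language-to-charset guesses and not the mapping from any book or library. It picks the language tag with the highest q-value and walks from the full tag to its primary language until the table has an entry.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class CharsetGuess {
    // Hypothetical fallback table: a guess per language, nothing more.
    private static final Map<String, String> FALLBACK = new HashMap<>();
    static {
        FALLBACK.put("ja", "Shift_JIS");
        FALLBACK.put("zh-tw", "Big5");
        FALLBACK.put("zh", "GB2312");
        FALLBACK.put("pl", "ISO-8859-2");
        FALLBACK.put("es", "ISO-8859-1");
    }

    /** Returns the lower-cased language tag with the highest q-value, or null. */
    static String preferredLanguage(String header) {
        String best = null;
        double bestQ = -1.0;
        for (String part : header.split(",")) {
            String[] pieces = part.trim().split(";");
            String tag = pieces[0].trim().toLowerCase(Locale.ROOT);
            if (tag.isEmpty()) {
                continue;
            }
            double q = 1.0; // a missing q parameter means q=1
            for (int i = 1; i < pieces.length; i++) {
                String p = pieces[i].trim();
                if (p.startsWith("q=")) {
                    try {
                        q = Double.parseDouble(p.substring(2));
                    } catch (NumberFormatException e) {
                        q = 0.0; // treat an unparsable weight as lowest priority
                    }
                }
            }
            if (q > bestQ) {
                bestQ = q;
                best = tag;
            }
        }
        return best;
    }

    /** Maps the preferred language to a likely charset; defaults to ISO-8859-1. */
    static String likelyCharset(String acceptLanguage) {
        String tag = acceptLanguage == null ? null : preferredLanguage(acceptLanguage);
        while (tag != null) {
            String charset = FALLBACK.get(tag);
            if (charset != null) {
                return charset;
            }
            int dash = tag.lastIndexOf('-');
            tag = dash < 0 ? null : tag.substring(0, dash); // zh-cn -> zh
        }
        return "ISO-8859-1";
    }

    public static void main(String[] args) {
        System.out.println(likelyCharset("en;q=0.5, ja"));    // Shift_JIS
        System.out.println(likelyCharset("zh-TW, en;q=0.8")); // Big5
    }
}
```

As the message says, the result is only a likely charset: a Spanish-speaking user on a Central European system would defeat the table immediately.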
Since I'm in a Java environment, isn't there a way to go to UTF-8 and from UTF-8 determine the corresponding ISO (and other) charset?
No, there's nothing special about UTF-8 in this instance. You're dealing with a mystery sequence of bytes. You know they represent characters, but you don't know how the mappings work. Is it a one-to-one mapping of bytes to characters, or are some bytes taken 2, 3 or 4 at a time? You don't even know that much. Which bytes or byte sequences map to which characters? UTF-8 is a charset that maps 1 to 6 bytes to a character; ISO-8859-x is a charset that maps 1 byte to a character. (Before someone corrects me, I'm using the definition of charset as per UTR #17, and yes, I realize that charsets have bytes that map to non-characters.) Once you assume a charset, the only way you're going to
RE: How-To handle i18n when you don't know charset?
You can get the charset much more easily: IE5 and later fill a field named "_charset_" with the charset used for form submission, regardless of the initial value of this field. Other browsers will return data in the charset of the FORM page, and if you can set the charset of the FORM page you can also set this field to indicate the charset used to the CGI. This works the same for the GET and POST methods. IE4 and IE5 will submit characters that do not fit into the charset used for form submission as HTML numeric character references (&#12345;).
The simplest approach is to use UTF-8 throughout and label your FORM page with it; you just need to block browsers below version 4 or code specially for them.
-Original Message-
From: Mike Brown [mailto:[EMAIL PROTECTED]]
Sent: Thursday, July 06, 2000 11:19 AM
To: Unicode List
Subject: RE: How-To handle i18n when you don't know charset?
What is the best way to handle i18n when you are passed a string and you don't know the charset? I assume iso-8859-1 when I don't know the charset, BUT in some Spanish environments my data is coming out as garbage. It seems some of the characters are coming from iso-8859-2 (at least that's my first look). My component that handles processing of the string data is separate from the GUI where the user enters the data. The UI doesn't pass me any charset information.
Is the GUI collecting data through an HTML form? Browsers are intentionally disregarding the recommendations and sending form data without charset information "to keep old scripts from breaking". That's the argument I heard, anyway. I'm fighting this battle myself. What is the receiving end to do? I can tell you what I came up with. If the browser is IE4 or IE5, there is an undocumented MS DHTML property, document.charset, which will tell you what charset the browser used to interpret the bytes of the HTML document, and in IE4/IE5's case, this will also be the charset used in the form submission.
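Since the reply above notes that IE4 and IE5 submit out-of-charset characters as decimal numeric character references, a receiving application may want to turn them back into characters. The following is a minimal sketch under that assumption; the class and method names are hypothetical, and it handles only the decimal form. Note the inherent ambiguity: a user who literally typed such a reference into the form is indistinguishable from one whose browser generated it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NcrDecoder {
    // Decimal references only, e.g. &#20013; (1 to 7 digits keeps parseInt safe).
    private static final Pattern DECIMAL_NCR = Pattern.compile("&#([0-9]{1,7});");

    /** Replaces decimal numeric character references with the characters they name. */
    static String decode(String formValue) {
        Matcher m = DECIMAL_NCR.matcher(formValue);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            // Leave references outside the Unicode range untouched.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String decoded = decode("&#20013;&#25991;");
        System.out.println(decoded.equals("\u4E2D\u6587")); // true
    }
}
```

This step belongs after the form bytes have been decoded with the submission charset, because the references themselves are plain ASCII in every charset discussed here.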
Here's an HTML snippet I use to report what the browser is assuming the HTML document's charset is:
<script type="text/javascript"><!--
  // Only IE 4.0 or higher exposes document.charset and document.defaultCharset.
  if ( navigator.userAgent.toLowerCase().indexOf("msie") != -1 && (parseInt(navigator.appVersion) >= 4) ) {
    document.write( "<p><tt>Since you are using IE 4.0 or higher and do not have scripting disabled, I can tell that this generated HTML document is being interpreted by the browser as <u>" + document.charset + "</u> and that the browser's default encoding happens to be <u>" + document.defaultCharset + "</u>.</tt></p>" );
  }
//--></script>
In theory you could pass this as a hidden parameter in the form dataset, and then the receiving application can know to look for it. However, this will require being able to re-scan the bytes in the form data part of the HTTP message so that they can be properly interpreted, so a typical one-pass HTTP servlet will not suffice. I'm not sure how it works in IE3, although I read that the charset for form data submissions will be determined by the OS's locale in that browser. Netscape Navigator 4.x is no better. Haven't tested Mozilla.
Regardless of the browser, you could also examine the Accept-Language HTTP header, whose highest-priority value you can take and map to a *likely* charset by relying on your environment's Locale resource bundles (Java Servlet Programming, pages 380-394) and a table of fallback mappings. However, this approach makes some really bad assumptions and is at best a stab in the dark.
Some applications just outright put a select box in the form and rely on the user to pick the language they're using. This still makes some assumptions, though, because as you pointed out with Spanish, there's not always a single charset for each language.
Since I'm in a Java environment, isn't there a way to go to UTF-8 and from UTF-8 determine the corresponding ISO (and other) charset?
No, there's nothing special about UTF-8 in this instance. You're dealing with a mystery sequence of bytes.
You know they represent characters, but you don't know how the mappings work. Is it a one-to-one mapping of bytes to characters, or are some bytes taken 2, 3 or 4 at a time? You don't even know that much. Which bytes or byte sequences map to which characters? UTF-8 is a charset that maps 1 to 6 bytes to a character; ISO-8859-x is a charset that maps 1 byte to a character. (Before someone corrects me, I'm using the definition of charset as per UTR #17, and yes, I realize that charsets have bytes that map to non-characters.)
Once you assume a charset, the only way you're going to know whether it was the right choice, aside from recognizing invalid byte sequences for certain charsets like UTF-8 and UTF-16[BE/LE], is when you look at the characters you got and say "hey, that's not what I was expecting". So the only solution seems to me to be to know precisely what you are expecting to receive (known character sequences), and what those sequences look like as byte sequences in different encodings. I think the only way to do it right is to come up with
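The remark above about recognizing invalid byte sequences can be sketched with a strict decoder. This assumes a modern Java runtime (java.nio.charset arrived after this thread was written), and the class and method names are hypothetical. A decoder configured to report errors rejects bytes that are not well-formed UTF-8, while a single-byte charset such as ISO-8859-1 accepts every byte sequence, so success there proves nothing.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecode {
    /** Returns the decoded text, or null if the bytes are not valid in the charset. */
    static String decodeOrNull(byte[] bytes, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null; // the assumed charset cannot be right for these bytes
        }
    }

    public static void main(String[] args) {
        byte[] latin1EAcute = { (byte) 0xE9 };             // e-acute in ISO-8859-1
        byte[] utf8EAcute = { (byte) 0xC3, (byte) 0xA9 };  // e-acute in UTF-8

        // A lone 0xE9 is a truncated multi-byte sequence in UTF-8, so it is rejected.
        System.out.println(decodeOrNull(latin1EAcute, StandardCharsets.UTF_8) == null); // true

        // The UTF-8 bytes decode "successfully" as ISO-8859-1, but to two wrong characters.
        System.out.println(decodeOrNull(utf8EAcute, StandardCharsets.ISO_8859_1).length()); // 2
    }
}
```

A failed strict decode rules a charset out; a successful one only means the guess is still possible, which matches the point that the final check is whether the characters are the ones expected.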