Displaying Chinese on HTML

2000-07-06 Thread Parvinder Singh(EHPT)

Hello friends,
I am using Unicode (UTF-8) to store data in an Oracle database (the database character set is UTF-8).

In one of the columns of a table, I have stored Chinese data in UTF-8 format.

Now I want to retrieve it and display it on an HTML page through a Java servlet using JDBC.

I have used the HTML tag meta content=text/html charset=UTF-8

I use Netscape Navigator 4.6 and have set the character set to UTF-8 in the browser. I also have a lot of Chinese fonts in the browser, but I don't know which one to choose. Anyway, I tried various combinations, but the Chinese never got displayed!

The HTML page shows inverted question marks as output!

Can somebody help in displaying the desired language?

Also, while accepting Chinese or any other language from HTML, does my servlet code need to change? Or will it automatically convert the HTML's UTF-8 input to UCS-2 and transparently insert the string into the Oracle database as UTF-8 using JDBC?

King regards,
Parvinder








RE: Furigana codes?

2000-07-06 Thread Marco . Cimarosti

Daniel Biddle wrote:
 On Wed, 5 Jul 2000, Rick McGowan wrote:
  iRck
 I thought this was a typo until I saw your address. ☺

It's not a typo: Rick's signature has passed through an Indic renderer, so
the "i" was reordered. ：－）

_ Maco`



Re: Any other Italians on Unicode List? (was RE: French annotated Cha

2000-07-06 Thread Mark Davis

Such lists of translations for the glossary terms in Unicode would be quite
useful. If these are produced, be sure to request their addition to Useful
Resources on the Unicode site.

Mark

Antoine Leca wrote:

 Patrick Andries wrote:
 
  - Original Message -
  From: "Marco Piovanelli" [EMAIL PROTECTED]
   Marco Cimarosti ([EMAIL PROTECTED]) wrote:
  
   Patrick Andries' lexicon made me wonder how some of these terms would
   possibly be like in Italian!?
  
   kerning:   kerning (?)
  
   I think I've seen this translated as "crenatura".
 
  I think so too; it seems the origin is the same as that of the French
  equivalent (crénage).
  http://www.pcpratico.futura-ge.com/servizi/glossario/vocaboli/Voc_crenatura.asp,
  http://www.giofuga.com/lettering/lettcre.htm .

 This is also what <URL:http://www.irisa.fr/faqtypo/dico.html> gives.

 By the way, there are a lot of typography-related terms there, but this is
 an ongoing effort, so if people fluent in French or Italian (or Spanish or
 German) can help, I am certain Jacques André will be delighted.
 (In fact, he already asked me, but the limits of my knowledge are far below
 the current content ;-)).

 Antoine




Re: Planes 1 and 2

2000-07-06 Thread John H. Jenkins

At 9:11 PM -0800 7/5/00, Doug Ewell wrote:
 Kenneth Whistler [EMAIL PROTECTED] wrote:

  If you want the general planned layout for Planes 1 or 2, the best
  source is Michael Everson's graphical roadmaps, located at:

  http://www.egt.ie/standards/iso10646/ucs-roadmap.html

 The link to Plane 2 was broken the last time I checked.


Well, but the roadmap for Plane 2 is *really* simple.  You just fill 
it with 40,000+ ideographs from the bottom up.
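
For readers unfamiliar with the arithmetic behind this remark, here is a minimal Python sketch (not from the thread; the helper names `plane_of` and `utf16_surrogates` are made up for illustration). Plane 2 covers U+20000 through U+2FFFF, which is 65,536 code points, so 40,000+ ideographs fit. Filled from the bottom up, the first one lands at U+20000, which UTF-16 represents as a surrogate pair.

```python
def plane_of(code_point):
    """Return the plane number (0-16) that a code point belongs to."""
    return code_point >> 16


def utf16_surrogates(code_point):
    """Split a supplementary code point (U+10000 and above) into its
    UTF-16 surrogate pair."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


first_in_plane_2 = 0x20000
print(plane_of(first_in_plane_2))
print([hex(unit) for unit in utf16_surrogates(first_in_plane_2)])
```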

-- 
=
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://www.blueneptune.com/~tseng



Re: Planes 1 and 2

2000-07-06 Thread Michael Everson

At 07:17 -0800 2000-07-06, John H. Jenkins wrote:

 The link to Plane 2 was broken the last time I checked.

'Tis fixed.

 Well, but the roadmap for Plane 2 is *really* simple.  You just fill
 it with 40,000+ ideographs from the bottom up.

You missed the WG2 meeting where we argued whether to fill it from the
bottom up or the top down ;-)

Michael Everson  **  Everson Gunn Teoranta  **   http://www.egt.ie
15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland
Vox +353 1 478 2597 ** Fax +353 1 478 2597 ** Mob +353 86 807 9169
27 Páirc an Fhéithlinn;  Baile an Bhóthair;  Co. Átha Cliath; Éire





RE: Displaying Chinese on HTML

2000-07-06 Thread Ayers, Mike


Please be kind and make sure not to send messages in HTML format.
Thank you.  Comments below.

From: Parvinder Singh(EHPT) [mailto:[EMAIL PROTECTED]]
Sent: Thursday, July 06, 2000 12:23 AM

 Hello friends,
 I am using Unicode (UTF-8) to store data in an Oracle database
 (the database character set is UTF-8).
 In one of the columns of a table, I have stored Chinese data
 in UTF-8 format.
 Now I want to retrieve it and display it on an HTML page through
 a Java servlet using JDBC.
 I have used the HTML tag meta content=text/html charset=UTF-8
 I use Netscape Navigator 4.6 and have set the character set to
 UTF-8 in the browser. I also have a lot of Chinese fonts in the
 browser, but I don't know which one to choose. Anyway, I tried
 various combinations, but the Chinese never got displayed!

I pointed my browser to
http://www.trigeminal.com/samples/provincial.html, which gives samples of
text in a whole bunch of languages, courtesy of Michael Kaplan.  I found
that Netscape (I have 4.7) fails to display many of the lines, while
Explorer (5.00) shows every one for which I have a font (I am doing this
from NT4SP5).  I have not yet investigated why Netscape isn't rendering
(hints, anyone?).  I suggest you try reading your pages with Explorer first.
This may help you confirm that your pages are valid.  Use the above link to
"calibrate" your browsers first!

 The HTML page shows inverted question marks as output!

This is expected behavior, I think.  I get hollow boxes for the
unrenderable characters, but I've heard of other people getting question
marks.  I do not know if the question marks and hollow boxes each mean
something different, though.
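
One possible source of question marks, offered as an assumption rather than a diagnosis of this particular setup, is a conversion step somewhere between the database and the browser that targets a charset without Chinese characters. The Python sketch below (sample text and variable names are illustrative) contrasts that lossy case with the case where correct UTF-8 bytes are merely read with the wrong charset. An inverted question mark may likewise be a substitution character chosen by some conversion layer, but that is a guess.

```python
# Two sample Chinese characters (U+4E2D U+6587), chosen only as test data.
text = "\u4e2d\u6587"

# Case 1: the text is encoded to a charset that has no Chinese characters.
# Each unencodable character is replaced by '?', and the data is lost.
lossy = text.encode("iso-8859-1", errors="replace")

# Case 2: correct UTF-8 bytes are decoded with the wrong charset.
# Nothing is lost yet, but the result is six unrelated Latin-1 characters.
misread = text.encode("utf-8").decode("iso-8859-1")

# Case 2 can be undone by reversing the wrong step; case 1 cannot.
recovered = misread.encode("iso-8859-1").decode("utf-8")

print(lossy)              # b'??'
print(len(misread))       # 6
print(recovered == text)  # True
```

If the page shows one question mark per expected character, the first case is the more likely suspect; if it shows roughly three odd characters per expected character, the bytes are probably intact and only the declared charset is wrong.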

 Can somebody help in displaying the desired language?
 Also, while accepting Chinese or any other language from HTML, does
 my servlet code need to change? Or will it automatically convert
 the HTML's UTF-8 input to UCS-2 and transparently insert
 the string into the Oracle database as UTF-8 using JDBC?

If you're reading it in as UTF-8 and storing it as UTF-8, why do you
want to convert it?  Am I missing something?
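
A conversion usually still happens even when both ends are UTF-8, because the program in the middle works with text rather than bytes (a Java String, for instance, holds UTF-16 code units). The Python sketch below illustrates the general pattern of decoding once on input and encoding once on output with the same charset the page declares. The function names and the `PAGE_CHARSET` constant are assumptions for illustration, not part of any servlet API.

```python
import html

PAGE_CHARSET = "utf-8"  # assumed to match the charset declared in the page


def read_form_value(raw_bytes):
    """Decode bytes received from the browser into a text string."""
    return raw_bytes.decode(PAGE_CHARSET)


def render_page(value):
    """Build the page as text, then encode it once with the declared charset."""
    page = (
        '<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=' + PAGE_CHARSET + '"></head>'
        "<body><p>" + html.escape(value) + "</p></body></html>"
    )
    return page.encode(PAGE_CHARSET)


incoming = b"\xe4\xb8\xad\xe6\x96\x87"  # UTF-8 bytes for two Chinese characters
value = read_form_value(incoming)
print(len(incoming), len(value))        # 6 bytes, 2 characters
print(incoming in render_page(value))   # the same bytes reach the browser
```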

King regards, 

Is King your dog?  ;-)


HTH,

/|/|ike

Parvinder 
  



Re: Control characters

2000-07-06 Thread John Cowan

On Wed, 5 Jul 2000, john wrote:

  IIRC, the Model 37 Teletype interpreted 0A as a newline function,
 
 Also models 33 and 38, which also interpreted x0D as carriage return.

Definitely not true of the model 33; it interpreted 0A as a line-feed,
and if it received one not preceded by 0D
                                         it would do this.
(Hopefully, you are all reading this email with a fixed-width font as
God intended.)

  so ASCII allowed 0A to be interpreted as either LF or NL. 
 
 That's non sequitur, but folks are like that.

How so?  The LF behavior is different from the NL behavior.

  DEC OSes notoriously distorted or misused the control characters, thus
  ^U = NAK was used to kill an input line instead of ^X = cancel.
 
 Since some of these editing commands were actually
 merely echoed back from the main processor to the comm control
 unit through which the terminal was connected,

Definitely not true of any DEC OS; control characters were echoed as ^A, ^B,
etc.
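
The echo convention mentioned here is the usual caret notation: a control character is shown as ^ followed by the ASCII character 0x40 above it. A minimal Python sketch of that mapping follows (the function name is made up, and this illustrates the notation only, not any DEC implementation):

```python
def caret_notation(ch):
    """Render an ASCII control character as ^X; return other characters unchanged."""
    code = ord(ch)
    if code < 0x20 or code == 0x7F:
        # Flipping bit 6 maps 0x01 to 'A', 0x15 to 'U', 0x18 to 'X', 0x7F to '?'.
        return "^" + chr(code ^ 0x40)
    return ch


print(caret_notation("\x15"))  # NAK, the line-kill character discussed above
print(caret_notation("\x18"))  # CAN
```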

 there was some
 fogging over of the concepts of source and destination.  The comm
 controller would buffer up what was typed until it got a CR (0x0D)
 and so these editing controls were actually commands to that comm
 controller to clear its buffer.

Again, not true of any DEC OS; characters were interpreted one by one
and selectively echoed by the CPU only.
There were no buffering serial-line controllers for the PDP-8, and they
weren't introduced for the PDP-11 until later -- and even then, the typical
mode was to stop buffering on *any* control character.

-- 
John Cowan   [EMAIL PROTECTED]
"You need a change: try Canada"  "You need a change: try China"
--fortune cookies opened by a couple that I know





Re: Planes 1 and 2

2000-07-06 Thread Antoine Leca

Michael Everson wrote:
 
 Well, but the roadmap for Plane 2 is *really* simple.  You just fill
 it with 40,000+ ideographs from the bottom up.
 
 You missed the WG2 meeting where we argued whether to fill it from the
 bottom up or the top down ;-)

I must certainly have missed something, but it was my understanding that
with CJK, when filling from top to bottom, we should fill from left to
right, i.e. from U+2xxxF to U+2xxx0.

But WG2 is *not* doing so, are you?


;-)

Antoine



RE: How-To handle i18n when you don't know charset?

2000-07-06 Thread Leon Spencer


If I'm dealing with e-mail (POP3 and SMTP), do I necessarily
want to respond to the user in the same charset as their original
message to me? So far, I've convinced myself that the answer is no.

Leon

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
 Sent: Thursday, July 06, 2000 12:12 PM
 To: Unicode List
 Cc: Unicode List
 Subject: Re: How-To handle i18n when you don't know charset?
 
 
 Here are some general guidelines you might like to consider :-
 
 o Have the UI layer pass a tag identifying the character encoding, unless
 the UI layer maps the data to one of the Unicode representations (UTF-8,
 UTF-16) before passing the data on.
 
 o Have the UI layer pass a tag identifying the locale (language+region).
 You'll need this if your back end does any locale-sensitive operations
 such as sorting; it is independent of the encoding issue.
 
 o Have all the generated pages include a META charset tag in the HTML
 header. This will ensure that the browser(s) submit form post data in the
 same encoding as the original HTML page. This may be the source of your
 original problem.
 
 Jim
 
 Mike Brown wrote:
 
   What is the best way to handle i18n when you are passed a string and
   you don't know the charset? I assume iso-8859-1 when I don't know the
   charset BUT on some Spanish environments my data is coming out
   garbage. It seems some of the characters are coming from iso-8859-2
   (at least that's my first look).
  
   My component that handles processing of the string data is separate
   from the GUI where the user enters the data. The UI doesn't pass me
   any charset information.
 
  Is the GUI collecting data through an HTML form? Browsers are intentionally
  disregarding the recommendations and sending form data without charset
  information "to keep old scripts from breaking". That's the argument I
  heard, anyway. I'm fighting this battle, myself. What is the receiving end
  to do? I can tell you what I came up with.
 
  If the browser is IE4 or IE5, there is an undocumented MS DHTML property,
  document.charset, which will tell you what charset the browser used to
  interpret the bytes of the HTML document, and in IE4/IE5's case, this will
  also be the charset used in the form submission. Here's an HTML snippet I
  use to report what the browser is assuming the HTML document's charset is:
 
  <script type="text/javascript"><!--
  // Report the charset IE 4+ is using to interpret this document.
  if ( navigator.userAgent.toLowerCase().indexOf("msie") != -1 &&
       parseInt(navigator.appVersion) >= 4 ) {
    document.write( "<p><tt>Since you are using IE 4.0 or higher and do not "
      + "have scripting disabled, I can tell that this generated HTML document "
      + "is being interpreted by the browser as <u>" + document.charset
      + "</u> and that the browser's default encoding happens to be <u>"
      + document.defaultCharset + "</u>.</tt></p>" );
  }
  //--></script>
 
  In theory you could pass this as a hidden parameter in the form dataset and
  then the receiving application can know to look for it. However this will
  require being able to re-scan the bytes in the form data part of the HTTP
  message so that they can be properly interpreted, so a typical one-pass HTTP
  servlet will not suffice. I'm not sure how it works in IE3 although I read
  that the charset for form data submissions will be determined by the OS's
  locale in that browser. Netscape Navigator 4.x is no better. Haven't tested
  Mozilla.
 
  Regardless of the browser, you could also examine the Accept-Language HTTP
  header, the highest priority value in which you can take and map to a
  *likely* charset by relying on your environment's Locale resource bundles
  (Java Servlet Programming, pages 380-394) and a table of fallback mappings.
  However, this approach makes some really bad assumptions and is at best a
  stab in the dark.
 
  Some applications just outright put a select box in the form and rely on the
  user to pick the language they're using. This still makes some assumptions,
  though, because as you pointed out with Spanish, there's not always a single
  charset for each language.
 
   Since I'm in a Java environment, isn't there a way to go
   to UTF-8 and from UTF-8 determine the corresponding ISO
   (and other) charset?
 
  No, there's nothing special about UTF-8 in this instance. You're dealing
  with a mystery sequence of bytes. You know they represent characters, but
  you don't know how the mappings work. Is it a one-to-one mapping of bytes to
  characters, or are some bytes taken 2, 3 or 4 at a time? You don't even know
  that much. Which bytes or byte sequences map to which characters? UTF-8 is a
  charset that maps 1 to 6 bytes to a character; ISO-8859-x is a charset that
  maps 1 byte to a character. (Before someone corrects me, I'm using the
  definition of charset as per UTR #17, and yes, I realize that charsets have
  bytes that map to non-characters.)
 
  Once you assume a charset, the only way you're going to 
 

RE: How-To handle i18n when you don't know charset?

2000-07-06 Thread Chris Wendt

You can get the charset much more easily:

IE5 and later versions of IE fill a field named "_charset_" with the charset
used for form submission, regardless of the initial value of this field.
Other browsers will return data in the charset of the FORM page, and if you
can set the charset of the FORM page you can also set this field to
indicate the charset used to the CGI.
This works the same for the GET and POST methods.
IE4 and IE5 will submit characters that do not fit into the charset used for
form submission as HTML numeric character references (&#12345;).

The simplest approach is to use UTF-8 throughout and label your FORM page
with it; you just need to block browsers below version 4 or code specially
for them.
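
A rough sketch of how a receiving script might use the "_charset_" field described above, written in Python for brevity. The function name `decode_form` and the ISO-8859-1 default are assumptions for illustration. The query string is parsed twice: once as Latin-1, which cannot fail and leaves the ASCII "_charset_" value readable, and once with the charset that the field names.

```python
import codecs
from urllib.parse import parse_qs


def decode_form(raw_query, default_charset="iso-8859-1"):
    """Decode a URL-encoded form body using the charset named in "_charset_"."""
    # Pass 1: Latin-1 maps every byte to a character, so this never fails.
    first_pass = parse_qs(raw_query, keep_blank_values=True, encoding="latin-1")
    charset = first_pass.get("_charset_", [default_charset])[0]
    try:
        codecs.lookup(charset)          # reject names Python does not know
    except LookupError:
        charset = default_charset
    # Pass 2: decode the percent-escaped bytes with the reported charset.
    try:
        fields = parse_qs(raw_query, keep_blank_values=True,
                          encoding=charset, errors="replace")
    except LookupError:                 # a known codec that is not a text encoding
        charset = default_charset
        fields = parse_qs(raw_query, keep_blank_values=True,
                          encoding=charset, errors="replace")
    return charset, fields


charset, fields = decode_form("_charset_=utf-8&name=%E4%B8%AD%E6%96%87")
print(charset, fields["name"])
```

The second pass is the "re-scan" that a one-pass parser cannot do, which is why the raw query string has to be kept around until the charset is known.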



-Original Message-
From: Mike Brown [mailto:[EMAIL PROTECTED]]
Sent: Thursday, July 06, 2000 11:19 AM
To: Unicode List
Subject: RE: How-To handle i18n when you don't know charset?


 What is the best way to handle i18n when you are passed a string and
 you don't know the charset? I assume iso-8859-1 when I don't know the
 charset BUT on some Spanish environments my data is coming out
 garbage. It seems some of the characters are coming from iso-8859-2
 (at least that's my first look). 
  
 My component that handles processing of the string data is separate
 from the GUI where the user enters the data. The UI doesn't pass me
 any charset information. 

Is the GUI collecting data through an HTML form? Browsers are intentionally
disregarding the recommendations and sending form data without charset
information "to keep old scripts from breaking". That's the argument I
heard, anyway. I'm fighting this battle, myself. What is the receiving end
to do? I can tell you what I came up with.

If the browser is IE4 or IE5, there is an undocumented MS DHTML property,
document.charset, which will tell you what charset the browser used to
interpret the bytes of the HTML document, and in IE4/IE5's case, this will
also be the charset used in the form submission. Here's an HTML snippet I
use to report what the browser is assuming the HTML document's charset is:

<script type="text/javascript"><!--
// Report the charset IE 4+ is using to interpret this document.
if ( navigator.userAgent.toLowerCase().indexOf("msie") != -1 &&
     parseInt(navigator.appVersion) >= 4 ) {
  document.write( "<p><tt>Since you are using IE 4.0 or higher and do not "
    + "have scripting disabled, I can tell that this generated HTML document "
    + "is being interpreted by the browser as <u>" + document.charset
    + "</u> and that the browser's default encoding happens to be <u>"
    + document.defaultCharset + "</u>.</tt></p>" );
}
//--></script>

In theory you could pass this as a hidden parameter in the form dataset and
then the receiving application can know to look for it. However this will
require being able to re-scan the bytes in the form data part of the HTTP
message so that they can be properly interpreted, so a typical one-pass HTTP
servlet will not suffice. I'm not sure how it works in IE3 although I read
that the charset for form data submissions will be determined by the OS's
locale in that browser. Netscape Navigator 4.x is no better. Haven't tested
Mozilla.

Regardless of the browser, you could also examine the Accept-Language HTTP
header, the highest priority value in which you can take and map to a
*likely* charset by relying on your environment's Locale resource bundles
(Java Servlet Programming, pages 380-394) and a table of fallback mappings.
However, this approach makes some really bad assumptions and is at best a
stab in the dark.
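
To make the Accept-Language idea concrete, here is a minimal Python sketch. The `LIKELY_CHARSET` table and both function names are invented for illustration, and the mappings in the table are exactly the kind of guess the paragraph above warns about.

```python
# Hypothetical fallback table: primary language tag -> guessed charset.
LIKELY_CHARSET = {"es": "iso-8859-1", "pl": "iso-8859-2", "ja": "shift_jis"}


def languages_by_priority(header):
    """Return the language tags of an Accept-Language value, best first."""
    entries = []
    for position, item in enumerate(header.split(",")):
        parts = [part.strip().lower() for part in item.split(";")]
        tag = parts[0]
        if not tag:
            continue
        quality = 1.0                    # q defaults to 1 when absent
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0        # treat a malformed q as lowest priority
        # Sort by descending quality, then by original position.
        entries.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(entries)]


def likely_charset(header, default="iso-8859-1"):
    """Map the highest-priority known language to a guessed charset."""
    for tag in languages_by_priority(header):
        primary = tag.split("-")[0]
        if primary in LIKELY_CHARSET:
            return LIKELY_CHARSET[primary]
    return default


print(languages_by_priority("en;q=0.5, es-MX, pl;q=0.8"))
print(likely_charset("pl, en;q=0.3"))
```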

Some applications just outright put a select box in the form and rely on the
user to pick the language they're using. This still makes some assumptions,
though, because as you pointed out with Spanish, there's not always a single
charset for each language.

 Since I'm in a Java environment, isn't there a way to go
 to UTF-8 and from UTF-8 determine the corresponding ISO
 (and other) charset?

No, there's nothing special about UTF-8 in this instance. You're dealing
with a mystery sequence of bytes. You know they represent characters, but
you don't know how the mappings work. Is it a one-to-one mapping of bytes to
characters, or are some bytes taken 2, 3 or 4 at a time? You don't even know
that much. Which bytes or byte sequences map to which characters? UTF-8 is a
charset that maps 1 to 6 bytes to a character; ISO-8859-x is a charset that
maps 1 byte to a character. (Before someone corrects me, I'm using the
definition of charset as per UTR #17, and yes, I realize that charsets have
bytes that map to non-characters.)

Once you assume a charset, the only way you're going to know whether it was
the right choice, aside from recognizing invalid byte sequences for certain
charsets like UTF-8 and UTF-16[BE/LE], is when you look at the characters
you got and say "hey that's not what I was expecting". So the only solution
seems to me to be to know precisely what you are expecting to receive (known
character sequences), and what those sequences look like as byte sequences
in different encodings.
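
The point about invalid byte sequences can be sketched in a few lines of Python. The function name `guess_decode` and the ISO-8859-1 fallback are assumptions for illustration. Strict UTF-8 decoding rejects most non-UTF-8 input, but once the fallback is reached, any single-byte charset will "succeed", so the result still has to be checked against what was expected.

```python
def guess_decode(raw, fallback="iso-8859-1"):
    """Try strict UTF-8 first; fall back to a single-byte charset on failure."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Every byte is valid in the fallback, so this step cannot detect errors.
        return raw.decode(fallback), fallback


spanish = "espa\u00f1ol"                       # contains n with tilde
print(guess_decode(spanish.encode("utf-8")))
print(guess_decode(spanish.encode("iso-8859-1")))

# The same Latin-1 bytes read as ISO-8859-2 decode without error,
# but byte 0xF1 becomes n with acute (U+0144) instead of n with tilde.
print(spanish.encode("iso-8859-1").decode("iso-8859-2") == "espa\u0144ol")
```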

I think the only way to do it right is to come up with