JSSI problem with DBCS words !

Waily Wed, 25 Aug 1999 17:04:40 -0700
Apache JSSI 1.1.2 has a problem processing jhtml pages that contain DBCS
characters.

After getting jssi to work under Apache 1.3.6 for OS/2, I tested a jhtml
page with a <SERVLET> </SERVLET> tag and found that if the page has
some DBCS characters (Chinese Big5) before the <SERVLET> </SERVLET> tag,
the text in the page generated by jssi ends up in the wrong position.
For example, I made a jhtml page with only one line, like this:

-------- hello.jhtml ----------
<SERVLET CODE="HelloWorldServlet.class"> </SERVLET>
-------------------------------

Then everything works fine.

But if I insert a DBCS (Chinese Big5) character in front of that line, like this:
( '我' is a Chinese Big5 character; your system may show it as '§Ú' or otherwise
  fail to display it. )

-------- hello.jhtml ----------
我
<SERVLET CODE="HelloWorldServlet.class"> </SERVLET>
-------------------------------

the page generated by jssi contains a stray '>' on its last line!
Now add one more DBCS character, like this:

-------- hello.jhtml ----------
我我
<SERVLET CODE="HelloWorldServlet.class"> </SERVLET>
-------------------------------

and the page generated by jssi contains a stray 'T>' on its last line!

Java supports Unicode and displays the characters correctly when the right
charset is set. The reason a single DBCS character shifts the output is that
jssi does not count the right number of bytes for it. A DBCS character
needs two bytes, but an ASCII character needs only one.
When jssi parses an incoming jhtml page, it counts its way through the
page to decide where to break it and insert the servlet output.
However, the Java VM counts one DBCS character as ONE character, and jssi
treats that as ONE BYTE. That is why 'T>' appears on the last line of the
generated page: jssi counts the two DBCS characters as TWO BYTES when
they are actually FOUR BYTES. jssi then computes the wrong position, ends
up two bytes off, and we get the 'T>'!
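
The shift can be reproduced in isolation. This is only a rough sketch of the effect, assuming (I have not checked this against the jssi source) that the end of the </SERVLET> tag is found as a character index and then used as a byte offset. The class and method names are made up for illustration. The DBCS character is written as the two raw bytes 0xA7 0xDA (the ones that show up as '§Ú' in ISO8859_1), so no Big5 charset has to be installed.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of jssi.
class OffsetMismatchDemo {

    // Builds a page of `dbcsChars` two-byte characters followed by the servlet
    // line, then returns what is left over when a CHARACTER index is wrongly
    // used as a BYTE offset for "everything after </SERVLET>".
    static String tailWithCharOffset(int dbcsChars) {
        String tag = "\n<SERVLET CODE=\"HelloWorldServlet.class\"> </SERVLET>";
        byte[] tagBytes = tag.getBytes(StandardCharsets.US_ASCII);

        byte[] page = new byte[2 * dbcsChars + tagBytes.length];
        for (int i = 0; i < dbcsChars; i++) {
            page[2 * i] = (byte) 0xA7;      // lead byte of the DBCS character
            page[2 * i + 1] = (byte) 0xDA;  // trail byte of the DBCS character
        }
        System.arraycopy(tagBytes, 0, page, 2 * dbcsChars, tagBytes.length);

        // Position of the end of the tag counted in characters:
        // each DBCS character counts as 1, not 2.
        int charEnd = dbcsChars + tag.length();

        // Assumed bug: the character index is applied to the byte array.
        return new String(page, charEnd, page.length - charEnd,
                          StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        for (int n = 0; n <= 2; n++) {
            System.out.println(n + " DBCS -> '" + tailWithCharOffset(n) + "'");
        }
        // prints:
        // 0 DBCS -> ''
        // 1 DBCS -> '>'
        // 2 DBCS -> 'T>'
    }
}
```

Each extra DBCS character leaves the offset one more byte short, which matches the '>' and 'T>' seen in the generated pages.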

the following java code will show the result about 'one word but two bytes'.

--------------------
String a = "§ÚaªºbªBc¤Í";
System.out.println(a.length() + "  " + a);
String b = new String(a.getBytes("UTF8"), "ISO8859_1");
System.out.println(b.length() + "  " + b);
String c = new String(b.getBytes("ISO8859_1"), "UTF8");
System.out.println(c.length() + "  " + c);
--------------------

the result is ..

--------------------
7  §ÚaªºbªBc¤Í               // Unicode (Big5) display only 7 words count
15  ???a???b???c???          // ISO8859_1 display count 15 words (correct bytes)
7  §ÚaªºbªBc¤Í               // Back to Unicode display only 7 words count
--------------------


I have some thoughts about solving the problem.

One: when parsing the jhtml page, just use
new String(a.getBytes("UTF8"), "ISO8859_1");
to convert the content so that the bytes are counted correctly, and then use
new String(b.getBytes("ISO8859_1"), "UTF8");
to convert it back. Since ISO8859_1 is a standard in many environments all over
the world, I suggest adding one more init parameter named 'transEncoding' (or
some other name) that can only be 'yes' or 'no'. It would determine whether jssi
needs to translate the encoding, using ISO8859_1 and the user-defined charset
(another jssi init parameter), in order to count the bytes correctly. Users
could then set the correct charset for their own country.
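
The first idea could look roughly like the following. It is a sketch only: the class, the helper names and the boolean standing in for the proposed 'transEncoding' parameter are hypothetical, and nothing here is taken from the jssi source. The test character is written as the escape \u6211 so the example does not depend on the encoding of the source file, and UTF-8 stands in for the user-defined charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of jssi.
class TransEncodingSketch {

    // Re-reads the encoded bytes as ISO-8859-1, so that one char == one byte.
    static String toByteView(String text, Charset pageCharset) {
        return new String(text.getBytes(pageCharset), StandardCharsets.ISO_8859_1);
    }

    // Reverses toByteView(): turns the one-char-per-byte string back into text.
    static String fromByteView(String byteView, Charset pageCharset) {
        return new String(byteView.getBytes(StandardCharsets.ISO_8859_1), pageCharset);
    }

    // Length as a parser would see it, with the proposed switch on or off.
    static int countedLength(String text, Charset pageCharset, boolean transEncoding) {
        return transEncoding ? toByteView(text, pageCharset).length() : text.length();
    }

    public static void main(String[] args) {
        Charset pageCharset = StandardCharsets.UTF_8; // stand-in for the user-defined charset
        String text = "\u6211a";                      // one Chinese character plus 'a'

        System.out.println(countedLength(text, pageCharset, false)); // prints 2
        System.out.println(countedLength(text, pageCharset, true));  // prints 4

        String roundTrip = fromByteView(toByteView(text, pageCharset), pageCharset);
        System.out.println(roundTrip.equals(text));                  // prints true
    }
}
```

The round trip loses nothing because ISO-8859-1 maps every byte value 0-255 to exactly one character.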

Two: Java supports Unicode through the InputStreamReader/OutputStreamWriter
classes, which handle character encoding. If jssi used these I/O classes, it
could solve the DBCS problem. Apache 1.3.6 for OS/2 Warp with Jserv 1.0final
works fine with the charset Big5 (UTF8) and displays Chinese characters
correctly in the Netscape browser. jssi, however, uses the InputStream/OutputStream
I/O classes, does not deal well with DBCS characters, and counts the wrong bytes.
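
The second idea can be sketched like this. It is an assumption about how such a parser might be written, not the real jssi code: the class and method names are hypothetical, the servlet output is a fixed string, and UTF-8 is used as the page charset so the example runs without extra charsets installed. The page is decoded once, every position is then a character position, and the result is encoded once on the way out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not part of jssi.
class ReaderWriterSketch {

    // Replaces the first <SERVLET ...> </SERVLET> element with servletOutput.
    // Assumes the page contains exactly such an element.
    static byte[] render(byte[] pageBytes, Charset charset, String servletOutput) {
        try {
            // Decode bytes to characters once, at the input boundary.
            Reader in = new InputStreamReader(new ByteArrayInputStream(pageBytes), charset);
            StringBuilder page = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                page.append(buf, 0, n);
            }

            // Both indexes are character positions in the same decoded text.
            String closeTag = "</SERVLET>";
            int start = page.indexOf("<SERVLET");
            int end = page.indexOf(closeTag) + closeTag.length();

            // Encode characters back to bytes once, at the output boundary.
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            Writer out = new OutputStreamWriter(bytes, charset);
            out.write(page.substring(0, start));
            out.write(servletOutput);
            out.write(page.substring(end));
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Charset cs = StandardCharsets.UTF_8;
        String page = "\u6211\u6211\n<SERVLET CODE=\"HelloWorldServlet.class\"> </SERVLET>";
        String result = new String(render(page.getBytes(cs), cs, "Hello World"), cs);
        System.out.println(result.equals("\u6211\u6211\nHello World")); // prints true
    }
}
```

Because positions are never mixed between bytes and characters, nothing of the closing tag leaks onto the last line.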


I hope this will help make JSSI even better!


-- Waily Yang ------------------------------------- 
| Email      mailto:[EMAIL PROTECTED]           | 
| Homepage   http://www.HappyElec/Waily           | 
| Location   Taipei, Taiwan, R.O.C. (DBCS  BIG5)  | 
| Club       Team OS/2 in Taiwan|Power User Group | 
| Newsgroups news:tw.bbs.comp.os2                 | 
| Java & REXX & C++ Programmer using OS/2 TWarp   | 
--------------------------------------------------- 



