RE: UTF-8 and UTF-16 issues

2000-06-29 Thread Joe_Ross



Yes, we've done this a lot. It's usually a very easy transition.
Switching to UTF-16, by contrast, means that you have to start
representing text using a different data type (not char). This
usually requires a lot of rework. It probably depends on how much
legacy text processing code you have. With UTF-8 you will need a few
UTF-8 aware versions of string handling functions. For example, you
won't be able to use the OS mblen function, because your locale
encoding probably won't be UTF-8. Of course, you'll also need
conversion functions to convert between UTF-8 and the current
OS/locale encoding.
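
As a rough illustration of the kind of UTF-8 aware helper meant here, the
Python sketch below stands in for mblen: it reports the byte length of the
UTF-8 sequence that starts at a given offset by looking only at the lead
byte. The names utf8_mblen, utf8_char_count and utf8_to_locale are
hypothetical, the byte ranges follow UTF-8 as it is defined today (at most
four bytes), and the sketch assumes well-formed input, since continuation
bytes are not validated.

```python
def utf8_mblen(buf: bytes, pos: int = 0) -> int:
    """Return the byte length of the UTF-8 sequence starting at buf[pos].

    Only the lead byte is inspected; continuation bytes are not validated.
    """
    lead = buf[pos]
    if lead < 0x80:            # 0xxxxxxx: ASCII, one byte
        return 1
    if 0xC2 <= lead <= 0xDF:   # 110xxxxx: two-byte sequence
        return 2
    if 0xE0 <= lead <= 0xEF:   # 1110xxxx: three-byte sequence
        return 3
    if 0xF0 <= lead <= 0xF4:   # 11110xxx: four-byte sequence
        return 4
    raise ValueError(f"byte 0x{lead:02X} cannot start a UTF-8 sequence")


def utf8_char_count(buf: bytes) -> int:
    """Count characters by hopping from lead byte to lead byte."""
    pos = count = 0
    while pos < len(buf):
        pos += utf8_mblen(buf, pos)
        count += 1
    return count


def utf8_to_locale(buf: bytes, locale_encoding: str) -> bytes:
    """Convert UTF-8 bytes to a named OS/locale encoding.

    The encoding name is supplied by the caller; this sketch does not
    query the OS for it.
    """
    return buf.decode("utf-8").encode(locale_encoding)
```

A real port would also have to decide what to do with malformed input and
with characters that the locale encoding cannot represent.
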
Joe


"Jones, Bob" <[EMAIL PROTECTED]> on 06/28/2000 04:56:16 PM

To:   "Unicode List" <[EMAIL PROTECTED]>
cc:   (bcc: Joe Ross/Tivoli Systems)
Subject:  RE: UTF-8 and UTF-16 issues




Has anyone out there taken a cross platform non-Unicode enabled legacy
application and converted it to run UTF-8 instead of UTF-16?  I've read
Markus Kuhn's UTF-8/Unicode FAQ at
http://www.cl.cam.ac.uk/~mgk25/unicode.html and while it was helpful, it
only addresses Unix.  I also have to consider Windows and the AS/400.  With
Windows, I assume you would have to build everything with _UNICODE defined,
but leave your strings as char * or char arrays and that at every input
point convert from UTF-16 to UTF-8, or is there some other way within
Windows to tell the OS to give you data in Unicode, especially in a chosen
encoding scheme?

If you have done this with a legacy app, what problems did you run into?
Would you do it that way again or would you go ahead and bite the bullet and
modify all your code to be able to handle UTF-16?  It seems to me that the
big advantage of processing in UTF-16 instead of UTF-8 is that sizes are
consistent for a given number of characters, but the big advantage of UTF-8
is that normal string manipulation still works and less code needs to be
modified.  What other pros and cons for the different encoding schemes are
there?
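
To make the size trade-off concrete, here is a small Python sketch that
reports how many bytes the same text occupies in each encoding. The helper
name encoded_sizes and the sample strings are made up for illustration.

```python
def encoded_sizes(text: str) -> dict:
    """Return the code point count and the encoded size in bytes."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # "utf-16-le" is used so that no byte order mark is counted.
        "utf16_bytes": len(text.encode("utf-16-le")),
    }


# ASCII text, CJK text, and text containing one supplementary character.
for sample in ("abc", "\u65e5\u672c\u8a9e", "\U00010400bc"):
    print(ascii(sample), encoded_sizes(sample))
```

All three samples are three characters long. In UTF-8 they take 3, 9 and 6
bytes; in UTF-16 they take 6, 6 and 8 bytes. The last sample shows that
UTF-16 sizes are consistent only as long as no character needs a surrogate
pair.
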

Also, what kind of tricks are used to deal with database column sizes?
Currently, our applications run on SQL Server, Oracle, and DB2/400 and they
all handle Unicode differently.  Right now we have the same database column
layout no matter which database is used.  I suspect that may have to change,
i.e. CHAR(10) becomes NCHAR(10) on SQL Server, CHAR(30) on Oracle, and
CHAR(20) on DB2/400.
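
The three column sizes above look like worst-case arithmetic for a
10-character field, and the Python sketch below reproduces them. It is only
a guess at the reasoning: it assumes text limited to the Basic Multilingual
Plane, 3 bytes per character for UTF-8, 2 bytes per character for UTF-16
counted in bytes, and 1 unit per character where the column is declared in
16-bit units. The names WORST_CASE_UNITS and column_size are hypothetical.

```python
# Assumed worst-case storage units per BMP character. Supplementary
# characters are deliberately ignored in this sketch.
WORST_CASE_UNITS = {
    "utf8_bytes": 3,        # column length declared in bytes, UTF-8 data
    "utf16_bytes": 2,       # column length declared in bytes, UTF-16 data
    "utf16_code_units": 1,  # column length declared in 16-bit units
}


def column_size(chars: int, storage: str) -> int:
    """Declared length needed to hold `chars` BMP characters."""
    return chars * WORST_CASE_UNITS[storage]


for storage in WORST_CASE_UNITS:
    print(storage, column_size(10, storage))
```

For 10 characters this yields 30, 20 and 10, matching CHAR(30), CHAR(20)
and NCHAR(10). Whether each database really counts column lengths this way
should be checked against the vendor documentation.
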

Thanks,

Bob Jones
[EMAIL PROTECTED]

-----Original Message-----
From: Edward Cherlin [mailto:[EMAIL PROTECTED]]
Sent: Sunday, June 25, 2000 7:01 PM
To: Unicode List
Subject: Re: UTF-8 and UTF-16 issues


At 2:48 PM -0800 6/19/00, Markus Scherer wrote:
>"OLeary, Sean (NJ)" wrote:
> > UTF-16 is the 16-bit encoding of Unicode that includes the use of
> > surrogates. This is essentially a fixed width encoding.
>
>certainly not. utf-16, of course, is variable-width: 1 or 2 16-bit
>units per character. certainly the iuc discussion did not spread
>this under "utf-16" but possibly as "ucs-2".
[snip]

The essential distinction that Sean refers to is not that all
characters are encoded in the same length, but that all coding
elements are of the same length. This is in contrast not with ISO
10646, but with "double-byte" encodings of CJK text, where escape
sequences are used to switch between runs of 8-bit and 16-bit codes.

The point of the distinction is that in double-byte encodings the
only way to tell the length of the current character is by parsing
from the beginning of the file. In Unicode, the current 16-bit value
is explicitly a 16-bit character code (assigned, unassigned, or
Private Use), an upper surrogate code, a lower surrogate code, or not
a character code, without reference to what has gone before in the
file.
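
The context-free classification described above can be sketched in Python
as follows. The function names are hypothetical. "High" and "low" surrogate
are the usual names for what the paragraph calls upper and lower surrogate
codes, and the noncharacter ranges are the ones defined in the current
standard.

```python
def classify_utf16_unit(unit: int) -> str:
    """Classify one 16-bit code unit without looking at its neighbours."""
    if not 0 <= unit <= 0xFFFF:
        raise ValueError("not a 16-bit value")
    if 0xD800 <= unit <= 0xDBFF:
        return "high surrogate"
    if 0xDC00 <= unit <= 0xDFFF:
        return "low surrogate"
    if 0xE000 <= unit <= 0xF8FF:
        return "private use"
    if unit in (0xFFFE, 0xFFFF) or 0xFDD0 <= unit <= 0xFDEF:
        return "noncharacter"
    return "character code"  # assigned or unassigned


def char_start(units, i: int) -> int:
    """Index at which the character containing units[i] begins.

    At most one step back is ever needed, so there is no parsing from
    the beginning of the file.
    """
    if (i > 0
            and classify_utf16_unit(units[i]) == "low surrogate"
            and classify_utf16_unit(units[i - 1]) == "high surrogate"):
        return i - 1
    return i
```

For the unit sequence [0x0041, 0xD801, 0xDC00, 0x0042], char_start returns
1 for index 2 (the second half of a surrogate pair) and 3 for index 3.
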


Edward Cherlin
Generalist
"A knot!" exclaimed Alice. "Oh, do let me help to undo it."
Alice in Wonderland






RE: Java, SQL, Unicode and Databases

2000-06-23 Thread Joe_Ross



Michael, are you saying that the data type (char or nchar) doesn't matter? Are
you saying that if we just use UTF-16 or wchar_t interfaces to access the data
all will be fine and we will be able to store multilingual data even in fields
defined as char? Maybe things aren't as bad as I feared.

With respect to the web applications you describe, do they store the UTF-8 as
binary data? This wouldn't work for us, since we want other data mining
applications to be able to access the same data.

Thanks,
Joe

"Michael Kaplan (Trigeminal Inc.)" <[EMAIL PROTECTED]> on 06/23/2000
10:41:39 AM

To:   Unicode List <[EMAIL PROTECTED]>, Joe Ross/Tivoli Systems@Tivoli Systems
cc:   Hossein Kushki@IBMCA
Subject:  RE: Java, SQL, Unicode and Databases




Microsoft is very COM-based for its actual data access methods and COM
uses BSTRs that are BOM-less UTF-16. Because of that, the actual storage
format of any database ends up irrelevant since it will be converted to
UTF-16 anyway.

Given that this is what the data layers do, performance is certainly better
if there does not have to be an extra call to the Windows
MultiByteToWideChar to convert UTF-8 to UTF-16. So from a Windows
perspective, not only is it no trouble, but it is also the best possible
solution!

In any case, I know plenty of web people who *do* encode their strings in
SQL Server databases as UTF-8 for web applications, since UTF-8 is their
preference. They are willing to take the hit of "converting themselves"
because when data is being read it is faster to go through no conversions at
all.
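
For readers who want to see what "converting themselves" amounts to, here
is a Python stand-in for the round trip between UTF-8 bytes and BOM-less
UTF-16. It is not the Windows API. The helper names are hypothetical, and
little-endian byte order is an assumption made for the sketch.

```python
import codecs


def utf8_to_bomless_utf16(data: bytes) -> bytes:
    """Decode UTF-8 and re-encode as little-endian UTF-16 without a BOM."""
    return data.decode("utf-8").encode("utf-16-le")


def bomless_utf16_to_utf8(data: bytes) -> bytes:
    """Inverse conversion: BOM-less little-endian UTF-16 to UTF-8."""
    return data.decode("utf-16-le").encode("utf-8")


def has_utf16_bom(data: bytes) -> bool:
    """True if the buffer starts with a UTF-16 byte order mark."""
    return data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
```

Python's plain "utf-16" codec prepends a byte order mark when encoding,
which is why the sketch names "utf-16-le" explicitly.
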

Michael

> --
> From:   [EMAIL PROTECTED][SMTP:[EMAIL PROTECTED]]
> Sent:   Friday, June 23, 2000 7:55 AM
> To: Unicode List
> Cc: Unicode List; [EMAIL PROTECTED]
> Subject: Re: Java, SQL, Unicode and Databases
>
>
>
> I think that this is also true for DB2 using UTF-8 as the database
> encoding. From an application perspective, MS SQL Server is the one that
> gives us the most trouble, because it doesn't support UTF-8 as a database
> encoding for char, etc.
> Joe
>
> Kenneth Whistler <[EMAIL PROTECTED]> on 06/22/2000 06:42:20 PM
>
> To:   "Unicode List" <[EMAIL PROTECTED]>
> cc:   [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED]
>       (bcc: Joe Ross/Tivoli Systems)
> Subject:  Re: Java, SQL, Unicode and Databases
>
>
>
>
> Jianping responded:
>
> >
> > Tex,
> >
> > Oracle doesn't have special requirement for datatype in JDBC driver if
> > you use UTF8 as database character set. In this case, all the text
> > datatype in JDBC will support Unicode data.
> >
>
> The same thing is, of course, true for Sybase databases using UTF-8
> as the database character set, accessing them through a JDBC driver.
>
> But I think Tex's question is aimed at the much murkier area
> of what the various database vendors' strategies are for dealing
> with UTF-16 Unicode as a datatype. In that area, the answers for
> what a cross-platform application vendor needs to do and for how
> JDBC drivers might abstract differences in database implementations
> are still unclear.
>
> --Ken
>
>
>






Re: Java, SQL, Unicode and Databases

2000-06-23 Thread Joe_Ross



Yes, version 7. It requires us to use a different data type (nchar) if we want
to store multilingual text as UTF-16. We want our applications to be database
vendor independent so that customers can use any database under the covers. If
all databases supported UTF-8 as an encoding for char, we could support
multilingual data in the same way for all vendors. As it is, we have to use a
different schema for MS SQL Server than we do for the others.
Joe


"Tex Texin" <[EMAIL PROTECTED]> on 06/23/2000 11:50:06 AM

To:   Joe Ross/Tivoli Systems@Tivoli Systems
cc:   Unicode List <[EMAIL PROTECTED]>, Hossein Kushki@IBMCA, Vladimir Dvorkin
  <[EMAIL PROTECTED]>, Steven Watt <[EMAIL PROTECTED]>
Subject:  Re: Java, SQL, Unicode and Databases




Joe,

Can you expand on this a bit more? Privately if you prefer.
Do you mean version 7 of MS SQL Server?

I assume that if it doesn't have UTF-8, it uses UTF-16. How does this,
being the storage encoding, become problematic?
tex


[EMAIL PROTECTED] wrote:
>
> I think that this is also true for DB2 using UTF-8 as the database encoding.
> From an application perspective, MS SQL Server is the one that gives us the
> most trouble, because it doesn't support UTF-8 as a database encoding for
> char, etc.
> Joe
>
> Kenneth Whistler <[EMAIL PROTECTED]> on 06/22/2000 06:42:20 PM
>
> To:   "Unicode List" <[EMAIL PROTECTED]>
> cc:   [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED]
>       (bcc: Joe Ross/Tivoli Systems)
> Subject:  Re: Java, SQL, Unicode and Databases
>
> Jianping responded:
>
> >
> > Tex,
> >
> > Oracle doesn't have special requirement for datatype in JDBC driver if
> > you use UTF8 as database character set. In this case, all the text
> > datatype in JDBC will support Unicode data.
> >
>
> The same thing is, of course, true for Sybase databases using UTF-8
> as the database character set, accessing them through a JDBC driver.
>
> But I think Tex's question is aimed at the much murkier area
> of what the various database vendors' strategies are for dealing
> with UTF-16 Unicode as a datatype. In that area, the answers for
> what a cross-platform application vendor needs to do and for how
> JDBC drivers might abstract differences in database implementations
> are still unclear.
>
> --Ken

--


Tex Texin                     Director, International Products

Progress Software Corp.       +1-781-280-4271
14 Oak Park                   +1-781-280-4655 (Fax)
Bedford, MA 01730  USA        [EMAIL PROTECTED]

http://www.progress.com       The #1 Embedded Database
http://www.SonicMQ.com        JMS Compliant Messaging - Best Middleware Award
http://www.aspconnections.com Leading provider in the ASP marketplace

Progress Globalization Program (New URL)
http://www.progress.com/partners/globalization.htm


Come to the Panel on Open Source Approaches to Unicode Libraries at
the Sept. Unicode Conference
http://www.unicode.org/iuc/iuc17






Re: Java, SQL, Unicode and Databases

2000-06-23 Thread Joe_Ross



I think that this is also true for DB2 using UTF-8 as the database encoding.
From an application perspective, MS SQL Server is the one that gives us the most
trouble, because it doesn't support UTF-8 as a database encoding for char, etc.
Joe

Kenneth Whistler <[EMAIL PROTECTED]> on 06/22/2000 06:42:20 PM

To:   "Unicode List" <[EMAIL PROTECTED]>
cc:   [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED] (bcc: Joe Ross/Tivoli
  Systems)
Subject:  Re: Java, SQL, Unicode and Databases




Jianping responded:

>
> Tex,
>
> Oracle doesn't have special requirement for datatype in JDBC driver if you
> use UTF8 as database character set. In this case, all the text datatype in
> JDBC will support Unicode data.
>

The same thing is, of course, true for Sybase databases using UTF-8
as the database character set, accessing them through a JDBC driver.

But I think Tex's question is aimed at the much murkier area
of what the various database vendors' strategies are for dealing
with UTF-16 Unicode as a datatype. In that area, the answers for
what a cross-platform application vendor needs to do and for how
JDBC drivers might abstract differences in database implementations
are still unclear.

--Ken