RE: Arabic Script converting Between different Code Pages

2000-06-27 Thread Michael Kaplan (Trigeminal Inc.)

If you are on a Microsoft platform and have support installed for the relevant
Arabic code page, then a simple MultiByteToWideChar call will take care of
it. Here are the code page numbers to use:

Arabic (ASMO 708):  708
Arabic (DOS):   720
Arabic (ISO):   28596
Arabic (Mac):   10004
Arabic (Windows):   1256

The Mac one seems to require Windows 2000, although perhaps MLang would allow
the conversion using its libraries if you have it installed (it comes with
IE4/IE5, etc.).
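
Something like the following sketch is all the conversion takes, using the
usual two-call MultiByteToWideChar pattern (this is just an illustration; the
helper name is made up):

    #include <windows.h>
    #include <string>

    // Sketch: convert bytes in an Arabic code page (default Windows-1256)
    // to UTF-16. Pass 708, 720, 28596, or 10004 instead if that code page
    // is installed on the machine.
    std::wstring ArabicToUtf16(const std::string& bytes, UINT codePage = 1256)
    {
        if (bytes.empty()) return std::wstring();

        // First call: ask how many UTF-16 code units are needed.
        int needed = MultiByteToWideChar(codePage, 0,
                                         bytes.data(), (int)bytes.size(),
                                         NULL, 0);
        if (needed == 0) return std::wstring();   // unsupported code page, etc.

        // Second call: do the actual conversion into the buffer.
        std::wstring wide(needed, L'\0');
        MultiByteToWideChar(codePage, 0,
                            bytes.data(), (int)bytes.size(),
                            &wide[0], needed);
        return wide;
    }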

Michael

 --
 From: Magda Danish (Unicode)[SMTP:[EMAIL PROTECTED]]
 Sent: Tuesday, June 27, 2000 9:44 AM
 To:   Unicode List
 Subject:  FW: Arabic Script converting Between different Code Pages
 
 
 
 -Original Message-
 From: Akil Fahd [mailto:[EMAIL PROTECTED]]
 Sent: Tuesday, June 27, 2000 8:47 AM
 To: [EMAIL PROTECTED]
 Subject: Arabic Script converting Between different Code Pages
 
 
 Is there a standard method for converting between various Arabic Code
 Pages (ISO 8859-6, Arabic Mac, ASMO 708, ISO 8859-1, Windows-1256) and
 Unicode 3.0?
 
 Akil Fahd
 
 



RE: Java, SQL, Unicode and Databases

2000-06-23 Thread Michael Kaplan (Trigeminal Inc.)

Microsoft is very COM-based for its actual data access methods, and COM
uses BSTRs, which are BOM-less UTF-16. Because of that, the actual storage
format of any database ends up being irrelevant, since it will be converted to
UTF-16 anyway.

Given that this is what the data layers do, performance is certainly better
if there does not have to be an extra call to the Windows
MultiByteToWideChar API to convert UTF-8 to UTF-16. So from a Windows
perspective, not only is it no trouble, it is also the best possible
solution!
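
That extra call looks something like this when the target is a COM BSTR (a
rough sketch only, with minimal error handling; the function name is
invented):

    #include <windows.h>
    #include <oleauto.h>

    // Sketch: the "extra" UTF-8 to UTF-16 conversion, producing the kind of
    // BOM-less UTF-16 BSTR that the COM data access layers traffic in.
    // The caller must free the result with SysFreeString.
    BSTR Utf8ToBstr(const char* utf8, int byteLen)
    {
        int needed = MultiByteToWideChar(CP_UTF8, 0, utf8, byteLen, NULL, 0);
        if (needed == 0) return NULL;

        BSTR bstr = SysAllocStringLen(NULL, needed);  // allocates needed WCHARs + null
        if (bstr == NULL) return NULL;

        MultiByteToWideChar(CP_UTF8, 0, utf8, byteLen, bstr, needed);
        return bstr;
    }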

In any case, I know plenty of web people who *do* encode their strings in
SQL Server databases as UTF-8 for web applications, since UTF-8 is their
preference. They are willing to take the hit of "converting themselves"
because when data is being read it is faster to go through no conversions at
all.

Michael

 --
 From: [EMAIL PROTECTED][SMTP:[EMAIL PROTECTED]]
 Sent: Friday, June 23, 2000 7:55 AM
 To:   Unicode List
 Cc:   Unicode List; [EMAIL PROTECTED]
 Subject:  Re: Java, SQL, Unicode and Databases
 
 
 
 I think that this is also true for DB2 using UTF-8 as the database
 encoding. From an application perspective, MS SQL Server is the one that
 gives us the most trouble, because it doesn't support UTF-8 as a database
 encoding for char, etc.
 Joe
 
 Kenneth Whistler [EMAIL PROTECTED] on 06/22/2000 06:42:20 PM
 
 To:   "Unicode List" [EMAIL PROTECTED]
 cc:   [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED] (bcc: Joe
 Ross/Tivoli
   Systems)
 Subject:  Re: Java, SQL, Unicode and Databases
 
 
 
 
 Jianping responded:
 
 
  Tex,
 
  Oracle doesn't have a special requirement for the datatype in the JDBC
  driver if you use UTF8 as the database character set. In this case, all
  the text datatypes in JDBC will support Unicode data.
 
 
  The same thing is, of course, true for Sybase databases using UTF-8
  as the database character set, accessing them through a JDBC driver.
 
 But I think Tex's question is aimed at the much murkier area
 of what the various database vendors' strategies are for dealing
 with UTF-16 Unicode as a datatype. In that area, the answers for
 what a cross-platform application vendor needs to do and for how
 JDBC drivers might abstract differences in database implementations
 are still unclear.
 
 --Ken
 
 
 



RE: UTF-8 BOM Nonsense

2000-06-23 Thread Michael Kaplan (Trigeminal Inc.)

Yes, I do feel this way, actually. :-)

The standard is quite clear in its language; one does not have to be a
semanticist to understand that:

1) XML *is* considered to be UTF-8 if there is no BOM, and UTF-16 if there
is one.
2) The encoding tag was added in recognition of the fact that #1 will not be
enough for some people.
3) An XML parser *must* be able to do #1, but #2 is not required.

So I would not discourage the tag, ever. But I would also never do XML that
was not UTF-8 or UTF-16. :-)
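
Rule #1 boils down to a check like this (a bare-bones sketch that ignores
external encoding information; the names are made up):

    #include <cstddef>

    // Sketch: with no BOM, assume UTF-8; with a UTF-16 BOM, it is UTF-16.
    // An encoding declaration, if present, can then refine the answer.
    enum XmlGuess { XML_UTF8, XML_UTF16_BE, XML_UTF16_LE };

    XmlGuess GuessXmlEncoding(const unsigned char* buf, size_t len)
    {
        if (len >= 2) {
            if (buf[0] == 0xFE && buf[1] == 0xFF) return XML_UTF16_BE;
            if (buf[0] == 0xFF && buf[1] == 0xFE) return XML_UTF16_LE;
        }
        return XML_UTF8;   // no BOM: UTF-8 by default
    }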

Michael


 --
 From: Robert A. Rosenberg[SMTP:[EMAIL PROTECTED]]
 Sent: Friday, June 23, 2000 11:34 AM
 To:   Michael Kaplan (Trigeminal Inc.)
 Cc:   Unicode List
 Subject:  RE: UTF-8 BOM Nonsense
 
 At 11:31 AM 06/22/2000 -0800, Michael Kaplan (Trigeminal Inc.) wrote:
 I do not believe that this will require it to be added to a standard, and
 this is a non-standard usage, but life is about dealing with things as
 they
 are (and this is how they are!).
 
 I assume that you also feel that the charset parm on a MIME Email Header
 (or HTML/XML header) is not needed and thus should be discouraged. The use
 of the BOM character at the start of a TEXT file serves the same purpose as
 the charset tag - It says "I am in UTF-8 format" (so you do not try to
 treat it as ISO-8859-x, CP1252, or some other encoding format).
 



RE: Java, SQL, Unicode and Databases

2000-06-23 Thread Michael Kaplan (Trigeminal Inc.)

The datatype *does* matter, in the sense that you would use the UTF-16 data
types (NTEXT, NCHAR and NVARCHAR) and access them with your favorite data
access method, which will convert as needed to whatever format it uses. You
will never know or care what the underlying engine stores.

The web site stuff will not work for you since you would have to do the
extra conversions to do the data mining, so you would probably go with plan
"A".

My general point is that OLE DB to an Oracle UTF-8 field and to a SQL Server
UTF-16 field returns the same type of data: UTF-16. So COM in this
case is hiding the differences.
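
For instance, a table declared with the N* types comes back through OLE DB as
UTF-16 no matter what the engine stores (the table and column names below are
invented, just to illustrate the point):

    // Sketch: a SQL Server table using the UTF-16 ("N") column types.
    // Selecting from these columns through OLE DB/ADO yields UTF-16 BSTR
    // data, just as a UTF-8 column does through the Oracle provider.
    const wchar_t* kCreateSample =
        L"CREATE TABLE SampleText ("
        L"  Id    INT PRIMARY KEY,"
        L"  Name  NVARCHAR(100),"   // UTF-16 string column
        L"  Notes NTEXT"            // long UTF-16 text column
        L")";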

Michael

 --
 From: [EMAIL PROTECTED][SMTP:[EMAIL PROTECTED]]
 Sent: Friday, June 23, 2000 2:27 PM
 To:   Michael Kaplan (Trigeminal Inc.)
 Cc:   Unicode List; [EMAIL PROTECTED]
 Subject:  RE: Java, SQL, Unicode and Databases
 
 
 
 Michael, are you saying that the data type (char or nchar) doesn't matter?
 Are
 you saying that if we just use UTF-16 or wchar_t interfaces to access the
 data
 all will be fine and we will be able to store multilingual data even in
 fields
 defined as char? Maybe things aren't as bad as I feared.
 
 With respect to the web applications you describe, do they store the UTF-8
 as
 binary data? This wouldn't work for us, since we want other data mining
 applications to be able to access the same data.
 
 Thanks,
 Joe
 
 "Michael Kaplan (Trigeminal Inc.)" [EMAIL PROTECTED] on 06/23/2000
 10:41:39 AM
 
 To:   Unicode List [EMAIL PROTECTED], Joe Ross/Tivoli Systems@Tivoli
 Systems
 cc:   Hossein Kushki@IBMCA
 Subject:  RE: Java, SQL, Unicode and Databases
 
 
 
 
 Microsoft is very COM-based for its actual data access methods and COM
 uses BSTRs that are BOM-less UTF-16. Because of that, the actual storage
 format of any database ends up irrelevant since it will be converted to
 UTF-16 anyway.
 
 Given that this is what the data layers do, performance is certainly
 better if there does not have to be an extra call to the Windows
 MultiByteToWideChar API to convert UTF-8 to UTF-16. So from a Windows
 perspective, not only is it no trouble, it is also the best possible
 solution!
 
 In any case, I know plenty of web people who *do* encode their strings in
 SQL Server databases as UTF-8 for web applications, since UTF-8 is their
 preference. They are willing to take the hit of "converting themselves"
 because when data is being read it is faster to go through no conversions
 at all.
 
 Michael
 
  --
  From:   [EMAIL PROTECTED][SMTP:[EMAIL PROTECTED]]
  Sent:   Friday, June 23, 2000 7:55 AM
  To: Unicode List
  Cc: Unicode List; [EMAIL PROTECTED]
  Subject: Re: Java, SQL, Unicode and Databases
 
 
 
  I think that this is also true for DB2 using UTF-8 as the database
  encoding. From an application perspective, MS SQL Server is the one that
  gives us the most trouble, because it doesn't support UTF-8 as a database
  encoding for char, etc.
  Joe
 
  Kenneth Whistler [EMAIL PROTECTED] on 06/22/2000 06:42:20 PM
 
  To:   "Unicode List" [EMAIL PROTECTED]
  cc:   [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED] (bcc: Joe
  Ross/Tivoli
Systems)
  Subject:  Re: Java, SQL, Unicode and Databases
 
 
 
 
  Jianping responded:
 
  
   Tex,
  
   Oracle doesn't have a special requirement for the datatype in the JDBC
   driver if you use UTF8 as the database character set. In this case, all
   the text datatypes in JDBC will support Unicode data.
  
 
  The same thing is, of course, true for Sybase databases using UTF-8
  as the database character set, accessing them through a JDBC driver.
 
  But I think Tex's question is aimed at the much murkier area
  of what the various database vendors' strategies are for dealing
  with UTF-16 Unicode as a datatype. In that area, the answers for
  what a cross-platform application vendor needs to do and for how
  JDBC drivers might abstract differences in database implementations
  are still unclear.
 
  --Ken
 
 
 
 
 
 



RE: Bengali: variants of same conjunct

2000-06-22 Thread Michael Kaplan (Trigeminal Inc.)

  Thus, since people who write the language send both, [cut]
 
 Do you mean that Tamil writers *purposely* use both the "ancient" and the
 "modern" forms in the same document?
 What is the intent?
 
Yes, that is what I am saying. If you go to several of the Tamil resource
sites on the web, you can see both of them used, often in the same
documents. This is VERY easy to do with the hack fonts, and significantly more
difficult if you are using Unicode-enabled fonts.


 And I believe this is entirely a rendering problem that is (far) outside
 Unicode's scope.
 
I do not see how, if BOTH forms are in use and one form is not renderable in
a font that is Unicode compliant, this would NOT be considered a Unicode
issue. It is crucial that language as it is used should be possible to render
with Unicode, should it not? The ligatures you mention do not really fall into
the same category as the Tamil case, since all of them can be rendered using
the 3.0 (or even the 2.0!) standard.

I do know that the Tamil Nadu government has specific issues with the Unicode
standard; is this not one of them? Or do they prefer only the usage
outlined in the standard, in order to encourage people to use it? And would
this then be a case of the standard being more involved in politics than
might be good?

Michael



RE: UTF-8 BOM Nonsense

2000-06-22 Thread Michael Kaplan (Trigeminal Inc.)

I agree, Gary.

Windows 2000 Notepad, however, does not agree and writes one.

Since Notepad in prior versions of Windows was in fact the de facto standard
HTML editor (g), it is clearly a program to be reckoned with. People should
be aware of the fact that there are going to be MANY files out there
that are UTF-8 and do have a BOM.

I do not believe that this will require it to be added to a standard, and
this is a non-standard usage, but life is about dealing with things as they
are (and this is how they are!).
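
Skipping the signature those files carry is trivial; a sketch like this is
enough for a reader that does not expect it (the helper names are invented):

    #include <cstddef>

    // Sketch: detect and skip the EF BB BF signature that tools like
    // Notepad put at the front of UTF-8 text files.
    bool HasUtf8Bom(const unsigned char* buf, size_t len)
    {
        return len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF;
    }

    const unsigned char* SkipUtf8Bom(const unsigned char* buf, size_t len)
    {
        return HasUtf8Bom(buf, len) ? buf + 3 : buf;   // rest is plain UTF-8
    }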

Michael

 --
 From: Gary L. Wade[SMTP:[EMAIL PROTECTED]]
 Sent: Thursday, June 22, 2000 9:08 AM
 To:   Unicode List
 Subject:  UTF-8 BOM Nonsense
 
 Please!
 
 After hundreds of e-mails on this topic, let it die!
 
 The BOM is only useful with UTF-16 or UCS-4 characters.
 
 There is no such thing as byte ordering when each character is a byte or
 a multibyte sequence with a well-documented ordering denoting how to
 interpret this!  For further reference, turn to page 20 in the Unicode
 3.0 book and let us get back to more important things, such as how to
 represent the price of tea in China!  ;-)
 -- 
 Gary L. Wade
 Product Development Consultant
 DesiSoft Systems | Voice:   214-642-6883
 9619 E. Valley Ranch Parkway | Fax: 972-506-7478
 Suite 2125   | E-Mail:  [EMAIL PROTECTED]
 Irving, TX 75063 |
 



RE: Bengali: variants of same conjunct

2000-06-21 Thread Michael Kaplan (Trigeminal Inc.)

Thus far it is something that has been implemented in the fonts rather than
anywhere else; for example, there are several ligatures in Tamil that will
display one way with the Latha font and the other way with Monotype Tamil
Arial (the way set out in Unicode 3.0 is the one done in the latter).

Thus, since people who write the language send both, they have a not entirely
satisfying workaround: specify different fonts for different parts of the
document.

Michael

 --
 From: Christopher John Fynn[SMTP:[EMAIL PROTECTED]]
 Sent: Wednesday, June 21, 2000 11:17 AM
 To:   Unicode List
 Cc:   Arijit Upadhyay
 Subject:  Re: Bengali: variants of same conjunct
 
 This is something which I think needs to be *specified* in the standard.
 This is
 an instance where it should not be left up to implementers to invent their
 own
 character combinations to force one kind of ligature or another.
 
 - Chris
 
 - Original Message -
 From: Arijit Upadhyay [EMAIL PROTECTED]
 To: Unicode List [EMAIL PROTECTED]
 Sent: Wednesday, June 21, 2000 5:37 PM
 Subject: Re: Bengali: variants of same conjunct
 
 
  Thanks
 
  Everybody, now I am in possession of the replies and I find that you have
  found it just right. The gif Abdul has put up is the exact thing I am
  referring to.
 
   http://www.btinternet.com/~abdulmalik/nyu.gif
 
  Now, if I cannot implement the available normal options, and Abdul has
  offered a number of options, what in your opinion would be the best
  possible method to implement this type of variant in the present scenario?
  To be honest, I am basically a font/web designer and am just learning
  about Unicode enabling of Indic fonts.
 
  Thanks again,
  Arijit
 
 



RE: Linguistic precedence [was: (TC304.2313) AND/OR:

2000-06-16 Thread Michael Kaplan (Trigeminal Inc.)

 Well, "Gre" does not appear between "Deu" and "Esp" in any European
 language, but "Gre" does appear between "Ger" and "Spa", so I am assuming
 English names were being used here?
 
Michael

 --
 From: Robert A. Rosenberg[SMTP:[EMAIL PROTECTED]]
 Sent: Thursday, June 15, 2000 1:27 PM
 To:   Unicode List
 Cc:   Unicode List
 Subject:  RE: Linguistic precedence [was: (TC304.2313) AND/OR:
 
 At 07:53 AM 06/15/2000 -0800, Michael Kaplan (Trigeminal Inc.) wrote:
 Eventually someone will have a language name that does not fit,
 or a language like German will insist on sorting sooner, under Deutsch
 rather than under German, etc. (which I personally think makes more sense
 than making a locale take someone's translation of their language name,
 FWIW).
 
 Since it was stated that Greek was displayed between German and Spanish,
 I'd assume that German was Deutsch since Spanish is Espanol (not sure if
 that "n" is "n" or "ñ", as well as if my spelling is correct).
 



RE: Linguistic precedence [was: (TC304.2313) AND/OR: antediluvian

2000-06-15 Thread Michael Kaplan (Trigeminal Inc.)


  On the cover of my French driver's license, it says ``Driving
  license'' in 10 languages (all the EU languages at the time it was
  printed).  The titles are ordered alphabetically by the name of the
  language in the language itself.  The Portuguese don't seem to mind.
  
  (Fair enough, this only works because all bar one of the EU languages
  use the Latin script, and Greek has a standard transliteration.  It
  comes between German and Spanish.)
 
Sounds great...but *which* alphabetic ordering are we talking
about? :-)

Actually, in the case of the 10 EU languages being referred to, I do not
think there would be any dissension as to the order, would there be?
Admittedly, if Lithuania were in the EU and there were countries that started
with a "Y" there as well, there would be problems with people who did not
understand the order and thought that the "Y" countries were jumping the
queue, but AFAIK there is no real difference except for Greek, and no one
disagrees with its placement.

Michael





RE: Linguistic precedence [was: (TC304.2313) AND/OR: antediluvian

2000-06-15 Thread Michael Kaplan (Trigeminal Inc.)

I admit to nitpicking because in this particular case, the language names,
we may be just lucky so that there are no collation conflicts.

I believe this is an accurate statement... we ARE lucky, so far.

But believing that there is a collation order that works across all the
European languages is a very hopeless fallacy.

I agree. Eventually someone will have a language name that does not fit,
or a language like German will insist on sorting sooner, under Deutsch rather
than under German, etc. (which I personally think makes more sense than
making a locale take someone's translation of their language name, FWIW).
People can always find something to fight about; sometimes we help them by
not frustrating them and making them look quite so hard. :-)

 (Has somebody written a comprehensive collection of all these collation
 problems?)
 
Well, the only ones I regularly deal with are the Traditional Spanish sort
and the Lithuanian sort; there are undoubtedly others. The accent/grave
issues are easier, and few people would really object to sorts that place the
characters with diacriticals right after the base characters. So it is
really only DIFFERENT sorts that make people upset.

 And, in future, Lithuania may be a member in the EU.  (Where *do* they
 sort
 'Y', by the way?)
 
They sort "Y" after "I". The second i18n bug in a software product that I
ever had to fix was related to this issue. :-)
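
That quirk is easy to picture with a toy comparator that slots "Y" in right
after "I" (a sketch only, nothing like a real collation implementation; the
names are invented):

    #include <string>
    #include <algorithm>

    // Sketch: an ASCII-only toy ordering in which "Y" collates immediately
    // after "I" instead of near the end of the alphabet. A real
    // implementation would use the locale's collation tables.
    static int Rank(char c)
    {
        char up = (c >= 'a' && c <= 'z') ? (char)(c - 'a' + 'A') : c;
        if (up == 'Y') return 2 * 'I' + 1;   // just after I, before J
        return 2 * up;
    }

    bool LessWithYAfterI(const std::string& a, const std::string& b)
    {
        size_t n = std::min(a.size(), b.size());
        for (size_t i = 0; i < n; ++i) {
            int ra = Rank(a[i]), rb = Rank(b[i]);
            if (ra != rb) return ra < rb;
        }
        return a.size() < b.size();
    }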

Michael