Re: per-character "stories" in a database

2003-03-14 Thread jameskass
William Overington wrote,

> I find it strange that the Unicode Standard does not codify the 
> ligatures which can be produced with the languages of the Indian 
> subcontinent at display time using specific sequences of regular 
> Unicode characters so that someone skilled in the art of font design 
> may design a font from the code charts.

"Codify" means to arrange systematically, and no one should think
that William is suggesting that such glyphs be encoded.

Quoting from Unicode's page at:

http://www.unicode.org/pending/proposals.html


The Unicode Consortium is interested in obtaining information on known 
glyphs, minor variants, precomposed characters (including ligatures, 
conjunct consonants, and accented characters) and other such 
"non-characters," mainly for cataloging and research purposes; 
however, they are generally not acceptable for character proposals.  


As long as Unicode is obtaining this information, it would be helpful
if it could be published.

Best regards,

James Kass



Unicode Public Review Issues update

2003-03-14 Thread Rick McGowan
The Unicode "Public Review Issues" page has been updated today.

Highlights:

Closed issue #1 (Language tag deprecation) without any change.
Updated some deadlines on other issues to June 1, 2003.
Added a document for issue #7 (tailored normalizations).
Added an issue #8 regarding properties of math digits.

Regards,
Rick McGowan
Unicode, Inc.




Re: per-character "stories" in a database

2003-03-14 Thread Michael Everson
At 16:08 + 2003-03-14, William Overington wrote:

I find it strange that the Unicode Standard does not codify the 
ligatures which can be produced with the languages of the Indian 
subcontinent at display time using specific sequences of regular 
Unicode characters so that someone skilled in the art of font design 
may design a font from the code charts.
Perhaps you should reread the Unicode Standard, then. It is out of scope.

Having said that, you could probably commission someone like me to 
provide such a list. I do have an exhaustive list of Devanagari 
conjuncts in one of my files, for instance.

No one would undertake such work for free, methinks.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com


Re: New document.

2003-03-14 Thread Kenneth Whistler

> Otto Stolz wrote:
> 
> >
> > The two scans under
> >   http://www.rz.uni-konstanz.de/Antivirus/tests/li.png
> >   http://www.rz.uni-konstanz.de/Antivirus/tests/re.png
> > are from the authoritative (until July 1996) book on German
> > orthography: Duden "Rechtschreibung der deutschen Sprache
> > und der Fremdwörter" / hrsg. von d. Dudenred. auf d. Grundlage
> > d. amtl. Rechtschreibregeln. [Red.Bearb.: Werner Scholze-
> > Stubenrecht unter Mitw. von Dieter Berger ...]. - 19., neu bearb.
> > u. erw. Aufl. ISBN: 3-411-20900-3.
> >
> > Best wishes,
> >   Otto Stolz 
> 
> Could you point out which symbols in those two images need to be proposed,
> either by drawing a red circle on the image or by telling us the surrounding text?
> Thanks

> >   http://www.rz.uni-konstanz.de/Antivirus/tests/li.png

* geboren   [born] -- asterisk, already encoded
(*) außerehelich geboren  [born out of wedlock] -- parentheses and
  asterisk, already encoded
{ }* tot geboren [born dead]  -- dagger and asterisk, already encoded
*{ } am Tag der Geburt gestorben [died on day of birth] -- asterisk
  and dagger, already encoded
{ } getauft [baptized] -- could use U+3030 WAVY DASH

> >   http://www.rz.uni-konstanz.de/Antivirus/tests/re.png

{ } verlobt [engaged] -- could use an existing circle character
{ } verheiratet [married] -- needs new symbol
{ } geschieden [divorced] -- needs new symbol
{ } außereheliche Verbindung [common law marriage ?] -- needs new symbol
{ } gestorben [died]  -- dagger, already encoded
{ } gefallen [killed (in battle?)]  -- needs new crossed swords symbol
{ } begraben [buried] -- needs new coffin symbol
{ } eingeäschert [cremated] -- needs new urn symbol
 (although I suppose people could use one of the 29
  new Linear B vessel ideograms to indicate just what
  kind of urn they were buried in. ;-) )

--Ken




Re: pinyin syllable `rua'

2003-03-14 Thread Yung-Fong Tang
Which pinyin system is "rua" in?

I use Simplified Chinese Windows XP, and if I switch to the Full Spell (??) Simplified 
Chinese IME and type "rua", I get "挼" (read this email in UTF-8), 
which is U+633C.
I am not sure that is correct. At least, as a native Mandarin speaker, 
that sound is not natural for me at all. It could be a table mistake in 
the software. It sounds like Japanese :)

Werner LEMBERG wrote:

Some lists of pinyin syllables contain `rua', but I actually can't
find any Chinese character with this name.
Does it exist at all?  Or is it just there for completeness of pinyin?

   Werner

 






Re: Unicode 4.0 chapter headings and numbering.

2003-03-14 Thread Kenneth Whistler
William Overington asked:

> I wonder if you could please say whether the Unicode 4.0 book will have the
> same chapter headings and numbering as the Unicode 3.0 book?

They will be largely similar -- and identical for Chapters 1 through
5 -- but there are various reorganizations in the latter part
of the book, to account for all the additions to the standard.

> I would like
> simply to refer readers to Chapter 9 South and Southeast Asian Scripts of
> the Unicode Standard at the http://www.unicode.org webspace.

For Unicode 4.0:

Chapter 9: South Asian Scripts
Chapter 10: Southeast Asian Scripts

> 
> Also, if unchanged, is that a matter of continuing stability for future
> issues as well, or is it just for Unicode 4.0 please?

For book references, you can expect changes in the future as
well. There are no guarantees, from version to version, of the
exact contents and scope of each chapter in the book. For
people who want stability of references, you should simply
point to the standard as a whole, or make references such
as "See the discussion of the Devanagari script in the
Unicode Standard, online at ..." Then let your readers navigate
the online version, via its Table of Contents and/or Index,
to find the section they are interested in.

--Ken




Unicode Conference Reminder

2003-03-14 Thread Lisa Moore




Folks,

Getting down to the wire, time to get your registration in.  Join us in
Prague!

Lisa


*
 Register now! > Just 1 week to go! > Register now! > Just 1 week to go!
*
NEWSFLASH: Bring your books to the Conference! (see below)

 Twenty-third Internationalization and Unicode Conference (IUC23)
  Unicode, Internationalization, the Web: The Global Connection
http://www.unicode.org/iuc/iuc23
   March 24-26, 2003
Prague, Czech Republic
*

NEWS

 > Meet industry leaders in Prague and discuss the hot topics for
   internationalization and Unicode in 2003! Check out the updated
   Conference program ( http://www.unicode.org/iuc/iuc23/program.html )
   which includes abstracts of talks and speakers' biographies.
   Register while there's still time via the Conference Web site at
   http://www.unicode.org/iuc/iuc23/registration.html .

 > Sign up for the Workshop on Managing Localization Projects, organized
   by XenCraft, and taking place in the same venue on 27 March -- See:
   http://www.unicode.org/iuc/iuc23

 > Bring your books to the Prague Internationalization and Unicode
   Conference! Internationalization industry pundits Don DePalma, Bill
   Hall, and Michael Kaplan will be available for book signings at noon
   in the SHOWCASE Exhibition area on Tuesday March 25. Bring your copy
   of their books or articles to the conference to have them
   autographed, or pick up their latest works right at the SHOWCASE.
   This is your chance to meet and greet these authors.

 > Attend the new Showcase to find out more about products supporting
   the Unicode Standard, and products and services that can help you
   globalize/localize your software, documentation and Internet content.

 > Be an Exhibitor! Show off your product at the premier technical
   conference worldwide for both software and Web internationalization.
   See: http://www.unicode.org/iuc/iuc23/showcase.html


CONFERENCE PROGRAM

   The conference features tutorials, lectures, and panel discussions
   that provide coverage of standards, best practices, and recent
   advances in the globalization of software and the Internet.
   See the program: http://www.unicode.org/iuc/iuc23/program.html


GLOBAL COMPUTING SHOWCASE

   For the first time, we will have an Exhibitors' track as part of the
   Conference, in addition to the updated Showcase Exhibition.

   For more information, please visit the Web site at:
   http://www.unicode.org/iuc/iuc23/showcase.html

   Showcase participants include:

   Agfa Monotype Corporation
   Alchemy Software Development Ltd.
   Basis Technology Corporation
   Moravia IT
   Multilingual Computing, Inc.


CONFERENCE VENUE & ACCOMMODATION

The Conference will take place in lovely, historic Prague:

 Marriott Prague Hotel
 V Celnici 8
 Prague, 110 00
 Czech Republic

 Tel:  (+420 2) 2288 
 Fax:  (+420 2) 2288 8889

 For a selection of hotels in the area see:
 http://www.unicode.org/iuc/iuc23/accommodation.html

 For travel information see:
 http://www.unicode.org/iuc/iuc23/travel.html


CONFERENCE SPONSORS

   Agfa Monotype Corporation
   Basis Technology Corporation
   Microsoft Corporation
   Moravia IT
   Sun Microsystems, Inc.
   World Wide Web Consortium (W3C)


CONFERENCE MANAGEMENT

   Global Meeting Services Inc.
   8949 Lombard Place, #416
   San Diego, CA 92122, USA

   Tel: +1 858 638 0206 (voice)
+1 858 638 0504 (fax)

   Email: [EMAIL PROTECTED]
  or: [EMAIL PROTECTED]



THE UNICODE CONSORTIUM

The Unicode Consortium was founded as a non-profit organization in
1991. It is dedicated to the development, maintenance and promotion
of The Unicode Standard, a worldwide character encoding.  The Unicode
Standard encodes the characters of the world's principal scripts and
languages, and is code-for-code identical to the international
standard ISO/IEC 10646. In addition to cooperating with ISO on the
future development of ISO/IEC 10646, the Consortium is responsible
for providing character properties and algorithms for use in
implementations. Today the membership base of the Unicode Consortium
includes major computer corporations, software producers, database
vendors, research institutions, international agencies and various
user groups.

For further information on the Unicode Standard, visit the Unicode
Web site at http://www.unicode.org or e-mail <[EMAIL PROTECTED]>

   *  *  *  *  *

Unicode(r) and the Unicode logo are registered trademarks of Unicode,
Inc. Used with permission.







Re: New document.

2003-03-14 Thread Yung-Fong Tang


Otto Stolz wrote:

The two scans under
  http://www.rz.uni-konstanz.de/Antivirus/tests/li.png
  http://www.rz.uni-konstanz.de/Antivirus/tests/re.png
are from the authoritative (until July 1996) book on German
orthography: Duden "Rechtschreibung der deutschen Sprache
und der Fremdwörter" / hrsg. von d. Dudenred. auf d. Grundlage
d. amtl. Rechtschreibregeln. [Red.Bearb.: Werner Scholze-
Stubenrecht unter Mitw. von Dieter Berger ...]. - 19., neu bearb.
u. erw. Aufl. ISBN: 3-411-20900-3.
Best wishes,
  Otto Stolz 
Could you point out which symbols in those two images need to be proposed,
either by drawing a red circle on the image or by telling us the surrounding text?
Thanks




Re: Characters that rotate in vertical text

2003-03-14 Thread Yung-Fong Tang




I think that is a hard problem.

First of all, take a look at 
http://www.unicode.org/Public/4.0-Update/UCD-4.0.0d5b.html
and find the  one

Second, anything that needs Symmetric Swapping in Bidi probably needs to
change in the vertical form as well. (If a character needs to change with the
horizontal direction, it will probably need to change in vertical layout too.)

However, this is not that easy. First, rotation is optional for some
characters. For example, if you have the English string "Book" in your
vertical text, should the software rotate it or not?
It could rotate the whole string by 90 degrees, or it could display it as
B
o
o
k

Both are "right". It is up to the application domain to decide how to display
it, which means it needs "a higher level protocol". Look at the example in
section 3.3 of http://www.w3.org/TR/2003/WD-css3-text-20030226/

Second, it also depends on the people who design the glyphs. For example, U+FF0C
in a Traditional font has the comma in the central position, which means
it does not need to change in vertical layout. However, Japanese users
think that position is funny for horizontal text and won't accept it, so
the U+FF0C glyph in a Japanese font is placed in the lower left corner.
In that case, it needs a different glyph (note: not a different Unicode character,
but a different glyph ID) to represent it in vertical layout. That is
why, on Windows systems, most such fonts have an "@ variant" version.
That font is used for vertical layout, and the same Unicode character maps to a
different glyph ID (so the comma shows up in the upper left position, center position,
or upper right position [I am not a typographer, so I am not sure which one
they choose, but it is one of them]).

More info about Vertical text could be found at the following places
1. page 342-365, Chapter 7, Typography, CJKV Information Processing, Ken
Lunde, O'Reilly, ISBN 1-56592-224-7,  http://www.oreilly.com/catalog/cjkvinfo/
2. page 192-193, Developing International Software- 2nd Edition, Dr. International,
Microsoft Press, ISBN 0-7356-1583-7 http://www.microsoft.com/mspress/books/5717.asp
They may have an online copy on the msdn




Rick Cameron wrote:

  Hi, all

  When Japanese (and, I imagine, other East Asian languages) is written
  vertically, certain characters are rotated by 90 degrees. Examples: the
  parenthesis-like characters in the block at U+3000, and U+30FC.

U+3000 is a SPACE character; I don't think it needs to be rotated, since it
should show as blank anyway.

 
 
  Does the Unicode character database include information on which characters
  are rotated in vertical text? If not, does anyone know of a definitive list?

  Thanks

  - rick cameron
  





Re: per-character "stories" in a database (derives from Re: geometric shapes)

2003-03-14 Thread John Hudson
At 08:08 AM 3/14/2003, William Overington wrote:

I find it strange that the
Unicode Standard does not codify the ligatures which can be produced with
the languages of the Indian subcontinent at display time using specific
sequences of regular Unicode characters so that someone skilled in the art
of font design may design a font from the code charts.
Someone skilled in the art of font design doesn't need to see all possible 
ligatures reproduced in a code chart. Trust me, anyone designing typefaces 
for Indic scripts understands the difference between a character and a glyph.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]
It is necessary that by all means and cunning,
the cursed owners of books should be persuaded
to make them available to us, either by argument
or by force.  - Michael Apostolis, 1467



Re: New document.

2003-03-14 Thread Mark Davis
The other point is that the images need to be sufficient to show that they
are photocopies of real documents, with citations for the sources.

Mark

[EMAIL PROTECTED]
IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
(408) 256-3148
fax: (408) 256-0799

- Original Message -
From: "Rick McGowan" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, March 14, 2003 09:08
Subject: Re: New document.


> Otto Stolz pointed to:
>http://www.rz.uni-konstanz.de/Antivirus/tests/li.png
>http://www.rz.uni-konstanz.de/Antivirus/tests/re.png
>
> Yes, some of those are worth proposing, I would say.
>
> The in-line textual usage is pretty well given by the two images. But
> someone needs to investigate the meanings and supply reasonable character
> names. Then fill out the proposal summary form and submit it.
>
> If someone does a proposal, that would be nice. It won't be me, since I
> don't have any access to the German sources and don't read German.
>
> For reference, Michael Everson recently did a very nice, succinct proposal
> for Guarani and Austral signs. You could use that proposal as a model. It
> was reviewed (and accepted) by UTC last week. Here is a pointer to the
> Guarani proposal:
>
> http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2579.pdf
>
> Cheers,
> Rick
>
>




Re: Unicode character transformation through XSLT

2003-03-14 Thread Markus Scherer
Nooo - Java's old "UTF" functions do not process UTF-8! They are there for String serialization, a 
Java-internal format.
Use the Java Reader/Writer classes instead of these old ones!

See the Java tutorials on Internationalization:
http://java.sun.com/docs/books/tutorial/i18n/text/convertintro.html
http://java.sun.com/docs/books/tutorial/i18n/text/index.html
http://java.sun.com/docs/books/tutorial/i18n/index.html
See the descriptions of readUTF() functions (highlighting with ***):

http://java.sun.com/j2se/1.4/docs/api/java/io/DataInputStream.html#readUTF(java.io.DataInput)

"Reads from the stream in a representation of a Unicode character string encoded in ***Java modified 
UTF-8*** format; this string of characters is then returned as a String. The details of the 
***modified UTF-8*** representation are exactly the same as for the readUTF  method of DataInput."

http://java.sun.com/j2se/1.4/docs/api/java/io/DataInput.html#readUTF()

Java's *modified* UTF-8 in its "UTF" functions resembles CESU-8, and writes U+0000 with two bytes 
instead of one, as far as I remember.
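
To make the difference concrete, here is a minimal C sketch of how a single 16-bit Java char is encoded in modified UTF-8, as I read the DataOutput.writeUTF documentation. The function name is hypothetical; it is not a Java or ICU API, and it handles one 16-bit unit at a time, so a supplementary character arrives as two surrogate units of three bytes each (which is where the CESU-8 resemblance comes from).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes one 16-bit unit the way Java's modified
 * UTF-8 does. out[] must have room for 3 bytes. Returns bytes written. */
size_t modified_utf8_from_unit(uint16_t unit, uint8_t out[3])
{
    assert(out != NULL);
    if (unit >= 0x0001 && unit <= 0x007F) {
        out[0] = (uint8_t)unit;              /* plain ASCII, one byte */
        return 1;
    }
    if (unit <= 0x07FF) {
        /* Covers 0x0080..0x07FF and also U+0000, which becomes C0 80
         * instead of the single 00 byte that standard UTF-8 uses. */
        out[0] = (uint8_t)(0xC0 | (unit >> 6));
        out[1] = (uint8_t)(0x80 | (unit & 0x3F));
        return 2;
    }
    /* 0x0800..0xFFFF, including surrogate units, take three bytes. */
    out[0] = (uint8_t)(0xE0 | (unit >> 12));
    out[1] = (uint8_t)(0x80 | ((unit >> 6) & 0x3F));
    out[2] = (uint8_t)(0x80 | (unit & 0x3F));
    return 3;
}
```

A strict UTF-8 decoder should reject both the C0 80 form and the three-byte surrogate forms, which is why these functions are not a substitute for a real UTF-8 converter.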

markus

Yung-Fong Tang wrote:
what is rsResult? Blob?
you probably need to use
BufferedInputStream

and

DataInputStream

 to pipe the InputStream
and use readChar or readUTF in the InputStream interface instead.
See http://www.webdeveloper.com/java/java_jj_read_write.html and
http://java.sun.com/j2se/1.4/docs/api/java/io/DataInputStream.html#readUTF() 
for more info.
--
Opinions expressed here may not reflect my company's positions unless otherwise noted.



RE: Need encoding conversion routines

2003-03-14 Thread Marco Cimarosti
askq1 askq1 wrote:
> >From: "Pim Blokland" <[EMAIL PROTECTED]>
> 
> >However, you have said this is not what you want!
> >So what is it that you do want?
> 
> I want c/c++ code that will give me UTF8 byte sequence 
> representing a given code-point,
> UTF16 16 bits sequence representing a given 
> code-point, UTF32 
> 32 bits sequence representing a given code-point.
> 
> e.g.
> 
> UTF8_Sequence CodePointToUTF8(Unichar codePoint)
> {
> //I need this code
> }
> 
> UTF16_Sequence CodePointToUTF16(Unichar codePoint)
> {
> //I need this code
> }
> 
> UCS2_Sequence CodePointToUCS2(Unichar codePoint)
> {
> //I need this code
> }

Hint:

#include "ConvertUTF.h"
typedef UTF32 Unichar;
typedef UTF8  UTF8_Sequence  [4 + 1];
typedef UTF16 UTF16_Sequence [2 + 1];
typedef UTF16 UCS2_Sequence  [1 + 1];
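
For readers who would rather see the bit manipulation than link against ConvertUTF, the following is a minimal, self-contained sketch of the UTF-8 case. It deliberately uses plain stdint types instead of the typedefs above, and the function name is hypothetical (it is not part of ConvertUTF.h). It assumes the caller supplies a 4-byte buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes one code point as UTF-8 into out[].
 * Returns the number of bytes written (1..4), or 0 if cp is not a
 * Unicode scalar value (a surrogate, or above 0x10FFFF). */
size_t code_point_to_utf8(uint32_t cp, uint8_t out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                         /* 0xxxxxxx */
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {                        /* 110xxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {      /* surrogates are not encodable */
        return 0;
    }
    if (cp < 0x10000) {                      /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                    /* 11110xxx and three trail bytes */
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                                /* beyond the Unicode range */
}
```

Called with 0x4321, this writes the three bytes E4 8C A1 and returns 3.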

_ Marco



Re: Need encoding conversion routines

2003-03-14 Thread Markus Scherer
Let's try this:

ICU has C header files with macros for code point handling in UTF-8/16 strings. See the utf8.h and 
utf16.h headers (together with utf.h) in ICU's source tree at source/common/unicode/.

http://oss.software.ibm.com/icu/download/
http://oss.software.ibm.com/cvs/icu/icu/source/common/unicode/
There is also a utf32.h header, but that is empty now. I redesigned the set of macros last year to 
simplify and improve them a bit.

Specifically, see below.

(Note that the UTF-8 macros [except for the "unsafe" ones] handle the complicated cases in functions 
that are called from inside the macros. See source/common/utf_impl.c . Safe UTF-8 handling requires 
a lot of error checks.)

askq1 askq1 wrote:
I want c/c++ code that will give me UTF8 byte sequence representing a 
given code-point, UTF16 16 bits sequence representing a given 
code-point, UTF32 32 bits sequence representing a given code-point.

e.g.

UTF8_Sequence CodePointToUTF8(Unichar codePoint)
Use U8_APPEND().
http://oss.software.ibm.com/icu/apiref/utf8_8h.html#a12
To read a code point from UTF-8, use U8_NEXT()
http://oss.software.ibm.com/icu/apiref/utf8_8h.html#a10
or U8_GET() etc.

UTF16_Sequence CodePointToUTF16(Unichar codePoint)
U16_APPEND()
http://oss.software.ibm.com/icu/apiref/utf16_8h.html#a16
To read a code point from UTF-16, use U16_NEXT()
http://oss.software.ibm.com/icu/apiref/utf16_8h.html#a16
or U16_GET() etc.

UCS2_Sequence CodePointToUCS2(Unichar codePoint)
For UCS-2, the best strategy (in my opinion) is to treat it exactly the same as UTF-16. Most people 
mean UTF-16 when they talk about UCS-2 or generally about "16-bit Unicode".

If you do want to distinguish them anyway, then this is trivial:
if(0<=codePoint && codePoint<=0xffff) {
    /* cast codePoint to a 16-bit type and emit */
} else {
    /* error: not representable in UCS-2 */
}
Similarly, UTF-32 is trivial as well - it just stores each code point value in a 32-bit integer 
unit. Unicode code points are values 0..0x10ffff.
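
If pulling in the ICU headers is not an option, the UTF-16 and UCS-2 cases can be sketched in a few lines of plain C. This is an illustration only: the function names are hypothetical and are not ICU macros, and the UCS-2 helper simply mirrors the range check described above without deciding what to do about surrogate code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes one code point as UTF-16 into out[].
 * Returns the number of 16-bit units written (1 or 2), or 0 if cp is a
 * surrogate or lies above 0x10FFFF. */
size_t code_point_to_utf16(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;
    }
    if (cp < 0x10000) {                      /* BMP: one unit, value unchanged */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                           /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* lead surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* trail surrogate: low 10 bits */
    return 2;
}

/* Hypothetical helper: UCS-2 can only hold code points up to 0xFFFF.
 * Returns 1 on success, 0 if cp does not fit. */
size_t code_point_to_ucs2(uint32_t cp, uint16_t out[1])
{
    assert(out != NULL);
    if (cp > 0xFFFF) {
        return 0;
    }
    out[0] = (uint16_t)cp;
    return 1;
}
```

For example, U+1D11E comes out as the surrogate pair D834 DD1E in UTF-16, while the UCS-2 helper reports an error for it.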

See also http://oss.software.ibm.com/cvs/icu/~checkout~/icu/source/samples/ustring/ustring.cpp

I hope this helps - best regards,
markus



Re: Need encoding conversion routines

2003-03-14 Thread Edward H Trager

If you need a utility to do these conversions for you right away, take a
look at "uniconv" which is part of Gaspar Sinai's Yudit unicode editor.
This is an Open Source program, so you can look at the code too:

http://www.yudit.org

For C++ and Java libraries, check out IBM's International Components for
Unicode, which is also under an Open Source license:

http://oss.software.ibm.com/icu/

ICU has a lot more to offer besides just charset and encoding conversion.
Definitely worth taking a look at!

On Fri, 14 Mar 2003, askq1 askq1 wrote:

> I want c/c++ code that will give me UTF8 byte sequence representing a given
> code-point, UTF16 16 bits sequence representing a given code-point, UTF32
> 32 bits sequence representing a given code-point.




Re: New document.

2003-03-14 Thread Rick McGowan
Otto Stolz pointed to:
   http://www.rz.uni-konstanz.de/Antivirus/tests/li.png
   http://www.rz.uni-konstanz.de/Antivirus/tests/re.png

Yes, some of those are worth proposing, I would say.

The in-line textual usage is pretty well given by the two images. But  
someone needs to investigate the meanings and supply reasonable character  
names. Then fill out the proposal summary form and submit it.

If someone does a proposal, that would be nice. It won't be me, since I  
don't have any access to the German sources and don't read German.

For reference, Michael Everson recently did a very nice, succinct proposal  
for Guarani and Austral signs. You could use that proposal as a model. It  
was reviewed (and accepted) by UTC last week. Here is a pointer to the  
Guarani proposal:

http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2579.pdf

Cheers,
Rick



Re: pinyin syllable `rua'

2003-03-14 Thread Thomas Chan
On Fri, 14 Mar 2003, Werner LEMBERG wrote:

> Some lists of pinyin syllables contain `rua', but I actually can't
> find any Chinese character with this name.

Not all words or utterances necessarily have written forms, but...

 
> Does it exist at all?  Or is it just there for completeness of pinyin?

...I found a rua2 in the _Xiandai Hanyu Cidian_, U+633C, glossed as 1)
"(zhi huo bu) zhou" ((paper or cloth) wrinkle) and 2) "kuai yao po" (about
to break).  It's marked "fang" (dialect).  There's a pointer to the same
character, pronounced ruo2, which in turn is glossed as "roucuo" (to rub),
and marked "shu" (bookish).


Thomas Chan
[EMAIL PROTECTED]





Re: pinyin syllable `rua'

2003-03-14 Thread Andrew C. West
On Fri, 14 Mar 2003 08:43:19 -0800 (PST), Werner LEMBERG wrote:

> Some lists of pinyin syllables contain `rua', but I actually can't
> find any Chinese character with this name.
> 
> Does it exist at all?  Or is it just there for completeness of pinyin?

This is not really a Unicode question, and there are probably other forums which
are more qualified to pontificate on the idiosyncrasies of Chinese
pronunciation. But, for what it's worth, ...

There are no characters in the Unihan database that are given a Mandarin pinyin
reading of "rua". However, as has previously been pointed out on this list, the
Mandarin readings given in the Unihan database are somewhat erratic, and should
not be relied on as a definitive authority.

According to one pinyin chart on the internet
(http://www.cohums.ohio-state.edu/deall/jin.3/c231/refs/p2w.htm) "rua" is an
"oral or dialectal syllable", and as such probably does not represent a standard
Mandarin pronunciation. The only character that I can find that has a reading of
"rua" is U+633C (which is given readings of NUO4 and RUO2 in the Unihan
database). See for example the list of pinyin readings for GBK characters given
at http://input.foruto.com/gbqpxdm/hpbig5gzl.htm which gives "rua" as one of
four readings for U+633C (luo, rua, ruo, sui). No other character in this list
is given a reading of "rua".

Andrew



Re: Need encoding conversion routines

2003-03-14 Thread Otto Stolz
askq1 askq1 wrote:

Actually, my requirement is straightforward and common, and I believe it 
should be available somewhere on the net.
In particular, I need source code (or some way) for the following requirements:
- Convert Unicode code-point to UTF8 encoding and vice-versa.
- Convert Unicode code-point to UCS2 encoding and vice-versa.
- Convert Unicode code-point to UTF16 encoding and vice-versa.
http://czyborra.com/utf/
ftp://www.unicode.org/Public/PROGRAMS/CVTUTF/ConvertUTF.c
Cheers,
  OS



Re: per-character "stories" in a database (derives from Re: geometric shapes)

2003-03-14 Thread William Overington
Markus Scherer wrote as follows.

quote

It has been suggested many times to build a database (list, document, XML,
...) where each designated/assigned code point and each character gets its
"story": Comments on the glyphs, from what codepage it was inherited, usage
comments and examples, alternate names, etc.

I am talking about both code points and "characters" on purpose, and I
would go a step beyond documenting what's there. All the "characters" that
can be represented by a sequence of assigned Unicode characters should be
listed, with that sequence (or those sequences), and with further
explanation if necessary.

end quote

Yes, that is a very good point.  I have become interested in the languages
of the Indian subcontinent from the standpoint of trying to ensure that they
can be displayed properly using interactive television using portable font
technology, however I am not a linguist and I find it strange that the
Unicode Standard does not codify the ligatures which can be produced with
the languages of the Indian subcontinent at display time using specific
sequences of regular Unicode characters so that someone skilled in the art
of font design may design a font from the code charts.

Later he wrote.

quote

Now we just need to
- find someone to sponsor this effort technically and with humanpower
- squeeze the existing information out of the standard, the mailing lists,
FAQs, and of course out of the Unicode veterans before they retire by
Unicode 6...

end quote

Well, how about an approach like Project Gutenberg uses for proofreading
transcripts of classic books.  If there were a database where people could
post items about particular characters and people could read them and either
confirm what is said or put some other view or just add some other
information, then maybe the database could just sort of gradually become
generated over a period of years.  How big would that be?  About 100
thousand code points at, say, 200 words for each on average at about 5 or 6
characters per word on average with a space following each word would be
about 130 megabytes in total.  I fully realize that the phrase "sort of
gradually" might easily be quoted in a response to this posting, yet if the
database facility were there, accessible directly from the web, there may
well be many people who would stop by for a while and review what has been
entered and add a little more to the database.
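
The size estimate above can be checked with a few lines of C. This is only a sketch: the function name is made up for illustration, 6.5 characters per word is taken as the midpoint of "5 or 6" plus one trailing space, and one byte per character is an assumption that holds for plain ASCII English text.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: rough size in bytes of the proposed database.
 * chars_per_word_x10 is the average word length including the trailing
 * space, in tenths of a character (65 means 6.5), to stay in integers. */
uint64_t estimate_database_bytes(uint64_t code_points,
                                 uint64_t words_per_entry,
                                 uint64_t chars_per_word_x10)
{
    assert(chars_per_word_x10 > 0);
    return code_points * words_per_entry * chars_per_word_x10 / 10;
}
```

With 100000 code points, 200 words each, and 65 tenths of a character per word, the result is 130000000 bytes, which matches the figure of about 130 megabytes.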

>PS: Sorry, I am not in a position to volunteer...

Well, it could be more of an informal thing.  If the facility were set up,
then people who are interested could simply visit the web site when they
felt like participating.  Certainly there might be a core of people who had
the ability to throw out rubbish and to convert fragments of text into a
good English narrative so that there was some overall structure to it all,
yet it does not necessarily need to be as formal and rigid as if it were a
commercial project with a time deadline, particularly if the alternative is
that it does not get done at all.

William Overington

14 March 2003














Re: Need encoding conversion routines

2003-03-14 Thread askq1 askq1
From: "Pim Blokland" <[EMAIL PROTECTED]>

However, you have said this is not what you want!
So what is it that you do want?
I want c/c++ code that will give me UTF8 byte sequence representing a given 
code-point, UTF16 16 bits sequence representing a given code-point, UTF32 
32 bits sequence representing a given code-point.

e.g.

UTF8_Sequence CodePointToUTF8(Unichar codePoint)
{
   //I need this code
}
UTF16_Sequence CodePointToUTF16(Unichar codePoint)
{
   //I need this code
}
UCS2_Sequence CodePointToUCS2(Unichar codePoint)
{
   //I need this code
}
Thanks,
~ K.




pinyin syllable `rua'

2003-03-14 Thread Werner LEMBERG

Some lists of pinyin syllables contain `rua', but I actually can't
find any Chinese character with this name.

Does it exist at all?  Or is it just there for completeness of pinyin?


Werner



Re: Need encoding conversion routines

2003-03-14 Thread Pim Blokland
askq1 askq1 schreef:

> Character U+4321 is the unicode code-point but to store this character
> into a file we need to use a certain encoding format.

Yes. That depends on the implementation. If your character is kept in
memory as a 16-bit type, that's simply a short integer with the
hex value 0x4321, or decimal 17185. (Whether this is signed or
unsigned, little-endian or big-endian doesn't matter.) Now if you
want to convert this, you call the appropriate conversion routine
from the CVTUTF library.
E.g. if you need UTF-8 output, you supply the ConvertUTF16toUTF8
function with pointers to this character and your output buffer, and
you end up with the bytes 0xE4, 0x8C, 0xA1 in your output buffer.
You can then dump this buffer to the file you mentioned.
However, you have said this is not what you want!
So what is it that you do want?
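
For the "vice-versa" direction, the bytes 0xE4, 0x8C, 0xA1 can be turned back into the code point 0x4321 with a small decoder. The following is a minimal sketch rather than the CVTUTF code: the function name is hypothetical, and it decodes a single sequence from the start of a buffer, rejecting truncated, overlong, surrogate, and out-of-range input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one UTF-8 sequence from s[0..len).
 * On success stores the code point in *cp and returns the number of
 * bytes consumed (1..4); returns 0 on malformed input. */
size_t utf8_to_code_point(const uint8_t *s, size_t len, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (len == 0) {
        return 0;
    }
    uint32_t c = s[0];
    size_t n;       /* total length of the sequence */
    uint32_t min;   /* smallest code point allowed for that length */
    if (c < 0x80) {
        *cp = c;
        return 1;
    } else if ((c & 0xE0) == 0xC0) {
        n = 2; min = 0x80;    c &= 0x1F;
    } else if ((c & 0xF0) == 0xE0) {
        n = 3; min = 0x800;   c &= 0x0F;
    } else if ((c & 0xF8) == 0xF0) {
        n = 4; min = 0x10000; c &= 0x07;
    } else {
        return 0;   /* stray trail byte or invalid lead byte */
    }
    if (len < n) {
        return 0;   /* truncated sequence */
    }
    for (size_t i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) {
            return 0;   /* not a trail byte */
        }
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) {
        return 0;   /* overlong form, out of range, or surrogate */
    }
    *cp = c;
    return n;
}
```

Fed the three bytes E4 8C A1, it returns 3 and stores 0x4321.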

Pim Blokland




RE: per-character "stories" in a database

2003-03-14 Thread Dominikus Scherkl
> It has been suggested many times to build a database (list, 
> document, XML, ...) where each 
> designated/assigned code point and each character gets its 
> "story": Comments on the glyphs, from 
> what codepage it was inherited, usage comments and examples, 
> alternate names, etc.
> 
> I am talking about both code points and "characters" on 
> purpose, and I would go a step beyond 
> documenting what's there. All the "characters" that can be 
> represented by a sequence of assigned 
> Unicode characters should be listed, with that sequence (or 
> those sequences), and with further 
> explanation if necessary.

But don't we already have this, at least for newer additions?
After all, proposals to add new characters usually offer quite a lot of
background (if only to increase the chance of really getting the
character added?!? :-).
Where does all this wonderful information go after the character is added?

Best regards.
-- 
Dominikus Scherkl
[EMAIL PROTECTED]



Re: Need encoding conversion routines

2003-03-14 Thread askq1 askq1
From: "Pim Blokland" <[EMAIL PROTECTED]>
To: "Unicode mailing list" <[EMAIL PROTECTED]>
Subject: Re: Need encoding conversion routines
Date: Fri, 14 Mar 2003 12:30:44 +0100
askq1 askq1 wrote:

> In particular I need source code (or some way) for following
> requirements:
> - Convert Unicode code-point to UTF8 encoding and vice-versa.
> - Convert Unicode code-point to UCS2 encoding and vice-versa.
> - Convert Unicode code-point to UTF16 encoding and vice-versa.
Ahem. Unicode *IS* UTF-8, UTF-16 and UCS-2. For instance, codepoint
U+4321 has the value (hex) 4321, which is defined as its Unicode
value. This is the same in any encoding. So I'm not sure what you
want. If the C routines at
http://www.unicode.org/Public/PROGRAMS/CVTUTF/ don't do it for you,
which conversion do you need? LE byte order to BE and back?
Canonical decomposing? Fallback character substitutions? BOM
insertion? What?
Yes, I agree with what you are saying above. Let me explain what I want.
Character U+4321 is the Unicode code point, but to store this character in 
a file we need to use a certain encoding format.
E.g. there must be some algorithm to find *the sequence of bytes* that 
represents this character in the *UTF8 encoding*. Similar algorithms must 
exist for the UTF16 and UCS2 encodings. I want a C implementation of such 
algorithms.
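
As an illustration of the UTF16 case asked about here, the following is a rough sketch in C, not taken from any library. The names utf16_encode and utf16_decode are hypothetical. It works on 16-bit code units; writing those units to a file additionally requires choosing a byte order, which this sketch does not cover. UCS2 corresponds to the one-unit case only, so code points above U+FFFF cannot be represented in it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as 1 or 2 UTF-16 code units.
   Returns the number of units written, or 0 for a surrogate code point
   or a value above U+10FFFF. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;
    if (cp < 0x10000) {                 /* BMP: one unit, same as UCS-2 */
        out[0] = (uint16_t)cp;
        return 1;
    }
    if (cp > 0x10FFFF)
        return 0;
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}

/* Hypothetical helper: decode one code point from in[0..len).
   Returns the number of units consumed, or 0 on ill-formed input. */
size_t utf16_decode(const uint16_t *in, size_t len, uint32_t *cp)
{
    assert(in != NULL && cp != NULL);
    if (len == 0)
        return 0;
    uint16_t hi = in[0];
    if (hi < 0xD800 || hi > 0xDFFF) {   /* ordinary BMP unit */
        *cp = hi;
        return 1;
    }
    if (hi > 0xDBFF || len < 2)         /* lone low surrogate, or truncated */
        return 0;
    uint16_t lo = in[1];
    if (lo < 0xDC00 || lo > 0xDFFF)     /* high surrogate without a low one */
        return 0;
    *cp = 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
    return 2;
}
```

For U+4321 the encoder yields the single unit 0x4321; a supplementary code point such as U+1D11E yields the pair 0xD834, 0xDD1E.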

Thanks,
~ K.
Pim Blokland






Re: Need encoding conversion routines

2003-03-14 Thread Pim Blokland
askq1 askq1 wrote:

> In particular I need source code (or some way) for following
> requirements:
> - Convert Unicode code-point to UTF8 encoding and vice-versa.
> - Convert Unicode code-point to UCS2 encoding and vice-versa.
> - Convert Unicode code-point to UTF16 encoding and vice-versa.

Ahem. Unicode *IS* UTF-8, UTF-16 and UCS-2. For instance, codepoint
U+4321 has the value (hex) 4321, which is defined as its Unicode
value. This is the same in any encoding. So I'm not sure what you
want. If the C routines at
http://www.unicode.org/Public/PROGRAMS/CVTUTF/ don't do it for you,
which conversion do you need? LE byte order to BE and back?
Canonical decomposing? Fallback character substitutions? BOM
insertion? What?

Pim Blokland





Re: Need encoding conversion routines

2003-03-14 Thread jameskass
.
Pim Blokland wrote in response to "K.",

> > another. (UTF8 to UTF16 etc.) But how can I convert from a
> > unicode code-point to some encoding and decode from some
> > encoding to unicode code-point? In brief, I want encoding
> > and decoding functions for Unicode.
> 
> What is "some encoding"? You mean codepage-based 8-bit character
> sets?
> I once wrote a conversion routine, under Windows, that can convert
> from UCS-2 to any codepage Windows supports and back, by using the
> info in the appropriate *.nls files.
> If this is what you want, and you can't find it elsewhere, I can
> upload those routines somewhere if you like.
> 

One method of doing this is to simply make the source file as HTML,
open it in the Internet Explorer browser with the [View] - [Encoding]
set to the appropriate encoding, and then [File] - [Save As]
"Unicode (UTF-8)".  That's pretty painless.

In fact, [File] - [Save As] "User Defined" will even save the
page using Unicode NCRs rather than UTF-8.

It should work both ways, but, when saving from Unicode to any
older code page, any material not covered by the older code
page would be lost.
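
The loss described above can be sketched in C. This is only an illustration: it assumes ISO-8859-1 as the older code page (where bytes 0x00 to 0xFF map directly to U+0000 to U+00FF), uses '?' as the substitute, and the function name to_latin1_lossy is hypothetical. It says nothing about what Internet Explorer itself does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert n code points to ISO-8859-1 bytes.
   Code points above U+00FF have no Latin-1 equivalent and are replaced
   by '?'. out must have room for n bytes. Returns how many code points
   were replaced, i.e. lost. */
size_t to_latin1_lossy(const uint32_t *cps, size_t n, unsigned char *out)
{
    size_t lost = 0;
    assert(cps != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (cps[i] <= 0xFF) {
            out[i] = (unsigned char)cps[i];   /* identity mapping */
        } else {
            out[i] = '?';                     /* not covered by the code page */
            lost++;
        }
    }
    return lost;
}
```

Going the other way (Latin-1 bytes to code points) loses nothing, which is why the round trip is only safe in one direction.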

Best regards,

James Kass
.



Re: Tolkien wanta-be has created entirely new language

2003-03-14 Thread jameskass
.
C.T.M.  Jacobs wrote,

> And yet Tengwar seems to redefine the ASCII range ...
> 
> http://www.gis.net/~dansmith/fonts/tengwar.htm
> 

Not Tengwar itself, of course, but rather the various font developers
who have made these "custom-encoded" fonts.

The less charitable among us call such fonts "hack fonts".

Many examples exist, and such redefinition of the ASCII ranges
predates the Unicode Standard.  During the 1980's, before the
Unicode Standard, and even into the 90's, when Unicode support
was minimal, font developers and end users really didn't have
any other option.  Such custom encoding was quite common.

Now that we all know better (smile) we should avoid making
custom encoded fonts.  The developers of such fonts (who are
still making and maintaining these fonts) should be urged to
make conformant fonts.  Custom encoded fonts will be phased
out, eventually.  Meanwhile, there are still some great archives
of non-Standard fonts, and users of antiquated computers can
still find such fonts useful in a limited fashion.

Best regards,

James Kass
.



Re: Need encoding conversion routines

2003-03-14 Thread askq1 askq1
From: "Pim Blokland" <[EMAIL PROTECTED]>

What is "some encoding"? You mean codepage-based 8-bit character
sets?
By encoding I mean UTF8, UTF16, UTF32, UCS2 etc.
Actually my requirement is straightforward and common, and I believe it should 
be available somewhere on the net.
In particular I need source code (or some way) for the following requirements:
- Convert Unicode code-point to UTF8 encoding and vice-versa.
- Convert Unicode code-point to UCS2 encoding and vice-versa.
- Convert Unicode code-point to UTF16 encoding and vice-versa.
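
For the "vice-versa" direction in the first requirement, a minimal sketch in C of turning UTF8 bytes back into a code point might look like the following. The name utf8_decode and its return convention are assumptions for illustration; it rejects overlong forms, surrogates, and values above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from in[0..len).
   Returns the number of bytes consumed, or 0 on ill-formed input. */
size_t utf8_decode(const unsigned char *in, size_t len, uint32_t *cp)
{
    size_t n;        /* expected sequence length */
    uint32_t v;      /* accumulated value */
    uint32_t min;    /* smallest value allowed for this length */

    assert(in != NULL && cp != NULL);
    if (len == 0)
        return 0;
    if (in[0] < 0x80) {                      /* single byte, ASCII */
        *cp = in[0];
        return 1;
    } else if ((in[0] & 0xE0) == 0xC0) {
        n = 2; v = in[0] & 0x1F; min = 0x80;
    } else if ((in[0] & 0xF0) == 0xE0) {
        n = 3; v = in[0] & 0x0F; min = 0x800;
    } else if ((in[0] & 0xF8) == 0xF0) {
        n = 4; v = in[0] & 0x07; min = 0x10000;
    } else {
        return 0;                            /* stray trail byte or invalid lead */
    }
    if (len < n)
        return 0;                            /* truncated sequence */
    for (size_t i = 1; i < n; i++) {
        if ((in[i] & 0xC0) != 0x80)
            return 0;                        /* not a trail byte */
        v = (v << 6) | (uint32_t)(in[i] & 0x3F);
    }
    if (v < min || v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF))
        return 0;                            /* overlong, out of range, surrogate */
    *cp = v;
    return n;
}
```

Given the bytes 0xE4, 0x8C, 0xA1 it consumes 3 bytes and yields 0x4321.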

Thanking you in advance,
~ K.
Pim Blokland





Progressing genealogical symbols

2003-03-14 Thread Michael Everson
At 10:01 +0100 2003-03-14, Otto Stolz wrote:
Dominikus Scherkl had written:
Has anybody meanwhile contributed a proposal regarding the missing 
genealogical symbols? (after Otto Stolz's message from 24.02.2003 I 
wondered this was not proposed long ago - or was it and was 
rejected for some reason?!?).
So do I -- however, I am not in a position to author a formal 
proposal: I am not able to provide a font, and I do not have the 
time for research beyond the evidence I have already given.
Perhaps you would consider making a donation to the Script Encoding Initiative.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com


Re: Tolkien wanta-be has created entirely new language

2003-03-14 Thread c.t.m. jacobs

- Original Message - 
From: "Doug Ewell" <[EMAIL PROTECTED]>
To: "Unicode Mailing List" <[EMAIL PROTECTED]>
Cc: "Ruddy, James" <[EMAIL PROTECTED]>; "John Hudson" <[EMAIL PROTECTED]>
Sent: Thursday, March 13, 2003 11:09 PM
Subject: Re: Tolkien wanta-be has created entirely new language

[ ... ]

> Make sure they are Unicode-enabled; you really don't
> want to get into the business of redefining the ASCII range (unless it's
> a cipher for ASCII; see below).

[ ... ]

And yet Tengwar seems to redefine the ASCII range ...

http://www.gis.net/~dansmith/fonts/tengwar.htm



Re: Need encoding conversion routines

2003-03-14 Thread Pim Blokland
askq1 askq1 wrote:

> another. (UTF8 to UTF16 etc.) But how can I convert from a
> unicode code-point to some encoding and decode from some
> encoding to unicode code-point? In brief, I want encoding
> and decoding functions for Unicode.

What is "some encoding"? You mean codepage-based 8-bit character
sets?
I once wrote a conversion routine, under Windows, that can convert
from UCS-2 to any codepage Windows supports and back, by using the
info in the appropriate *.nls files.
If this is what you want, and you can't find it elsewhere, I can
upload those routines somewhere if you like.

Pim Blokland





Unicode 4.0 chapter headings and numbering.

2003-03-14 Thread William Overington
I wonder if you could please say whether the Unicode 4.0 book will have the
same chapter headings and numbering as the Unicode 3.0 book?

My reason for asking is that I am writing a paper about the possible
problems with using languages of the Indian subcontinent on the DVB-MHP
(Digital Video Broadcasting - Multimedia Home Platform) interactive
television platform where the PFR0 Portable Font Resource system is used for
those fonts which are broadcast.  DVB-MHP uses Java and Unicode.  I want to
refer to Chapter 9 South and Southeast Asian Scripts as the place to look
for the details of what is necessary, yet the paper needs to be usable both
before and after the publication of Unicode 4.0, so I would like to know if
the chapter headings and numbering will be unchanged please as I would like
simply to refer readers to Chapter 9 South and Southeast Asian Scripts of
the Unicode Standard at the http://www.unicode.org webspace.

Also, if unchanged, is that a matter of continuing stability for future
issues as well, or is it just for Unicode 4.0 please?

William Overington

14 March 2003






Re: Ligatures fj etc (from Re: Ligatures (qj) )

2003-03-14 Thread William Overington
Yesterday, 13 March 2003, I wrote as follows.

quote

So I reasoned that the system might scan through a font when it is loaded
and decide upon the lowest point for the whole font and then proceed on that
basis.

end quote

An email correspondent has kindly written to me privately and I now know
that it is not necessary for an application such as a wordprocessing package
to make a complete survey of all the glyphs in a font as the font is being
loaded, because the information on what are the high and low points for the
font is readily available in predefined locations within the font.

I expect that many readers of this list already know that, yet I feel that I
should post this note in case some readers do not because I would not want
to have set them off on a wrong way of looking at how a system works.

William Overington

14 March 2003








Re: New document.

2003-03-14 Thread Otto Stolz
Dominikus Scherkl had written:
> Has anybody meanwhile contributed a proposal
> regarding the missing genealogical symbols ?
> (after Otto Stolz's message from 24.02.2003 I wondered this
> was not proposed long ago - or was it and was rejected for
> some reason?!?).
So do I -- however, I am not in a position to author
a formal proposal: I am not able to provide a font,
and I do not have the time for research beyond the
evidence I have already given.
Rick McGowan wrote:

Do you have any documentation on these?


The two scans under
  http://www.rz.uni-konstanz.de/Antivirus/tests/li.png
  http://www.rz.uni-konstanz.de/Antivirus/tests/re.png
are from the authoritative (until July 1996) book on German
orthography: Duden "Rechtschreibung der deutschen Sprache
und der Fremdwörter" / hrsg. von d. Dudenred. auf d. Grundlage
d. amtl. Rechtschreibregeln. [Red.Bearb.: Werner Scholze-
Stubenrecht unter Mitw. von Dieter Berger ...]. - 19., neu bearb.
u. erw. Aufl. ISBN: 3-411-20900-3.
Best wishes,
  Otto Stolz



Re: sorting order between win98/xp

2003-03-14 Thread Michael (michka) Kaplan
I have tried it on Win98, WinME, Win2000, WinXP, and Windows Server
2003. LCMapStringA/CompareStringA on all five and
LCMapStringW/CompareStringW on the NT-ish platforms.

I do not and cannot repro the reported problem of your colleague.

Please have that colleague send me some email with their repro
scenario, if they have one (I don't think they do, as the code on the
NT platforms does not have this functionality, even as an option).

MichKa

- Original Message - 
From: "Yung-Fong Tang" <[EMAIL PROTECTED]>
To: "Michael (michka) Kaplan" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Thursday, March 13, 2003 5:20 PM
Subject: Re: sorting order between win98/xp


>
> Do you use
> LCMapStringW on WinXP and LCMapStringA on Win98 WITH LCMAP_SORTKEY to
> generate the SORT KEY?
>
> Have you tried on both platforms (Win98 and WinXP)?
>
>
> Michael (michka) Kaplan wrote:
>
> >LCMapString does not do the reported behavior either. CompareString
> >and LCMapString are based on the same data and return the same
> >results.
> >
> >Your colleague is mistaken.
> >
> >MichKa
> >
> >- Original Message - 
> >From: "Yung-Fong Tang" <[EMAIL PROTECTED]>
> >To: "Michael (michka) Kaplan" <[EMAIL PROTECTED]>
> >Cc: <[EMAIL PROTECTED]>
> >Sent: Thursday, March 13, 2003 4:31 PM
> >Subject: Re: sorting order between win98/xp
> >
> >
> >>We cannot use that. The function you mention is to compare two
> >>Unicode strings.
> >>We need the function to "generate sort key" from Unicode strings
> >>instead of comparing two strings.
> >>
> >>Michael (michka) Kaplan wrote:
> >>
> >>>From: "Yung-Fong Tang" <[EMAIL PROTECTED]>
> >>>
> >>>> One of my colleagues asked me this question.
> >>>
> >>>In the interests of completeness
> >>>
> >>>The function that does the type of sorting your colleague noted is
> >>>StrCmpLogicalW in shlwapi.dll, version 5.5 and later. See the
> >>>following link for more information (all on one line in the
> >>>browser):
> >>>
> >>>http://msdn.microsoft.com/library/en-us/shellcc/platform/shell/reference/shlwapi/string/strcmplogicalw.asp
> >>>
> >>>MichKa