RE: History of Kazakh characters in Unicode

2000-11-15 Thread Jonathan Rosenne

I remember being shown in the ECMA bidi WG a document from China that specified
the use of the Arabic script for Kazakh (I think it was Kazakh), which was
somewhat different from ISO-8859-6 and ASMO. I remember they had fewer shapes.

Jony

> -Original Message-
> From: Michael (michka) Kaplan [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, November 16, 2000 6:41 AM
> To: Unicode List
> Subject: Re: History of Kazakh characters in Unicode
>
>
> Most of these characters came from existing standards that were included in
> Unicode, rather than separately requested character additions. There are
> some exceptions for Cyrillic, and possibly for Arabic but that one I am not
> 100% sure about.
>
> But most of them have been there all along based on compatibility with the
> original ISO 8859, MS, IBM, and other legacy code pages.
>
> michka
>
> a new book on internationalization in VB at
> http://www.i18nWithVB.com/
>
> - Original Message -
> From: "Kairat A. Rakhim" <[EMAIL PROTECTED]>
> To: "Unicode List" <[EMAIL PROTECTED]>
> Sent: Wednesday, November 15, 2000 8:24 PM
> Subject: History of Kazakh characters in Unicode
>
>
> > Hello,
> >
> > I'm writing an article about the history of Kazakh and other Turkic alphabets.
> > Could you help me with the history of their inclusion in Unicode?
> > Who proposed the characters specific to Kazakh and the other
> > Turkic languages in the Arabic, Latin and Cyrillic scripts? How can I
> > contact them?
> >
> > Thank you in advance,
> >
> > Kairat A. Rakhim,
> > Regional Universal Science Library of Karaganda,
> > KAZAKHSTAN
> >
> >
> >
>




Re: History of Kazakh characters in Unicode

2000-11-15 Thread Michael \(michka\) Kaplan

Most of these characters came from existing standards that were included in
Unicode, rather than separately requested character additions. There are
some exceptions for Cyrillic, and possibly for Arabic but that one I am not
100% sure about.

But most of them have been there all along based on compatibility with the
original ISO 8859, MS, IBM, and other legacy code pages.

michka

a new book on internationalization in VB at
http://www.i18nWithVB.com/

- Original Message -
From: "Kairat A. Rakhim" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Wednesday, November 15, 2000 8:24 PM
Subject: History of Kazakh characters in Unicode


> Hello,
>
> I'm writing an article about the history of Kazakh and other Turkic alphabets.
> Could you help me with the history of their inclusion in Unicode?
> Who proposed the characters specific to Kazakh and the other
> Turkic languages in the Arabic, Latin and Cyrillic scripts? How can I
> contact them?
>
> Thank you in advance,
>
> Kairat A. Rakhim,
> Regional Universal Science Library of Karaganda,
> KAZAKHSTAN
>
>
>




History of Kazakh characters in Unicode

2000-11-15 Thread Kairat A. Rakhim

Hello,

I'm writing an article about the history of Kazakh and other Turkic alphabets.
Could you help me with the history of their inclusion in Unicode?
Who proposed the characters specific to Kazakh and the other
Turkic languages in the Arabic, Latin and Cyrillic scripts? How can I
contact them?

Thank you in advance,

Kairat A. Rakhim,
Regional Universal Science Library of Karaganda,
KAZAKHSTAN





Re: Unicode not approved by China

2000-11-15 Thread Kenneth Whistler

Bjorn Stabell reported:

> http://linuxfab.cx/indexNewsData.php?NEWSID=2949&FIRSTHIT=1
> 
> According to this news item (in Chinese), China rejected HK's
> application to use Unicode, and instead says they have to use
> ISO 10646-1:2000 or GB18030.  Apparently they don't like to
> standardize on a standard controlled by an organization of
> commercial companies, like Unicode.

This is not an uncommon reaction among officious organizations
that think only ISO or governments can create reliable, open
standards.

It is the basic reason why the Unicode Consortium goes to
such lengths to guarantee that the Unicode Standard is
*exactly* aligned with ISO 10646 (as noted repeatedly in
the standard itself and on the Unicode website).

> 
> This is confusing.  Nobody implements ISO 10646-1:2000 as
> such, they just implement Unicode, right? 

Right.

> I thought the two
> standards were equivalent? 

They are. And we went the extra mile with JTC1/SC2/WG2 to ensure
that ISO 10646-1:2000 and the Unicode Standard, Version 3.0, were
not only equivalent, but also published more or less simultaneously,
with the same publication year.

The charts and name lists for the two standards were even driven
off the same data sources and using the same suite of fonts, to
guarantee synchronization. 

> We're using Unicode because of
> practical reasons, because there's a lot of applications supporting
> it and it solves the character set problem.  What do you suggest
> we do, being based in Beijing, China?

Implement 10646-1:2000 and tell the government of China that that
is what you are doing.

Of course, in order to implement 10646-1:2000, you will need an
extensive set of guidelines on implementation issues. And I guess
you know where to look for those.

> 
> In December, the Chinese will go to Taiwan to try to settle on a
> common encoding.

Interesting.

--Ken Whistler




Unicode not approved by China

2000-11-15 Thread Bjorn Stabell

http://linuxfab.cx/indexNewsData.php?NEWSID=2949&FIRSTHIT=1

According to this news item (in Chinese), China rejected HK's
application to use Unicode, and instead says they have to use
ISO 10646-1:2000 or GB18030.  Apparently they don't like to
standardize on a standard controlled by an organization of
commercial companies, like Unicode.

This is confusing.  Nobody implements ISO 10646-1:2000 as
such, they just implement Unicode, right?  I thought the two
standards were equivalent?  We're using Unicode because of
practical reasons, because there's a lot of applications supporting
it and it solves the character set problem.  What do you suggest
we do, being based in Beijing, China?

In December, the Chinese will go to Taiwan to try to settle on a
common encoding.

Kind regards,
-- 
Bjorn Stabell <[EMAIL PROTECTED]>
Exoweb - One-to-one web solutions
w http://www.exoweb.net/
t +86 13701174004



Re: [idn] Javascript code charts, unicode converter, show-characters

2000-11-15 Thread Mark Davis



I believe that result is incorrect. The RACE has 48 bytes, so 44 bytes of
Base32. That translates to 44 * 5 bits = 220 bits, or 27 bytes of compressed
UTF-16. That must represent *at least* 13 UTF-16 characters, but the enclosed
file only has 5 Hangul syllables. If that was generated programmatically, the
program is wrong.
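[Editor's note: the arithmetic above can be reproduced in a few lines. This is a sketch only: the constants and the at-most-2-bytes-per-code-unit bound come from the message itself, and RACE's actual encoding details are not modeled.]

```python
def min_utf16_units(race_label: str) -> int:
    """Lower bound on the UTF-16 code units a RACE label encodes,
    following the back-of-envelope arithmetic in the message above."""
    base32_len = len(race_label) - len("BQ--")  # drop the RACE prefix
    bits = base32_len * 5                       # Base32 carries 5 bits/char
    payload_bytes = bits // 8                   # decoded payload size
    # Compressed UTF-16 uses at most 2 bytes per code unit, so the
    # payload must stand for at least payload_bytes // 2 code units.
    return payload_bytes // 2

# A 48-character label, as in the message: 44 Base32 chars -> 220 bits
# -> 27 bytes -> at least 13 UTF-16 code units.
print(min_utf16_units("BQ--" + "A" * 44))  # -> 13
```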
 
Mark

- Original Message -
From: J. William Semich
To: Rick H Wesson; Mark Davis
Cc: Unicore; Unicode; [EMAIL PROTECTED]; w3c-i18n-ig
Sent: Wednesday, November 15, 2000 09:32
Subject: Re: [idn] Javascript code charts, unicode converter, show-characters

Here's the UTF-8 encoding of the Hangul (attached)

Bill Semich
WorldNames, Inc

At 08:54 AM 11/15/00 -0800, Rick H Wesson wrote:
>
> On Wed, 15 Nov 2000, Mark Davis wrote:
>
> > (Paul noted that someone had registered
> > "BQ--3AADEABQAAYAAMQAMYAGSADGABQ4NVFU3THPLTTUW" with VGRS. My program
> > says it's an error -- it appears to have an extra W at the end. The
> > source text appears to be hangul: [Hangul text garbled in transmission])
>
> BQ--3AADEABQAAYAAMQAMYAGSADGABQ4NVFU3THPLTTUW is not registered, however;
> BQ--3AADEABQAAYAAMQAMYAGSADGABQ4NVFU3THPLTTUWTOA.COM is registered.
>
>   Domain Name: BQ--3AADEABQAAYAAMQAMYAGSADGABQ4NVFU3THPLTTUWTOA.COM
>   Registrar: GABIA, INC.
>   Whois Server: whois.name7.com
>   Referral URL: www.name7.com
>   Name Server: NS1.NAME7.COM
>   Name Server: NS2.NAME7.COM
>   Updated Date: 10-nov-2000
>
> -rick

Bill Semich
President and Founder
WorldNames, Inc.
http://www.worldnames.net
[EMAIL PROTECTED]


Re: Java and Unicode

2000-11-15 Thread Markus Scherer

Please let's keep types for single characters and types for strings separate.

ICU used to be in the same situation as Java: everything character/string used
16-bit types.
In extending to UTF-16, we decided to keep the string base type at 16 bits for
very good reasons like interoperability and memory consumption.
For single characters, ICU changed APIs from 16-bit to 32-bit types.

In the case of Java, the equivalent course of action would be to stick with a 16-bit 
char as the base type for strings. The int type could be used in _additional_ APIs for 
single Unicode code points, deprecating the old APIs with char.

Whatever Sun decides to do with single characters, it will be most reasonable to keep 
the string encoding the same and just treat it as UTF-16 where that makes a difference.
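[Editor's note: the design described above -- 16-bit string storage with 32-bit single-character APIs -- can be sketched as follows. The function name is illustrative, not ICU's or Java's actual API; a string is modeled as a plain list of 16-bit code units.]

```python
def code_point_at(units: list[int], i: int) -> int:
    """Return the full code point (an int, up to 0x10FFFF) at index i of
    a UTF-16 code-unit sequence, combining a surrogate pair if present."""
    hi = units[i]
    if 0xD800 <= hi <= 0xDBFF and i + 1 < len(units):
        lo = units[i + 1]
        if 0xDC00 <= lo <= 0xDFFF:
            return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
    return hi  # a BMP code unit (or an unpaired surrogate, passed through)

# Storage stays 16-bit; only the single-character API widens to int.
units = [0x0041, 0xD800, 0xDC00]     # "A" followed by U+10000
print(hex(code_point_at(units, 1)))  # -> 0x10000
```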

For details, see my presentation at the IUC 17 Unicode conference (2000 September, 
session B2).
(See http://www.unicode.org/ - I am having some trouble with web access right now, so 
I cannot give you the URL...)

markus



Fwd: Changes proposed for Tamil

2000-11-15 Thread AvaFonts

Dear Chris Fynn,

This font should be tried without Uniscribe support and without any other 
fully conformant Tamil fonts to understand the scientific principles behind 
the current recommendations. Of course in real use these supports are 
essential. I'll soon be publishing a new version of the font for real use, where
the opportunity to see the raw data will be reduced.
 
Sinnathurai Srivas

 << The proposal is not acceptable. The current state of allocations is based
on scientific principles. The new proposal is based on usage. It is not only
"ai" (I guess it's not AI as described below), but also au, e, ee, o, oo that
have similar characteristics. Unless all of these are changed to usage-based
ones, it should not be accepted as a solution/change. A mixed solution is a
recipe for disaster. Contextual is only acceptable if all of the
recommendations are contextual. In my opinion the current scientific solution
should be kept intact so that the process of handling the language stays
sophisticated, as it is scientifically based.
  
  As I do not have detailed information on this proposal, if my assumption
about the proposal is wrong please correct me.
  
  I have published a Tamil Unicode Font for test purposes. This proposal and 
the other characters I mentioned above may be better understood by visiting 
(please do not visit if you do not wish to view sicML) 
  http://www.geocities.com/avarangal/tamilunicode.html
  
  Sinnathurai Srivas 
   <
   
   They involve a change to the contextual processing model involving the AI
   vowel.
   
   John F.
>>

Dear Chris Fynn,

here it is. Though intended, somehow it missed the Unicode list before.
<<<
Perhaps it would be more worthwhile to discuss the proposed changes to Tamil 
on the Unicode list [EMAIL PROTECTED] rather than on the OpenType list - at 
the very least some mention should be made on the Unicode list as well.
So far nothing seems to have been said there about these proposed changes. 
BTW, though I don't read Tamil, I seem to get your web page rendered correctly
in all but one or two places without your font installed. (I do have
Microsoft's Tamil IME installed under Win 2K.)

Chris Fynn
Dzongkha Computing Project
Thimphu Bhutan





url address corrected
http://www.geocities.com/avarangal/tamilunicode.html

<< The proposal is not acceptable. The current state of allocations is based
on scientific principles. The new proposal is based on usage. It is not only
"ai" (I guess it's not AI as described below), but also au, e, ee, o, oo that
have similar characteristics. Unless all of these are changed to usage-based
ones, it should not be accepted as a solution/change. A mixed solution is a
recipe for disaster. Contextual is only acceptable if all of the
recommendations are contextual. In my opinion the current scientific solution
should be kept intact so that the process of handling the language stays
sophisticated, as it is scientifically based.
 
 As I do not have detailed information on this proposal, if my assumption about
 the proposal is wrong please correct me.
 
 I have published a Tamil Unicode Font for test purposes. This proposal and 
 the other characters I mentioned above may be better understood by visiting 
 (please do not visit if you do not wish to view sickML) 
 http://www.geocities.com/avarangal/tamilunicode.html
 
 Sinnathurai Srivas 
  <
  
  They involve a change to the contextual processing model involving the AI
  vowel.
  
  John F.
  
   >>
 
  >>






Re: Java and Unicode

2000-11-15 Thread Kenneth Whistler

John O'Conner wrote:

> Yes. If you have been involved with Unicode for any period of time at all, you
> would know that the Unicode consortium has advertised Unicode's 16-bit
> encoding for a long, long time, even in its latest Unicode 3.0 spec. The
> Unicode 3.0 spec clearly favors the 16-bit encoding of Unicode code units, and
> the design chapter (chapter 2) never even hints at a 32-bit encoding form.

Indeed. Though, to be fair, people have been talking about UCS-4 and
then UTF-32 for quite a while now, and the UTF-32 Technical Report has been
approved for half a year.

FYI, on November 9, the Unicode Technical Committee officially voted
to make Unicode Technical Report #19 "UTF-32" a Unicode Standard Annex (UAX).
This will be effective with the rollout of the Unicode Standard, Version
3.1, and will make the 32-bit transformation format a coequal partner
with UTF-16 and UTF-8 as sanctioned Unicode encoding forms.

> 
> The previous 2.0 spec (and previous specs as well) promoted this 16-bit
> encoding too...and even claimed that Unicode was a 16-bit, "fixed-width",
> coded character set. There are lots of reasons why Java's char is a 16-bit
> value...the fact that the Unicode Consortium itself has promoted and defined
> Unicode as a 16-bit coded character set for so long is probably the biggest.

It is easy to look back from the year 2000 and wonder why.

But it is also important to remember the context of 1989-1991. During
that time frame, the loudest complaints were from those who were
proclaiming that Unicode's move from 8-bit to 16-bit characters would
break all software, choke the databases, inflate all documents by
a factor of two, and generally end the world as we knew it.

As it turns out, they were wrong on all counts. But the rhetorical
structure of the Unicode Standard was initially set up to be a hard
sell for 16-bit characters *as opposed to* 8-bit characters.

The implementation world has moved on. Now we have an encoding model
for Unicode that embraces an 8-bit, a 16-bit, *and* a 32-bit encoding
form, while acknowledging that the character encoding per se is
effectively 21 bits. This is more complicated than we hoped for
originally, of course, but I think most of us agree that the incremental
complexity in encoding forms is a price we are willing to pay in order
to have a single character encoding standard that can interoperate
in 8-, 16-, and 32-bit environments.
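[Editor's note: the three encoding forms are easy to compare with Python's codecs. U+10000 below is chosen only as the first code point beyond the BMP.]

```python
ch = "\U00010000"              # first code point outside the BMP

utf8 = ch.encode("utf-8")      # 4 bytes
utf16 = ch.encode("utf-16-be") # a surrogate pair: 2 code units, 4 bytes
utf32 = ch.encode("utf-32-be") # 1 code unit, 4 bytes

print(utf8.hex(), utf16.hex(), utf32.hex())

# The code space itself tops out at U+10FFFF -- effectively 21 bits:
print((0x10FFFF).bit_length())  # -> 21
```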

--Ken





Persian decimal separator

2000-11-15 Thread Roozbeh Pournader


Dear All,

Some time ago, there was a discussion here about the Persian decimal
separator. I am posting a short report about our queries to different
Iranian bodies. Sorry for the long and somewhat formal report, but it seems
important to us.
 
I'm still waiting for responses from Iranian Academy for Sciences (IAS),
and Iranian Mathematical Society (IMS). I have answers from these sources:
 
* Iranian Academy for Persian Language and Literature (IAPLL);
* Iranian Standards and Industrial Research Institute (ISIRI) which is
  the national standard body;
* Iran University Press (IUP), and Fatemi Publishing Institute (FPI),
  which are the largest and highest quality academic publishing houses
  in Iran.
 
I think that IMS will answer the same as FPI, since they seem to use the
same conventions in their books that are not published by either of these two
houses. They certainly use the house rules when they publish with one of
these two, but not with other houses.
 
I also add our conclusions, as current representatives of HCI (the Iranian High
Council of Informatics) on text encoding issues. HCI is the responsible
body for national computing-related standards, a responsibility transferred to
it from ISIRI.
 
1. All sources agree that slash and decimal separator should be considered
   different.
 
2. ISIRI has a character set in their standards (ISIRI 3342, the
   rarely-used national standard) which distinguishes the two characters,
   while not distinguishing hyphen from minus or colon from division sign
   (of which the latter case is really weird). They did not give any
   special comments regarding the standard, since the standards committee
   for character set issues was dissolved a long time ago, and the
   responsibility was handed to the HCI. The standard shows the glyph for
   the decimal separator as described in 4. They also have another
   standard (ISIRI 2901-revised:1994) for keyboards, that distinguishes
   the two characters.
 
3. IUP and FPI have already used the same publishing software, which
   distinguishes these, for more than five years, and IUP distinguished them
   even before that time. They both agree that the sequence ONE SLASH TWO
   means 0.5 and not 1.2. They say this specially because of the need for
   clear interpretation of in-text formulas. (IAPLL sees this
   interpretation as lying beyond its competence, and referred us to the IAS.)
   IUP has also published a scientific style guide which explicitly
   mentions the difference, and asks for a glyph shape described in the
   last part of the next item. (I can provide you with copies of the page
   mentioning this.) We also use software that distinguishes these.
 
4. All except IUP agree that the glyph shape for the decimal separator
   should be a shortened, lowered and possibly more slanted slash. But
   IUP has changed the default behaviour of the mentioned software to
   use a glyph exactly similar to the isolated form of REH (U+0631)
   for the decimal separator. This has been the case even in their old
   books, before their adoption of computer software for publishing.
   But the IUP recommendation in this case is considered old tradition 
   by others, including us, and not acceptable. (I can provide digital
   images of text produced by FPI, IUP, and ourselves.)
 
5. All except IUP agree that in the case of a lacking decimal separator
   in the software, a slash is the best substitute. IUP prefers the
   REH shape in all cases. FPI insisted that using the slash for both
   division and decimal separation is unbearable, and said that in the
   case of a lacking decimal separator glyph, all the text should be
   scanned for uses of slash as a division symbol, and those cases
   transformed to two-dimensional fractions.
 
6. In the case of a missing Persian shape for the decimal separator,
   IUP and FPI (and we also) prefer the Arabic shape over the slash.
   IUP may also prefer the Arabic shape over their REH shape, but that's
   not verified yet. IAPLL prefers the slash over the Arabic glyph.
 
7. All sources agree that for date separation, one should only use the
   slash.
 
 
As final conclusion:
 
  In case of information interchange, when the character set permits, 
  a decimal separator (U+066B) is certainly preferred to a slash (U+002F)
  and must be used. Computer programs should render the Persian U+066B as
  a shortened, lowered, and possibly more slanted slash; this should be
  distinguishable from the slash at first sight (I can provide
  examples). If the Persian shape is lacking and the text context is
  mathematical, the Arabic shape must be used. In other cases, the slash
  shape is acceptable (but will be considered illiterate or
  nonprofessional, somewhat similar to using spaces instead of zero width
  non-joiners).
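[Editor's note: the three code points this report distinguishes can be inspected programmatically; a quick sanity check, nothing more.]

```python
import unicodedata

# The characters the report distinguishes: the decimal separator, the
# plain slash sometimes substituted for it, and IUP's REH-shaped glyph.
for ch in ("\u066B", "\u002F", "\u0631"):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# -> U+066B  ARABIC DECIMAL SEPARATOR
# -> U+002F  SOLIDUS
# -> U+0631  ARABIC LETTER REH
```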
 
(We have not yet received enough responses to our queries about the
thousands separator, but it seems that there will be a lot of disagreement
about this. I can only tell that the national character set and 

Re: Java and Unicode

2000-11-15 Thread John O'Conner



Jungshik Shin wrote:

> That's exactly what I have in mind about Java. I can't help wondering why
> Sun chose 2byte char instead of 4byte char when it was plainly obvious
> that 2byte wouldn't be enough in the very near future. The same can be
> said of Mozilla which internally uses BMP-only as far as I know.
> Was it due to concerns over things like saving memory/storage, etc?

Yes. If you have been involved with Unicode for any period of time at all, you
would know that the Unicode consortium has advertised Unicode's 16-bit
encoding for a long, long time, even in its latest Unicode 3.0 spec. The
Unicode 3.0 spec clearly favors the 16-bit encoding of Unicode code units, and
the design chapter (chapter 2) never even hints at a 32-bit encoding form. The
Java char attempts to capture the basic encoding unit of this 16-bit, widely
accepted encoding method. I'm sure the choice seemed plainly obvious at the
time.

The previous 2.0 spec (and previous specs as well) promoted this 16-bit
encoding too...and even claimed that Unicode was a 16-bit, "fixed-width",
coded character set. There are lots of reasons why Java's char is a 16-bit
value...the fact that the Unicode Consortium itself has promoted and defined
Unicode as a 16-bit coded character set for so long is probably the biggest.

-- John O'Conner





Re: Hindi editor

2000-11-15 Thread jgo

>>> On Wed, 2000 Nov 15 05:18:24 -0800 (GMT-0800) nikita k
>>> <[EMAIL PROTECTED]> wrote:
>>> Is there any text editor by which data can be entered
>>> in Hindi?
>>>
>>> Rgds,
>>> Nikita K

You could use Nisus Writer.  However, we currently have an
unsolved bug in our support of Hindi, as the insertion point
is not always in the correct location, and double-clicking a
"word" does not (necessarily) select said word, etc. but
"data entry" works if you have the correct Language Kit.

John G. Otto Nisus Software, Engineering
www.infoclick.com  www.mathhelp.com  www.nisus.com  software4usa.com
EasyAlarms  PowerSleuth  NisusEMail  NisusWriter  MailKeeper  QUED/M
   My opinions are probably not those of Nisus Software, Inc.





Re: Java and Unicode

2000-11-15 Thread John Jenkins


On Wednesday, November 15, 2000, at 12:08 PM, Roozbeh Pournader wrote:

> 
> 
> On Wed, 15 Nov 2000, Michael (michka) Kaplan wrote:
> 
> > I do not think they are so theoretical, with both 10646 and Unicode
> > including them in the very near future (unless you count it as 
> theoretical
> > when you drop an egg but it has not yet hit the ground!).
> 
> Lemme think. You're saying that when I have not even seen a single egg
> hitting the ground, I should believe that it will hit some day? ;)
> 
> 

Well, you should be expecting about 45,000 eggs within the next six months.  




RE: sort of OT: politics and scripts

2000-11-15 Thread Cathy Wissink

The Soviet language policies under both Lenin and Stalin were amazing in
what they managed to change in a very short time, especially considering the
scripts first shifted from Arabic to Latin, then just a decade or so later
to Cyrillic.  I too have been wondering when there would be a movement in
the post-Soviet, Central Asian countries away from Cyrillic; my assumption
has always been that they would want to return to Arabic (or for others,
back to their indigenous scripts).

Surprisingly, however, in our NLS implementation, the movement is away from
Cyrillic, as you noted, but towards Latin rather than Arabic.  We've seen
this in Azeri and Uzbek, in that we support both Cyrillic and Latin, with
other Central Asian languages likely to use the same script support.  When
asking our language sources and specialists about eventual migration to
Arabic, there seems to be much less interest in it compared to Latin.  So it
might be the case that there is interest in "extended Arabic" from a
historical perspective, but not from a current IT perspective.  

(Then again, world events play a much greater role in language policy than
can often be anticipated, and the trend could change very quickly and this
could all be moot...)

Cathy

-Original Message-
From: Elaine Keown [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 15, 2000 11:53 AM
To: Unicode List
Subject: sort of OT: politics and scripts


Hello, 

A similar question to the question of new Chinese characters and new
versions of characters for Lakota, but an order of magnitude larger, is the
question of ongoing or about-to-hit-us script changes in Central Asia.  

In the 1920s-1940s, under a series of Soviet language policy changes, many
Central Asian languages were converted from Arabic script to Roman to
Cyrillic (or some different permutation even).  Jewish Central Asian
languages were converted from Hebrew to Cyrillic.  

Now as the independent republics take control, there is evidence that the
abandonment of Cyrillic has started, and there is a return to Arabic script.
But not "plain vanilla" Arabic script, but the extended Arabic scripts with
extra symbols.

This gives Unicode an odd "legacy code" problem, indeed.

---Elaine Keown

___

Free Unlimited Internet Access! Try it now! 
http://www.zdnet.com/downloads/altavista/index.html

___



Re: Java and Unicode

2000-11-15 Thread Jungshik Shin

On Wed, 15 Nov 2000, Thomas Chan wrote:

> On Wed, 15 Nov 2000, Jungshik Shin wrote:
> 
> > On Wed, 15 Nov 2000, Michael (michka) Kaplan wrote:
> > > 
> > > Many people try to compare this to DBCS, but it really is not the same
> > > thing; understanding lead bytes and trail bytes in DBCS is *astoundingly*
> > > more complicated than handling surrogate pairs.
> > 
> > Well, it depends on what multibyte encoding you're talking about. In case
> > of 'pure' EUC encodings (EUC-JP, EUC-KR, EUC-CN, EUC-TW) as opposed to
> > SJIS(Windows94?), Windows-949(UHC), Windows-950,  WIndows-125x(JOHAB),
> > ISO-2022-JP(-2), ISO-2022-KR, ISO-2022-CN , it's not that hard (about
> > the same as UTF-16, I believe, especially in case of   EUC-CN and EUC-KR)
> 
> I would move EUC-JP and EUC-TW, and possibly EUC-KR (if you use more than
> KS X 1001 in it) to the "complicated" group because of the shifting bytes
> required to get to different planes/character sets.

Well, EUC-KR has never used character sets other than US-ASCII (or
its Korean variant KS X 1003) and KS X 1001, although a theoretical
possibility is there. A more realistic (although very rarely used; there
are only two known implementations, Hanterm - a Korean xterm - and Mozilla)
complication for EUC-KR arises not from a third character set (KS X
1002) in EUC-KR but from the 8-byte-sequence representation of the (11172-2350)
Hangul syllables not covered by the repertoire of KS X 1001.

As for EUC-JP (which uses JIS X 0201/US-ASCII, JIS X 0208 and JIS X 0212)
and EUC-TW, I know what you're saying. That's exactly why I added at
the end of my prev. message 'especially in case of EUC-CN and EUC-KR'
:-) Probably I should have written that, among multibyte encodings, at least
EUC-CN and EUC-KR are as easy to handle as UTF-16.

Jungshik Shin
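[Editor's note: the "about the same difficulty" comparison can be illustrated side by side. The byte ranges below are the standard ones for "pure" EUC-KR (lead and trail bytes in 0xA1-0xFE); this is a sketch, not a full decoder.]

```python
def count_euc_kr_chars(data: bytes) -> int:
    """Count characters in pure EUC-KR: ASCII is one byte, and a lead
    byte in 0xA1-0xFE starts a two-byte KS X 1001 character."""
    i = n = 0
    while i < len(data):
        i += 2 if 0xA1 <= data[i] <= 0xFE else 1
        n += 1
    return n

def count_utf16_chars(units: list[int]) -> int:
    """Count code points in UTF-16: a high surrogate in 0xD800-0xDBFF
    starts a two-unit surrogate pair. Structurally the same loop."""
    i = n = 0
    while i < len(units):
        i += 2 if 0xD800 <= units[i] <= 0xDBFF else 1
        n += 1
    return n

# "a" plus one Hangul syllable: two characters either way.
print(count_euc_kr_chars("a\uac00".encode("euc-kr")))  # -> 2
print(count_utf16_chars([0x0041, 0xD800, 0xDC00]))     # -> 2
```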




Re: Java and Unicode

2000-11-15 Thread Roozbeh Pournader



On Wed, 15 Nov 2000, Michael (michka) Kaplan wrote:

> I do not think they are so theoretical, with both 10646 and Unicode
> including them in the very near future (unless you count it as theoretical
> when you drop an egg but it has not yet hit the ground!).

Lemme think. You're saying that when I have not even seen a single egg
hitting the ground, I should believe that it will hit some day? ;)





sort of OT: politics and scripts

2000-11-15 Thread Elaine Keown

Hello, 

A similar question to the question of new Chinese characters and new versions of 
characters for Lakota, but an order of magnitude larger, is the question of ongoing or 
about-to-hit-us script changes in Central Asia.  

In the 1920s-1940s, under a series of Soviet language policy changes, many Central 
Asian languages were converted from Arabic script to Roman to Cyrillic (or some 
different permutation even).  Jewish Central Asian languages were converted from 
Hebrew to Cyrillic.  

Now as the independent republics take control, there is evidence that the abandonment 
of Cyrillic has started, and there is a return to Arabic script.  But not "plain 
vanilla" Arabic script, but the extended Arabic scripts with extra symbols.

This gives Unicode an odd "legacy code" problem, indeed.

---Elaine Keown





Re: Java and Unicode

2000-11-15 Thread Thomas Chan

On Wed, 15 Nov 2000, Jungshik Shin wrote:

> On Wed, 15 Nov 2000, Michael (michka) Kaplan wrote:
> > In any case, I think that UTF-16 is the answer here.
> > 
> > Many people try to compare this to DBCS, but it really is not the same
> > thing; understanding lead bytes and trail bytes in DBCS is *astoundingly*
> > more complicated than handling surrogate pairs.
> 
> Well, it depends on what multibyte encoding you're talking about. In case
> of 'pure' EUC encodings (EUC-JP, EUC-KR, EUC-CN, EUC-TW) as opposed to
> SJIS(Windows94?), Windows-949(UHC), Windows-950,  WIndows-125x(JOHAB),
> ISO-2022-JP(-2), ISO-2022-KR, ISO-2022-CN , it's not that hard (about
> the same as UTF-16, I believe, especially in case of   EUC-CN and EUC-KR)

I would move EUC-JP and EUC-TW, and possibly EUC-KR (if you use more than
KS X 1001 in it) to the "complicated" group because of the shifting bytes
required to get to different planes/character sets.


Thomas Chan
[EMAIL PROTECTED]





RE: Devanagari question

2000-11-15 Thread Ayers, Mike


> From: Rick McGowan [mailto:[EMAIL PROTECTED]]
>
> Mike Ayers wrote:
>
> > The last I knew,
> > computer-savvy Taiwan and Hong Kong were continuing to invent new
> > characters.  In the end, the onus is on the computer to
> support the user.
>
> Yes, the computer should support the user, but... The
> invention of new characters to serve multitudes is OK, and
> international standards will probably continue to support
> that.  But I don't think it's reasonable or appropriate to
> keep inventing new characters willy-nilly for individuals (as
> reported), and then expect them to be added to an
> international standard.  That's silly.  The onus is not on
> international standards to support the whimsical production
> of novel, rarely-used, or nonce characters of the type
> reported to be generated.

That is not established.  The degree to which computer or user will
dictate what will and will not be permitted has yet to be decided.
Certainly, I already have full support for any words that I care to make up
- I need merely spell them.  Since hanzi are words-as-characters, the issue
is much more cloudy, since the position of the Unicode specification (due to
the encoding method used) is that hanzi are characters-only.  This may not
be the final solution.

> In any case, I still have never seen actual documentary
> evidence that would prove to me that in fact Taiwan and Hong
> Kong *ARE* creating new characters at the drop of a hat.
> People just keep saying that to scare everyone.  Sounds like
> an urban myth to me.

Good point.  I will go seek a definitive answer.  Not much point in
discussing this if it doesn't really happen.


/|/|ike



Re: Sinhala Fonts

2000-11-15 Thread Antoine Leca

David Tooke wrote:
> 
> Does anyone know of a freely available font with Unicode encodings containing 
> characters in the Sinhala range (0D80-0DFF)?

" freely available " ... Challenging question, for sure.

 
> I can find several fonts with the character set, but none with Unicode encodings...
> they seem to map to the Latin range instead.

This is hardly surprising, since the Unicode encoding of Sinhala is so
recent that almost none of the Unicode rendering engines available a year
ago could deal with this script (and a number of them still cannot).

Anyway, I believe your best bet would be to look at Omega. Yannis produced
a Sinhala font years ago, back in the TeX period, and I believe he may have
adapted it to Omega. The mailing list for Omega is <mailto:[EMAIL PROTECTED]>,
and the web page is at <http://omega-system.sourceforge.net> (changed
recently).


Antoine



Re: Java and Unicode

2000-11-15 Thread Jungshik Shin

On Wed, 15 Nov 2000, Doug Ewell wrote:

> Elliotte Rusty Harold <[EMAIL PROTECTED]> wrote:
> 
> > There are a number of possibilities that don't break backwards 
> > compatibility (making trans-BMP characters require two chars rather 
> > than one, defining a new wchar primitive data type that is 4-bytes 
> > long as well as the old 2-byte char type, etc.) but they all make the 
> > language a lot less clean and obvious. In fact, they all more or less 

> This is one of the great difficulties in creating a "clean" design:
> making it flexible enough so that it remains clean even in the face of
> unexpected changes (like Unicode requiring more than 16 bits).
> 
> But was it really unexpected?  I wonder when the Java specification was
> written -- specifically, was it before or after Unicode and JTC1/SC2/WG2
> began talking openly about moving beyond 16 bits?

That's exactly what I have in mind about Java. I can't help wondering why
Sun chose a 2-byte char instead of a 4-byte char when it was plainly
obvious that 2 bytes wouldn't be enough in the very near future. The same
can be said of Mozilla, which as far as I know uses a BMP-only
representation internally. Was it due to concerns over things like saving
memory/storage?

Jungshik Shin




Re: Java and Unicode

2000-11-15 Thread Jungshik Shin

On Wed, 15 Nov 2000, Michael (michka) Kaplan wrote:

> In any case, I think that UTF-16 is the answer here.
> 
> Many people try to compare this to DBCS, but it really is not the same
> thing: understanding lead bytes and trail bytes in DBCS is *astoundingly*
> more complicated than handling surrogate pairs.

Well, it depends on which multibyte encoding you're talking about. In the
case of 'pure' EUC encodings (EUC-JP, EUC-KR, EUC-CN, EUC-TW), as opposed
to SJIS(Windows94?), Windows-949 (UHC), Windows-950, Windows-125x (JOHAB),
ISO-2022-JP(-2), ISO-2022-KR, and ISO-2022-CN, it's not that hard (about
the same as UTF-16, I believe, especially in the case of EUC-CN and
EUC-KR).
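Jungshik's point can be sketched in code: classifying a UTF-16 code unit as
a surrogate is a simple range test, and classifying bytes in a "pure" EUC
encoding is nearly as simple. This is only an illustrative sketch, not a
codec; the class and method names are invented, and the EUC-KR byte ranges
below are the common KS X 1001 ones assumed for illustration.

```java
public class UnitClassifier {
    // UTF-16: high surrogates are U+D800..U+DBFF, low surrogates U+DC00..U+DFFF.
    static boolean isHighSurrogate(char c) { return c >= 0xD800 && c <= 0xDBFF; }
    static boolean isLowSurrogate(char c)  { return c >= 0xDC00 && c <= 0xDFFF; }

    // "Pure" EUC-KR: a lead byte is 0xA1..0xFE, and the trail byte that
    // follows it is also 0xA1..0xFE; bytes below 0x80 are plain ASCII.
    static boolean isEucKrLead(int b)  { return b >= 0xA1 && b <= 0xFE; }
    static boolean isEucKrTrail(int b) { return b >= 0xA1 && b <= 0xFE; }

    public static void main(String[] args) {
        System.out.println(isHighSurrogate('\uD800')); // true
        System.out.println(isEucKrLead(0xB0));         // true
        System.out.println(isEucKrLead(0x41));         // false: ASCII 'A'
    }
}
```

In both cases a single range comparison settles the question; the harder
encodings in the list above (Shift-JIS, the ISO-2022 family) need state or
overlapping ranges, which is where the real DBCS complexity lives.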

Jungshik Shin




RE: Java and Unicode

2000-11-15 Thread Marco . Cimarosti

Elliotte Rusty Harold wrote:

> One thing I'm very curious about going forward: Right now character 
> values greater than 65535 are purely theoretical. However this will 
> change. It seems to me that handling these characters properly is 
> going to require redefining the char data type from two bytes to 
> four. This is a major incompatible change with existing Java.
> (...)

John O'Conner just wrote something about surrogates
(http://www.unicode.org/unicode/faq/utf_bom.html#16) and UTF-16
(http://www.unicode.org/unicode/faq/utf_bom.html#5) in Java, but your
message was probably already on its way:

> You can currently store UTF-16 in the String and StringBuffer classes.
> However, all operations are on char values or 16-bit code units. The
> upcoming release of the J2SE platform will include support for Unicode 3.0
> (maybe 3.0.1) properties, case mapping, collation, and character break
> iteration. There is no explicit support for surrogate pairs in Unicode at
> this time, although you can certainly find out if a code unit is a
> surrogate unit.
>
> In the future, as characters beyond 0xFFFF become more important, you can
> expect that more robust, official support will follow.
> 
> -- John O'Conner

_ Marco
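The arithmetic behind the surrogate support John describes is fixed by the
UTF-16 definition in the Unicode standard. A minimal sketch follows; the
class and method names are invented for illustration (they are not a Java
API), but the formulas are the standard ones.

```java
public class SurrogateMath {
    // Combine a high/low surrogate pair into a supplementary code point
    // (U+10000..U+10FFFF), per the UTF-16 definition.
    static int toCodePoint(char high, char low) {
        return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
    }

    // Split a supplementary code point back into its surrogate pair.
    static char[] toPair(int cp) {
        int v = cp - 0x10000;
        return new char[] { (char) (0xD800 + (v >> 10)),
                            (char) (0xDC00 + (v & 0x3FF)) };
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(toCodePoint('\uD800', '\uDC00'))); // 10000
        char[] pair = toPair(0x10FFFF);
        System.out.println(Integer.toHexString(pair[0])); // dbff
        System.out.println(Integer.toHexString(pair[1])); // dfff
    }
}
```

Note that the "find out if a code unit is a surrogate unit" check John
mentions is just a range test on the char before applying toCodePoint.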



Sinhala Fonts

2000-11-15 Thread David Tooke



Does anyone know of a freely available font with Unicode encodings
containing characters in the Sinhala range (0D80-0DFF)?

I can find several fonts with the character set, but none with Unicode
encodings... they seem to map to the Latin range instead.

Thanks in advance.

David Tooke


Re: Java and Unicode

2000-11-15 Thread Doug Ewell

Elliotte Rusty Harold <[EMAIL PROTECTED]> wrote:

> There are a number of possibilities that don't break backwards 
> compatibility (making trans-BMP characters require two chars rather 
> than one, defining a new wchar primitive data type that is 4-bytes 
> long as well as the old 2-byte char type, etc.) but they all make the 
> language a lot less clean and obvious. In fact, they all more or less 
> make Java feel like C and C++ feel when working with Unicode: like 
> something new has been bolted on after the fact, and it doesn't 
> really fit the old design.

This is one of the great difficulties in creating a "clean" design:
making it flexible enough so that it remains clean even in the face of
unexpected changes (like Unicode requiring more than 16 bits).

But was it really unexpected?  I wonder when the Java specification was
written -- specifically, was it before or after Unicode and JTC1/SC2/WG2
began talking openly about moving beyond 16 bits?

-Doug Ewell
 Fullerton, California



Re: Java and Unicode

2000-11-15 Thread Michael \(michka\) Kaplan

I do not think they are so theoretical, with both 10646 and Unicode
including them in the very near future (unless you count it as theoretical
when you drop an egg but it has not yet hit the ground!).

In any case, I think that UTF-16 is the answer here.

Many people try to compare this to DBCS, but it really is not the same
thing: understanding lead bytes and trail bytes in DBCS is *astoundingly*
more complicated than handling surrogate pairs.

michka

a new book on internationalization in VB at
http://www.i18nWithVB.com/

- Original Message -
From: "Elliotte Rusty Harold" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Wednesday, November 15, 2000 6:15 AM
Subject: Re: Java and Unicode


> One thing I'm very curious about going forward: Right now character
> values greater than 65535 are purely theoretical. However this will
> change. It seems to me that handling these characters properly is
> going to require redefining the char data type from two bytes to
> four. This is a major incompatible change with existing Java.
>
> There are a number of possibilities that don't break backwards
> compatibility (making trans-BMP characters require two chars rather
> than one, defining a new wchar primitive data type that is 4-bytes
> long as well as the old 2-byte char type, etc.) but they all make the
> language a lot less clean and obvious. In fact, they all more or less
> make Java feel like C and C++ feel when working with Unicode: like
> something new has been bolted on after the fact, and it doesn't
> really fit the old design.
>
> Are there any plans for handling this?
> --
>
> +---++---+
> | Elliotte Rusty Harold | [EMAIL PROTECTED] | Writer/Programmer |
> +---++---+
> |  The XML Bible (IDG Books, 1999)   |
> |  http://metalab.unc.edu/xml/books/bible/   |
> |   http://www.amazon.com/exec/obidos/ISBN=0764532367/cafeaulaitA/   |
> +--+-+
> |  Read Cafe au Lait for Java News:  http://metalab.unc.edu/javafaq/ |
> |  Read Cafe con Leche for XML News: http://metalab.unc.edu/xml/ |
> +--+-+
>




Lakota--Oops!

2000-11-15 Thread James E. Agenbroad

 Wednesday, November 14, 2000
Oh I see the long right leg is straight.  Sorry.

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )
 The above are purely personal opinions, not necessarily the official
views of any government or any agency of any.
Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Dev.Gp.4, Library
of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.  




Re: Lakota (was Re: OT: Devanagari question)

2000-11-15 Thread James E. Agenbroad

On Tue, 14 Nov 2000, Rick McGowan wrote:

> [EMAIL PROTECTED] wrote:
> 
> > Unfortunately, there's no corresponding LATIN CAPITAL LETTER N WITH LONG
> > RIGHT LEG, which Lakota needs.
> 
> To my knowledge, the discussion in September between John Cowan and Curtis
> Clark didn't terminate with any actual proposal, and I'm not clear on
> whether the above assertion is a fact.  I'm not saying I know anything
> about this field either.  Does Lakota REALLY need a letter that isn't in
> Unicode?
> 
> Are you in a position to provide documents and evidence, and/or make a 
> definite proposal for adding this character?  It would be a good thing
> to add, if it's really needed.
> 
>   Rick
> 
> 
 Wednesday, November 15, 2000
Page 311 under "Dakota (Sioux)" in Van Ostermann's Manual of foreign
languages (full citation in Unicode 3.0 on page 1008) shows both capital
and small N with long right leg curving to the left.  They both also
appear under Sioux on page 253 of Giliarevskii's Languages identification
guide (full citation in 3.0 at page 1005).  To me they look like U+014A
and U+014B (called 'eng').  Am I missing something?

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )
 The above are purely personal opinions, not necessarily the official
views of any government or any agency of any.
Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Dev.Gp.4, Library
of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.  




Re: Java and Unicode

2000-11-15 Thread Elliotte Rusty Harold

One thing I'm very curious about going forward: Right now character 
values greater than 65535 are purely theoretical. However this will 
change. It seems to me that handling these characters properly is 
going to require redefining the char data type from two bytes to 
four. This is a major incompatible change with existing Java.

There are a number of possibilities that don't break backwards 
compatibility (making trans-BMP characters require two chars rather 
than one, defining a new wchar primitive data type that is 4-bytes 
long as well as the old 2-byte char type, etc.) but they all make the 
language a lot less clean and obvious. In fact, they all more or less 
make Java feel like C and C++ feel when working with Unicode: like 
something new has been bolted on after the fact, and it doesn't 
really fit the old design.
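The first option above -- representing a trans-BMP character as two chars --
means code that walks a String by "character" has to advance two code units
over a surrogate pair. A hand-rolled sketch of what that looks like (the
class and method names are illustrative, not an existing Java API):

```java
public class CodePointWalk {
    // Count characters (code points) in a string that may contain
    // supplementary characters stored as surrogate pairs.
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            char c = s.charAt(i);
            // A high surrogate with a following unit counts as one
            // character occupying two chars.
            if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.length()) {
                i += 2;
            } else {
                i += 1;
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "A", one supplementary character as a surrogate pair, then "B".
        String s = "A\uD800\uDC00B";
        System.out.println(s.length());          // 4 code units
        System.out.println(countCodePoints(s));  // 3 characters
    }
}
```

The char type and all existing APIs stay unchanged; the cost is exactly the
"bolted on" feel described above, since length() and charAt() no longer
line up with the user's notion of a character.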

Are there any plans for handling this?
-- 

+---++---+
| Elliotte Rusty Harold | [EMAIL PROTECTED] | Writer/Programmer |
+---++---+
|  The XML Bible (IDG Books, 1999)   |
|  http://metalab.unc.edu/xml/books/bible/   |
|   http://www.amazon.com/exec/obidos/ISBN=0764532367/cafeaulaitA/   |
+--+-+
|  Read Cafe au Lait for Java News:  http://metalab.unc.edu/javafaq/ |
|  Read Cafe con Leche for XML News: http://metalab.unc.edu/xml/ |
+--+-+



(no subject)

2000-11-15 Thread nikita k

Hi,
Is there any text editor by which data can be entered
in Hindi?

Rgds,
Nikita K




Javascript code charts, unicode converter, show-characters

2000-11-15 Thread Mark Davis



I just made some fixes in my Javascript Unicode pages (insomnia again) that
may be of interest.

http://www.macchiato.com/unicode/convert.html has UTF, RACE and LACE
conversions, with a bit better error checking.

http://www.macchiato.com/unicode/charts.html has Unicode charts, plus a new
"filter" on the left.

http://www.macchiato.com/unicode/show.html lets you type or paste in
Unicode text, and see GIFs (in case fonts are missing).

Feedback is welcome, though I make no apologies for the simple GUI.

Mark

(Paul noted that someone had registered
"BQ--3AADEABQAAYAAMQAMYAGSADGABQ4NVFU3THPLTTUW" with VGRS.  My program says
it's an error -- it appears to have an extra W at the end.  The source text
appears to be hangul: 퀀퀲퀀퀰퀀퀰퀀퀲퀀큦퀀큩퀀큦퀀큡킶탔킴탔탎탵탎클)