RE: History of Kazakh characters in Unicode
I remember being shown in the ECMA bidi WG a document from China that specified the use of the Arabic script for Kazakh (I think it was Kazakh), which was somewhat different from ISO-8859-6 and ASMO. I remember they had fewer shapes.

Jony

> -----Original Message-----
> From: Michael (michka) Kaplan [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, November 16, 2000 6:41 AM
> To: Unicode List
> Subject: Re: History of Kazakh characters in Unicode
>
> Most of these characters came from existing standards that were included
> in Unicode, rather than separately requested character additions. There
> are some exceptions for Cyrillic, and possibly for Arabic, but that one
> I am not 100% sure about.
>
> But most of them have been there all along, based on compatibility with
> the original ISO 8859, MS, IBM, and other legacy code pages.
>
> michka
>
> a new book on internationalization in VB at http://www.i18nWithVB.com/
>
> ----- Original Message -----
> From: "Kairat A. Rakhim" <[EMAIL PROTECTED]>
> To: "Unicode List" <[EMAIL PROTECTED]>
> Sent: Wednesday, November 15, 2000 8:24 PM
> Subject: History of Kazakh characters in Unicode
>
> > Hello,
> >
> > I'm writing an article about the history of Kazakh and other Turkic
> > alphabets. Could you help me with the history of their inclusion in
> > Unicode? Who proposed the characters that are specific to Kazakh and
> > the other Turkic languages in the Arabic, Latin, and Cyrillic scripts?
> > How can I contact them?
> >
> > Thank you in advance,
> >
> > Kairat A. Rakhim,
> > Regional Universal Science Library of Karaganda,
> > KAZAKHSTAN
Re: History of Kazakh characters in Unicode
Most of these characters came from existing standards that were included in Unicode, rather than separately requested character additions. There are some exceptions for Cyrillic, and possibly for Arabic, but that one I am not 100% sure about.

But most of them have been there all along, based on compatibility with the original ISO 8859, MS, IBM, and other legacy code pages.

michka

a new book on internationalization in VB at http://www.i18nWithVB.com/

----- Original Message -----
From: "Kairat A. Rakhim" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Wednesday, November 15, 2000 8:24 PM
Subject: History of Kazakh characters in Unicode

> Hello,
>
> I'm writing an article about the history of Kazakh and other Turkic
> alphabets. Could you help me with the history of their inclusion in
> Unicode? Who proposed the characters that are specific to Kazakh and
> the other Turkic languages in the Arabic, Latin, and Cyrillic scripts?
> How can I contact them?
>
> Thank you in advance,
>
> Kairat A. Rakhim,
> Regional Universal Science Library of Karaganda,
> KAZAKHSTAN
History of Kazakh characters in Unicode
Hello,

I'm writing an article about the history of Kazakh and other Turkic alphabets. Could you help me with the history of their inclusion in Unicode? Who proposed the characters that are specific to Kazakh and the other Turkic languages in the Arabic, Latin, and Cyrillic scripts? How can I contact them?

Thank you in advance,

Kairat A. Rakhim,
Regional Universal Science Library of Karaganda,
KAZAKHSTAN
Re: Unicode not approved by China
Bjorn Stabell reported:

> http://linuxfab.cx/indexNewsData.php?NEWSID=2949&FIRSTHIT=1
>
> According to this news item (in Chinese), China rejected HK's
> application to use Unicode, and instead says they have to use
> ISO 10646-1:2000 or GB18030. Apparently they don't like to
> standardize on a standard controlled by an organization of
> commercial companies, like Unicode.

This is not an uncommon reaction among officious organizations that think only ISO or governments can create reliable, open standards. It is the basic reason why the Unicode Consortium goes to such lengths to guarantee that the Unicode Standard is *exactly* aligned with ISO 10646 (as noted repeatedly in the standard itself and on the Unicode website).

> This is confusing. Nobody implements ISO 10646-1:2000 as
> such, they just implement Unicode, right?

Right.

> I thought the two standards were equivalent?

They are. And we went the extra mile with JTC1/SC2/WG2 to ensure that ISO 10646-1:2000 and the Unicode Standard, Version 3.0, were not only equivalent, but also published more or less simultaneously, with the same publication year. The charts and name lists for the two standards were even driven off the same data sources and using the same suite of fonts, to guarantee synchronization.

> We're using Unicode because of practical reasons, because there's
> a lot of applications supporting it and it solves the character
> set problem. What do you suggest we do, being based in Beijing,
> China?

Implement 10646-1:2000 and tell the government of China that that is what you are doing. Of course, in order to implement 10646-1:2000, you will need an extensive set of guidelines on implementation issues. And I guess you know where to look for those.

> In December, the Chinese will go to Taiwan to try to settle on a
> common encoding.

Interesting.

--Ken Whistler
Unicode not approved by China
http://linuxfab.cx/indexNewsData.php?NEWSID=2949&FIRSTHIT=1

According to this news item (in Chinese), China rejected HK's application to use Unicode, and instead says they have to use ISO 10646-1:2000 or GB18030. Apparently they don't like to standardize on a standard controlled by an organization of commercial companies, like Unicode.

This is confusing. Nobody implements ISO 10646-1:2000 as such, they just implement Unicode, right? I thought the two standards were equivalent? We're using Unicode because of practical reasons, because there's a lot of applications supporting it and it solves the character set problem. What do you suggest we do, being based in Beijing, China?

In December, the Chinese will go to Taiwan to try to settle on a common encoding.

Kind regards,
--
Bjorn Stabell <[EMAIL PROTECTED]>
Exoweb - One-to-one web solutions
w http://www.exoweb.net/
t +86 13701174004
Re: [idn] Javascript code charts, unicode converter, show-characters
I believe that result is incorrect. The RACE has 48 bytes, so 44 bytes of Base32. That translates to 44 * 5 bits = 220 bits, or 27 bytes of compressed UTF-16. That must represent *at least* 13 UTF-16 characters, but the enclosed file only has 5 Hangul Syllables. If that was generated programmatically, the program is wrong.

Mark

----- Original Message -----
From: J. William Semich
To: Rick H Wesson; Mark Davis
Cc: Unicore; Unicode; [EMAIL PROTECTED]; w3c-i18n-ig
Sent: Wednesday, November 15, 2000 09:32
Subject: Re: [idn] Javascript code charts, unicode converter, show-characters

Here's the UTF-8 encoding of the Hangul (attached).

Bill Semich
WorldNames, Inc

At 08:54 AM 11/15/00 -0800, Rick H Wesson wrote:
> On Wed, 15 Nov 2000, Mark Davis wrote:
> > (Paul noted that someone had registered
> > "BQ--3AADEABQAAYAAMQAMYAGSADGABQ4NVFU3THPLTTUW" with VGRS. My program
> > says it's an error -- it appears to have an extra W at the end. The
> > source text appears to be hangul:
> > [Hangul syllables, garbled in transmission])
>
> BQ--3AADEABQAAYAAMQAMYAGSADGABQ4NVFU3THPLTTUW is not registered, however;
> BQ--3AADEABQAAYAAMQAMYAGSADGABQ4NVFU3THPLTTUWTOA.COM is registered.
>
> Domain Name: BQ--3AADEABQAAYAAMQAMYAGSADGABQ4NVFU3THPLTTUWTOA.COM
> Registrar: GABIA, INC.
> Whois Server: whois.name7.com
> Referral URL: www.name7.com
> Name Server: NS1.NAME7.COM
> Name Server: NS2.NAME7.COM
> Updated Date: 10-nov-2000
>
> -rick

Bill Semich
President and Founder
WorldNames, Inc.
http://www.worldnames.net
[EMAIL PROTECTED]
Re: Java and Unicode
Please let's keep types for single characters and types for strings separate.

ICU used to be in the same situation as Java: everything character/string used 16-bit types. Staying with UTF-16, we decided to keep the string base type at 16 bits for very good reasons like interoperability and memory consumption. For single characters, ICU changed APIs from 16-bit to 32-bit types.

In the case of Java, the equivalent course of action would be to stick with a 16-bit char as the base type for strings. The int type could be used in _additional_ APIs for single Unicode code points, deprecating the old APIs with char. Whatever Sun decides to do with single characters, it will be most reasonable to keep the string encoding the same and just treat it as UTF-16 where that makes a difference.

For details, see my presentation at the IUC 17 Unicode conference (2000 September, session B2). (See http://www.unicode.org/ - I am having some trouble with web access right now, so I cannot give you the URL...)

markus
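[Editor's note, for context: the split Markus describes - 16-bit code units as the string base type, 32-bit ints in additional APIs for single code points - is exactly the model Java itself later adopted in J2SE 5.0, years after this thread. A minimal sketch in modern Java; the class name and sample string are mine:

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF lies above the BMP, so in a
        // UTF-16 string it occupies two 16-bit code units (a surrogate pair).
        String s = "A\uD834\uDD1EB";

        System.out.println(s.length());                      // 4 code units
        System.out.println(s.codePointCount(0, s.length())); // 3 code points

        // Iterate by code point, not by char: advance charCount(cp) units.
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
            System.out.printf("U+%04X%n", s.codePointAt(i));
        }
    }
}
```

The string type stays 16-bit throughout; only the single-character API (codePointAt, which returns an int) widens to 32 bits, just as Markus proposes.]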
Fwd: Changes proposed for Tamil
Dear Chris Fynn,

This font should be tried without Uniscribe support and without any other fully conformant Tamil fonts, to understand the scientific principles behind the current recommendations. Of course, in real use these supports are essential. I'll soon be publishing the new version of the font for real use, where the opportunity to see the raw data will be reduced.

Sinnathurai Srivas

<< The proposal is not acceptable. The current state of allocations is based on scientific principles. The new proposal is of a usage-based principle. It is not only "ai" (I guess it's not AI as described below), but also au, e, ee, o, oo that have similar characteristics. Unless all of these are changed to usage-based ones, it should not be accepted as a solution/change. A mixed solution is a recipe for disaster. Contextual is only acceptable if all of the recommendations are contextual-based.

In my opinion the current scientific solution should be kept intact, so that the process handling the language becomes sophisticated, as it is scientifically based. As I do not have detailed information on this proposal, if my assumption about the proposal is wrong please correct me.

I have published a Tamil Unicode font for test purposes. This proposal and the other characters I mentioned above may be better understood by visiting (please do not visit if you do not wish to view sicML) http://www.geocities.com/avarangal/tamilunicode.html

Sinnathurai Srivas

< They involve a change to the contextual processing model involving the AI vowel. John F. >>

Dear Chris Fynn, here it is. Though intended, somehow it missed the Unicode list before.

<<< Perhaps it would be more worthwhile to discuss the proposed changes to Tamil on the Unicode list [EMAIL PROTECTED] rather than on the OpenType list - at the very least some mention should be made on the Unicode list as well. So far nothing seems to have been said there about these proposed changes.
BTW Though I don't read Tamil, I seem to get your web page rendered correctly in all but one or two places without your font installed (I do have Microsoft's Tamil IME installed under Win 2K).

Chris Fynn
Dzongkha Computing Project
Thimphu, Bhutan

url address corrected: http://www.geocities.com/avarangal/tamilunicode.html

<< The proposal is not acceptable. The current state of allocations is based on scientific principles. The new proposal is of a usage-based principle. It is not only "ai" (I guess it's not AI as described below), but also au, e, ee, o, oo that have similar characteristics. Unless all of these are changed to usage-based ones, it should not be accepted as a solution/change. A mixed solution is a recipe for disaster. Contextual is only acceptable if all of the recommendations are contextual-based. In my opinion the current scientific solution should be kept intact, so that the process handling the language becomes sophisticated, as it is scientifically based. As I do not have detailed information on this proposal, if my assumption about the proposal is wrong please correct me.

I have published a Tamil Unicode font for test purposes. This proposal and the other characters I mentioned above may be better understood by visiting (please do not visit if you do not wish to view sickML) http://www.geocities.com/avarangal/tamilunicode.html

Sinnathurai Srivas

< They involve a change to the contextual processing model involving the AI vowel. John F. >> >>
Re: Java and Unicode
John O'Conner wrote:

> Yes. If you have been involved with Unicode for any period of time at all,
> you would know that the Unicode consortium has advertised Unicode's 16-bit
> encoding for a long, long time, even in its latest Unicode 3.0 spec. The
> Unicode 3.0 spec clearly favors the 16-bit encoding of Unicode code units,
> and the design chapter (chapter 2) never even hints at a 32-bit encoding
> form.

Indeed. Though, to be fair, people have been talking about UCS-4 and then UTF-32 for quite a while now, and the UTF-32 Technical Report has been approved for half a year.

FYI, on November 9, the Unicode Technical Committee officially voted to make Unicode Technical Report #19 "UTF-32" a Unicode Standard Annex (UAX). This will be effective with the rollout of the Unicode Standard, Version 3.1, and will make the 32-bit transformation format a coequal partner with UTF-16 and UTF-8 as sanctioned Unicode encoding forms.

> The previous 2.0 spec (and previous specs as well) promoted this 16-bit
> encoding too...and even claimed that Unicode was a 16-bit, "fixed-width",
> coded character set. There are lots of reasons why Java's char is a 16-bit
> value...the fact that the Unicode Consortium itself has promoted and
> defined Unicode as a 16-bit coded character set for so long is probably
> the biggest.

It is easy to look back from the year 2000 and wonder why. But it is also important to remember the context of 1989-1991. During that time frame, the loudest complaints were from those who were proclaiming that Unicode's move from 8-bit to 16-bit characters would break all software, choke the databases, inflate all documents by a factor of two, and generally end the world as we knew it. As it turns out, they were wrong on all counts. But the rhetorical structure of the Unicode Standard was initially set up to be a hard sell for 16-bit characters *as opposed to* 8-bit characters.

The implementation world has moved on. Now we have an encoding model for Unicode that embraces an 8-bit, a 16-bit, *and* a 32-bit encoding form, while acknowledging that the character encoding per se is effectively 21 bits. This is more complicated than we hoped for originally, of course, but I think most of us agree that the incremental complexity in encoding forms is a price we are willing to pay in order to have a single character encoding standard that can interoperate in 8-, 16-, and 32-bit environments.

--Ken
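[Editor's note, for context: the three coequal encoding forms Ken describes can be checked mechanically. A small sketch using Java NIO charset APIs (which postdate this thread; the class name is mine), showing how one supplementary character comes out in each form:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingForms {
    public static void main(String[] args) {
        // A single character above the BMP: U+1D11E (code point fits in 21 bits).
        String s = new StringBuilder().appendCodePoint(0x1D11E).toString();

        byte[] utf8  = s.getBytes(StandardCharsets.UTF_8);      // 4 bytes
        byte[] utf16 = s.getBytes(StandardCharsets.UTF_16BE);   // 2 units = 4 bytes
        byte[] utf32 = s.getBytes(Charset.forName("UTF-32BE")); // 1 unit  = 4 bytes

        System.out.println(utf8.length + " " + utf16.length + " " + utf32.length);
    }
}
```

The same abstract character, three interoperable byte forms - which is the whole point of the encoding model Ken outlines.]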
Persian decimal separator
Dear All,

Some time ago, there was a discussion here about the Persian decimal separator. I am posting a short report about our queries to different Iranian bodies. Sorry for the long and somewhat formal thing, but it seems important to us.

I'm still waiting for responses from the Iranian Academy of Sciences (IAS) and the Iranian Mathematical Society (IMS). I have answers from these sources:

* Iranian Academy for Persian Language and Literature (IAPLL);
* Iranian Standards and Industrial Research Institute (ISIRI), which is the national standard body;
* Iran University Press (IUP) and Fatemi Publishing Institute (FPI), which are the largest and highest-quality academic publishing houses in Iran.

I think that IMS will answer the same as FPI, since they seem to use the same conventions in their books that are not published by either of these two houses. They certainly use the house rules when they publish with one of these two, but not with other houses.

I also add our conclusions, as current representatives of HCI (Iranian High Council of Informatics) in text encoding issues, which is the responsible body for national computing-related standards, transferred to it from ISIRI.

1. All sources agree that the slash and the decimal separator should be considered different.

2. ISIRI has a character set in their standards (ISIRI 3342, the rarely-used national standard) which distinguishes the two characters, while not distinguishing hyphen from minus or colon from division sign (of which the latter case is really weird). They did not give any special comments regarding the standard, since the standards committee for character set issues has been dissolved for a long time, and the responsibility was handed to the HCI. The standard shows the glyph for the decimal separator as described in item 4. They also have another standard (ISIRI 2901-revised:1994) for keyboards, which distinguishes the two characters.

3. IUP and FPI have both used the same publishing software, which distinguishes these, for more than five years, and IUP distinguished them even before that time. They both agree that the sequence ONE SLASH TWO means 0.5 and not 1.2. They especially say this because of the need for clear interpretation of in-text formulas. (IAPLL sees this interpretation as lying beyond its competence, and referred us to the IAS.) IUP has also published a scientific style guide which explicitly mentions the difference, and asks for a glyph shape described in the last part of the next item (I can provide you with copies of the page mentioning this). We also use software that distinguishes these.

4. All except IUP agree that the glyph shape for the decimal separator should be a shortened, lowered, and possibly more slanted slash. But IUP has changed the default behaviour of the mentioned software to use a glyph exactly similar to the isolated form of REH (U+0631) for the decimal separator. This has been the case even in their old books, before their adoption of computer software for publishing. But the IUP recommendation in this case is considered old tradition by others, including us, and not acceptable. (I can provide digital images of text produced by FPI, IUP, and ourselves.)

5. All except IUP agree that if the decimal separator is lacking in the software, a slash is the best substitute. IUP prefers the REH shape in all cases. FPI insisted that using the slash for both division and decimal separation is unbearable, and said that in the case of a lacking decimal separator glyph, all the text should be scanned for uses of the slash as a division symbol, and those cases transformed into two-dimensional fractions.

6. In the case of a missing Persian shape for the decimal separator, IUP and FPI (and also we) prefer the Arabic shape over the slash. IUP may also prefer the Arabic shape over their REH shape, but that's not verified yet. IAPLL prefers the slash over the Arabic glyph.

7. All sources agree that for date separation, one should only use the slash.

As a final conclusion: In the case of information interchange, when the character set permits, a decimal separator (U+066B) is certainly preferred to a slash (U+002F) and must be used. Computer programs should render the Persian U+066B as a shortened, lowered, and possibly more slanted slash; this should be distinguishable from the slash at first sight (I can provide examples). If the Persian shape is lacking and the text context is mathematical, the Arabic shape must be used. In other cases, the slash shape is acceptable (but will be considered illiterate or nonprofessional, somewhat similar to using spaces instead of zero width non-joiners).

(We have not yet received enough responses to our queries about the thousands separator, but it seems that there will be a lot of disagreement about this. I can only tell that the national character set and
Re: Java and Unicode
Jungshik Shin wrote:

> That's exactly what I have in mind about Java. I can't help wondering why
> Sun chose 2byte char instead of 4byte char when it was plainly obvious
> that 2byte wouldn't be enough in the very near future. The same can be
> said of Mozilla which internally uses BMP-only as far as I know.
> Was it due to concerns over things like saving memory/storage, etc?

Yes. If you have been involved with Unicode for any period of time at all, you would know that the Unicode consortium has advertised Unicode's 16-bit encoding for a long, long time, even in its latest Unicode 3.0 spec. The Unicode 3.0 spec clearly favors the 16-bit encoding of Unicode code units, and the design chapter (chapter 2) never even hints at a 32-bit encoding form. The Java char attempts to capture the basic encoding unit of this 16-bit, widely accepted encoding method. I'm sure the choice seemed plainly obvious at the time.

The previous 2.0 spec (and previous specs as well) promoted this 16-bit encoding too...and even claimed that Unicode was a 16-bit, "fixed-width", coded character set. There are lots of reasons why Java's char is a 16-bit value...the fact that the Unicode Consortium itself has promoted and defined Unicode as a 16-bit coded character set for so long is probably the biggest.

-- John O'Conner
Re: Hindi editor
>>> On Wed, 2000 Nov 15 05:18:24 -0800 (GMT-0800) nikita k
>>> <[EMAIL PROTECTED]> wrote:
>>> Is there any text editor by which data can be entered
>>> in Hindi?
>>>
>>> Rgds,
>>> Nikita K

You could use Nisus Writer. However, we currently have an unsolved bug in our support of Hindi, as the insertion point is not always in the correct location, and double-clicking a "word" does not (necessarily) select said word, etc., but "data entry" works if you have the correct Language Kit.

John G. Otto, Nisus Software, Engineering
www.infoclick.com www.mathhelp.com www.nisus.com software4usa.com
EasyAlarms PowerSleuth NisusEMail NisusWriter MailKeeper QUED/M
My opinions are probably not those of Nisus Software, Inc.
Re: Java and Unicode
On Wednesday, November 15, 2000, at 12:08 PM, Roozbeh Pournader wrote:

> On Wed, 15 Nov 2000, Michael (michka) Kaplan wrote:
>
> > I do not think they are so theoretical, with both 10646 and Unicode
> > including them in the very near future (unless you count it as
> > theoretical when you drop an egg but it has not yet hit the ground!).
>
> Lemme think. You're saying that when I have not even seen a single egg
> hitting the ground, I should believe that it will hit some day? ;)

Well, you should be expecting about 45,000 eggs within the next six months.
RE: sort of OT: politics and scripts
The Soviet language policies under both Lenin and Stalin were amazing in what they managed to change in a very short time, especially considering the scripts first shifted from Arabic to Latin, then just a decade or so later to Cyrillic.

I too have been wondering when there would be a movement in the post-Soviet, Central Asian countries away from Cyrillic; my assumption has always been that they would want to return to Arabic (or for others, back to their indigenous scripts). Surprisingly, however, in our NLS implementation, the movement is away from Cyrillic, as you noted, but towards Latin rather than Arabic. We've seen this in Azeri and Uzbek, in that we support both Cyrillic and Latin, with other Central Asian languages likely to use the same script support. When asking our language sources and specialists about eventual migration to Arabic, there seems to be much less interest in it compared to Latin.

So it might be the case that there is interest in "extended Arabic" from a historical perspective, but not from a current IT perspective. (Then again, world events play a much greater role in language policy than can often be anticipated, and the trend could change very quickly and this could all be moot...)

Cathy

-----Original Message-----
From: Elaine Keown [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 15, 2000 11:53 AM
To: Unicode List
Subject: sort of OT: politics and scripts

Hello,

A similar question to the question of new Chinese characters and new versions of characters for Lakota, but an order of magnitude larger, is the question of ongoing or about-to-hit-us script changes in Central Asia. In the 1920s-1940s, under a series of Soviet language policy changes, many Central Asian languages were converted from Arabic script to Roman to Cyrillic (or some different permutation even). Jewish Central Asian languages were converted from Hebrew to Cyrillic.
Now as the independent republics take control, there is evidence that the abandonment of Cyrillic has started, and there is a return to Arabic script. But not "plain vanilla" Arabic script; rather, the extended Arabic scripts with extra symbols. This gives Unicode an odd "legacy code" problem, indeed.

---Elaine Keown
Re: Java and Unicode
On Wed, 15 Nov 2000, Thomas Chan wrote:

> On Wed, 15 Nov 2000, Jungshik Shin wrote:
>
> > On Wed, 15 Nov 2000, Michael (michka) Kaplan wrote:
> >
> > > Many people try to compare this to DBCS, but it really is not the same
> > > thing; understanding lead bytes and trail bytes in DBCS is
> > > *astoundingly* more complicated than handling surrogate pairs.
> >
> > Well, it depends on what multibyte encoding you're talking about. In
> > case of 'pure' EUC encodings (EUC-JP, EUC-KR, EUC-CN, EUC-TW) as opposed
> > to SJIS(Windows94?), Windows-949(UHC), Windows-950, Windows-125x(JOHAB),
> > ISO-2022-JP(-2), ISO-2022-KR, ISO-2022-CN, it's not that hard (about
> > the same as UTF-16, I believe, especially in case of EUC-CN and EUC-KR)
>
> I would move EUC-JP and EUC-TW, and possibly EUC-KR (if you use more than
> KS X 1001 in it) to the "complicated" group because of the shifting bytes
> required to get to different planes/character sets.

Well, EUC-KR has never used character sets other than US-ASCII (or its Korean variant KS X 1003) and KS X 1001, although a theoretical possibility is there. A more realistic (although very rarely used; there are only two known implementations: Hanterm - a Korean xterm - and Mozilla) complication for EUC-KR arises not from a third character set (KS X 1002) in EUC-KR, but from the 8-byte-sequence representation of the (11172 - 2350) Hangul syllables not covered by the repertoire of KS X 1001.

As for EUC-JP (which uses JIS X 0201/US-ASCII, JIS X 0208, and JIS X 0212) and EUC-TW, I know what you're saying. That's exactly why I added at the end of my prev. message 'especially in case of EUC-CN and EUC-KR' :-) Probably, I should have written that among multibyte encodings at least EUC-CN and EUC-KR are as easy to handle as UTF-16.

Jungshik Shin
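[Editor's note, for context: the property Thomas and Jungshik are circling is self-synchronization. A UTF-16 code unit announces, by its own value alone, whether it is a high surrogate, a low surrogate, or a complete BMP character; in EUC-KR, lead and trail bytes share the 0xA1-0xFE range, so a lone byte cannot be classified without scanning from a known boundary. A sketch of that contrast (class and method names are mine; the EUC-KR byte ranges stated are the common KS X 1001 ones):

```java
public class SelfSync {
    // UTF-16: classification of a unit depends only on the unit itself.
    static String classifyUtf16(char u) {
        if (u >= 0xD800 && u <= 0xDBFF) return "high surrogate";
        if (u >= 0xDC00 && u <= 0xDFFF) return "low surrogate";
        return "BMP character";
    }

    // EUC-KR (KS X 1001 over ASCII): lead and trail bytes both fall in
    // 0xA1-0xFE, so a lone byte in that range is ambiguous; you must
    // rescan from a known character boundary to resolve it.
    static String classifyEucKrByte(int b) {
        if (b < 0x80) return "ASCII";
        if (b >= 0xA1 && b <= 0xFE) return "lead or trail (ambiguous)";
        return "invalid";
    }

    public static void main(String[] args) {
        System.out.println(classifyUtf16('\uD834'));  // high surrogate
        System.out.println(classifyUtf16('\uDD1E'));  // low surrogate
        System.out.println(classifyUtf16('A'));       // BMP character
        System.out.println(classifyEucKrByte(0xB0));  // lead or trail (ambiguous)
    }
}
```

This is why a UTF-16 stream can be resynchronized after a lost unit, while a damaged EUC stream can misinterpret everything up to the next ASCII byte.]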
Re: Java and Unicode
On Wed, 15 Nov 2000, Michael (michka) Kaplan wrote:

> I do not think they are so theoretical, with both 10646 and Unicode
> including them in the very near future (unless you count it as theoretical
> when you drop an egg but it has not yet hit the ground!).

Lemme think. You're saying that when I have not even seen a single egg hitting the ground, I should believe that it will hit some day? ;)
sort of OT: politics and scripts
Hello,

A similar question to the question of new Chinese characters and new versions of characters for Lakota, but an order of magnitude larger, is the question of ongoing or about-to-hit-us script changes in Central Asia. In the 1920s-1940s, under a series of Soviet language policy changes, many Central Asian languages were converted from Arabic script to Roman to Cyrillic (or some different permutation even). Jewish Central Asian languages were converted from Hebrew to Cyrillic.

Now as the independent republics take control, there is evidence that the abandonment of Cyrillic has started, and there is a return to Arabic script. But not "plain vanilla" Arabic script; rather, the extended Arabic scripts with extra symbols. This gives Unicode an odd "legacy code" problem, indeed.

---Elaine Keown
Re: Java and Unicode
On Wed, 15 Nov 2000, Jungshik Shin wrote:

> On Wed, 15 Nov 2000, Michael (michka) Kaplan wrote:
> > In any case, I think that UTF-16 is the answer here.
> >
> > Many people try to compare this to DBCS, but it really is not the same
> > thing; understanding lead bytes and trail bytes in DBCS is
> > *astoundingly* more complicated than handling surrogate pairs.
>
> Well, it depends on what multibyte encoding you're talking about. In case
> of 'pure' EUC encodings (EUC-JP, EUC-KR, EUC-CN, EUC-TW) as opposed to
> SJIS(Windows94?), Windows-949(UHC), Windows-950, Windows-125x(JOHAB),
> ISO-2022-JP(-2), ISO-2022-KR, ISO-2022-CN, it's not that hard (about
> the same as UTF-16, I believe, especially in case of EUC-CN and EUC-KR)

I would move EUC-JP and EUC-TW, and possibly EUC-KR (if you use more than KS X 1001 in it) to the "complicated" group because of the shifting bytes required to get to different planes/character sets.

Thomas Chan
[EMAIL PROTECTED]
RE: Devanagari question
> From: Rick McGowan [mailto:[EMAIL PROTECTED]]
>
> Mike Ayers wrote:
>
> > The last I knew, computer-savvy Taiwan and Hong Kong were continuing
> > to invent new characters. In the end, the onus is on the computer to
> > support the user.
>
> Yes, the computer should support the user, but... The invention of new
> characters to serve multitudes is OK, and international standards will
> probably continue to support that. But I don't think it's reasonable or
> appropriate to keep inventing new characters willy-nilly for individuals
> (as reported), and then expect them to be added to an international
> standard. That's silly. The onus is not on international standards to
> support the whimsical production of novel, rarely-used, or nonce
> characters of the type reported to be generated.

That is not established. The degree to which computer or user will dictate what will and will not be permitted has yet to be decided. Certainly, I already have full support for any words that I care to make up - I need merely spell them. Since hanzi are words-as-characters, the issue is much more cloudy, since the position of the Unicode specification (due to the encoding method used) is that hanzi are characters-only. This may not be the final solution.

> In any case, I still have never seen actual documentary evidence that
> would prove to me that in fact Taiwan and Hong Kong *ARE* creating new
> characters at the drop of a hat. People just keep saying that to scare
> everyone. Sounds like an urban myth to me.

Good point. I will go seek a definitive answer. Not much point in discussing this if it doesn't really happen.

/|/|ike
Re: Sinhala Fonts
David Tooke wrote:

> Does anyone know of a freely available font with Unicode encodings
> containing characters in the Sinhala range (0D80-0DFF)?

"Freely available"... Challenging question, for sure.

> I can find several fonts with the character set, but none with Unicode
> encodings... they seem to map to the Latin range instead.

This is hardly surprising, since the Unicode encoding of Sinhala is so recent that almost none of the Unicode rendering engines available a year ago were able to deal with this script (and a number of them are still not able to).

Anyway, I believe your best bet would be to look at Omega. Yannis produced a Sinhala font years ago, back in the TeX period, and I believe he could have adapted it to Omega. The mailing list for Omega is <mailto:[EMAIL PROTECTED]>, and the web page is at <http://omega-system.sourceforge.net> (changed recently).

Antoine
Re: Java and Unicode
On Wed, 15 Nov 2000, Doug Ewell wrote:

> Elliotte Rusty Harold <[EMAIL PROTECTED]> wrote:
>
> > There are a number of possibilities that don't break backwards
> > compatibility (making trans-BMP characters require two chars rather
> > than one, defining a new wchar primitive data type that is 4-bytes
> > long as well as the old 2-byte char type, etc.) but they all make the
> > language a lot less clean and obvious. In fact, they all more or less
>
> This is one of the great difficulties in creating a "clean" design:
> making it flexible enough so that it remains clean even in the face of
> unexpected changes (like Unicode requiring more than 16 bits).
>
> But was it really unexpected? I wonder when the Java specification was
> written -- specifically, was it before or after Unicode and JTC1/SC2/WG2
> began talking openly about moving beyond 16 bits?

That's exactly what I have in mind about Java. I can't help wondering why Sun chose a 2-byte char instead of a 4-byte char when it was plainly obvious that 2 bytes wouldn't be enough in the very near future. The same can be said of Mozilla, which internally uses BMP-only as far as I know. Was it due to concerns over things like saving memory/storage, etc?

Jungshik Shin
Re: Java and Unicode
On Wed, 15 Nov 2000, Michael (michka) Kaplan wrote:

> In any case, I think that UTF-16 is the answer here.
>
> Many people try to compare this to DBCS, but it really is not the same
> thing: understanding lead bytes and trail bytes in DBCS is *astoundingly*
> more complicated than handling surrogate pairs.

Well, it depends on what multibyte encoding you're talking about. In the case of 'pure' EUC encodings (EUC-JP, EUC-KR, EUC-CN, EUC-TW), as opposed to SJIS, Windows-949 (UHC), Windows-950, Windows-1361 (Johab), ISO-2022-JP(-2), ISO-2022-KR, or ISO-2022-CN, it's not that hard (about the same as UTF-16, I believe, especially in the case of EUC-CN and EUC-KR).

Jungshik Shin
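[A minimal sketch of Jungshik's point, not from either poster; the class and method names are mine. In 'pure' EUC, such as EUC-KR, a lead byte falls in 0xA1-0xFE and always begins a fixed two-byte unit; in UTF-16 a high surrogate falls in 0xD800-0xDBFF and always begins a two-unit pair. Both classifications are a single range check:]

```java
public class LeadUnitCheck {
    // Pure EUC (e.g. EUC-KR): any byte in 0xA1..0xFE starts a
    // two-byte character; bytes below 0x80 are plain ASCII.
    static boolean isEucKrLead(int b) {        // b = unsigned byte value 0..255
        return b >= 0xA1 && b <= 0xFE;
    }

    // UTF-16: any code unit in 0xD800..0xDBFF starts a surrogate pair.
    static boolean isHighSurrogate(char c) {
        return c >= 0xD800 && c <= 0xDBFF;
    }

    public static void main(String[] args) {
        System.out.println(isEucKrLead(0xB0));              // true: lead byte
        System.out.println(isHighSurrogate((char) 0xD800)); // true: starts a pair
        System.out.println(isEucKrLead(0x41));              // false: ASCII 'A'
    }
}
```

[The complications michka alludes to arise in encodings such as Shift-JIS, where trail bytes overlap the single-byte range and you cannot resynchronize from an arbitrary position; pure EUC and UTF-16 both avoid that.]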
RE: Java and Unicode
Elliotte Rusty Harold wrote:

> One thing I'm very curious about going forward: Right now character
> values greater than 65535 are purely theoretical. However this will
> change. It seems to me that handling these characters properly is
> going to require redefining the char data type from two bytes to
> four. This is a major incompatible change with existing Java.
> (...)

John O'Conner just wrote something about surrogates (http://www.unicode.org/unicode/faq/utf_bom.html#16) and UTF-16 (http://www.unicode.org/unicode/faq/utf_bom.html#5) in Java, but your message was probably already on its way:

> You can currently store UTF-16 in the String and StringBuffer classes.
> However, all operations are on char values or 16-bit code units. The
> upcoming release of the J2SE platform will include support for Unicode
> 3.0 (maybe 3.0.1) properties, case mapping, collation, and character
> break iteration. There is no explicit support for surrogate pairs in
> Unicode at this time, although you can certainly find out if a code
> unit is a surrogate unit.
>
> In the future, as characters beyond 0xFFFF become more important, you
> can expect that more robust, official support will follow.
>
> -- John O'Conner

Marco
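[To illustrate what "no explicit support for surrogate pairs" means in practice, here is a hedged sketch, mine rather than anything from the J2SE libraries of the time, of counting code points by hand under the 16-bit char model; the class and method names are hypothetical:]

```java
public class CodePointCount {
    // Count Unicode code points in a String, treating a valid
    // high+low surrogate pair as a single character.
    static int countCodePoints(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0xD800 && c <= 0xDBFF            // high surrogate...
                    && i + 1 < s.length()
                    && s.charAt(i + 1) >= 0xDC00
                    && s.charAt(i + 1) <= 0xDFFF) {
                i++;                                   // ...skip its low half
            }
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // "A" plus one supplementary character: 3 chars, 2 code points.
        String s = "A\uD800\uDC00";
        System.out.println(s.length());             // 3 code units
        System.out.println(countCodePoints(s));     // 2 characters
    }
}
```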
Sinhala Fonts
Does anyone know of a freely available font with Unicode encodings containing characters in the Sinhala range (0D80-0DFF)?

I can find several fonts with the character set, but none with Unicode encodings... they seem to map to the Latin range instead.

Thanks in advance.

David Tooke
Re: Java and Unicode
Elliotte Rusty Harold <[EMAIL PROTECTED]> wrote:

> There are a number of possibilities that don't break backwards
> compatibility (making trans-BMP characters require two chars rather
> than one, defining a new wchar primitive data type that is 4-bytes
> long as well as the old 2-byte char type, etc.) but they all make the
> language a lot less clean and obvious. In fact, they all more or less
> make Java feel like C and C++ feel when working with Unicode: like
> something new has been bolted on after the fact, and it doesn't
> really fit the old design.

This is one of the great difficulties in creating a "clean" design: making it flexible enough so that it remains clean even in the face of unexpected changes (like Unicode requiring more than 16 bits).

But was it really unexpected? I wonder when the Java specification was written -- specifically, was it before or after Unicode and JTC1/SC2/WG2 began talking openly about moving beyond 16 bits?

-Doug Ewell
Fullerton, California
Re: Java and Unicode
I do not think they are so theoretical, with both 10646 and Unicode including them in the very near future (unless you count it as theoretical when you drop an egg but it has not yet hit the ground!).

In any case, I think that UTF-16 is the answer here.

Many people try to compare this to DBCS, but it really is not the same thing: understanding lead bytes and trail bytes in DBCS is *astoundingly* more complicated than handling surrogate pairs.

michka

a new book on internationalization in VB at http://www.i18nWithVB.com/

----- Original Message -----
From: "Elliotte Rusty Harold" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Wednesday, November 15, 2000 6:15 AM
Subject: Re: Java and Unicode

> One thing I'm very curious about going forward: Right now character
> values greater than 65535 are purely theoretical. However this will
> change. It seems to me that handling these characters properly is
> going to require redefining the char data type from two bytes to
> four. This is a major incompatible change with existing Java.
>
> There are a number of possibilities that don't break backwards
> compatibility (making trans-BMP characters require two chars rather
> than one, defining a new wchar primitive data type that is 4-bytes
> long as well as the old 2-byte char type, etc.) but they all make the
> language a lot less clean and obvious. In fact, they all more or less
> make Java feel like C and C++ feel when working with Unicode: like
> something new has been bolted on after the fact, and it doesn't
> really fit the old design.
>
> Are there any plans for handling this?
> --
> Elliotte Rusty Harold | [EMAIL PROTECTED] | Writer/Programmer
> The XML Bible (IDG Books, 1999)
> http://metalab.unc.edu/xml/books/bible/
> http://www.amazon.com/exec/obidos/ISBN=0764532367/cafeaulaitA/
> Read Cafe au Lait for Java News: http://metalab.unc.edu/javafaq/
> Read Cafe con Leche for XML News: http://metalab.unc.edu/xml/
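[To make michka's comparison concrete, here is a minimal sketch, mine and not from the thread, of the surrogate-pair arithmetic that UTF-16 requires; this really is the entire "multibyte" machinery. The class and method names are hypothetical:]

```java
public class SurrogatePair {
    // Combine a UTF-16 high/low surrogate pair into a Unicode scalar value.
    // high is in 0xD800..0xDBFF, low is in 0xDC00..0xDFFF.
    static int combine(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        // D800 DC00 is the first pair (U+10000);
        // DBFF DFFF is the last (U+10FFFF).
        System.out.println(Integer.toHexString(
                combine((char) 0xD800, (char) 0xDC00)));  // 10000
        System.out.println(Integer.toHexString(
                combine((char) 0xDBFF, (char) 0xDFFF)));  // 10ffff
    }
}
```

[Contrast this with a DBCS like Shift-JIS, where decoding needs per-code-page lead/trail byte tables rather than two fixed ranges and one addition.]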
Lakota--Oops!
Wednesday, November 14, 2000

Oh, I see: the long right leg is straight. Sorry.

Regards,
Jim Agenbroad ( [EMAIL PROTECTED] )
The above are purely personal opinions, not necessarily the official views of any government or any agency of any.
Phone: 202 707-9612; Fax: 202 707-0955
US mail: I.T.S. Dev.Gp.4, Library of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.
Re: Lakota (was Re: OT: Devanagari question)
On Tue, 14 Nov 2000, Rick McGowan wrote:

> [EMAIL PROTECTED] wrote:
>
> > Unfortunately, there's no corresponding LATIN CAPITAL LETTER N WITH LONG
> > RIGHT LEG, which Lakota needs.
>
> To my knowledge, the discussion in September between John Cowan and Curtis
> Clark didn't terminate with any actual proposal, and I'm not clear on
> whether the above assertion is a fact. I'm not saying I know anything
> about this field either. Does Lakota REALLY need a letter that isn't in
> Unicode?
>
> Are you in a position to provide documents and evidence, and/or make a
> definite proposal for adding this character? It would be a good thing
> to add, if it's really needed.
>
> Rick

Wednesday, November 15, 2000

Page 311 under "Dakota (Sioux)" in Van Ostermann's Manual of foreign languages (full citation in Unicode 3.0 on page 1008) shows both capital and small N with long right leg curving to the left. They both are also under Sioux on page 253 of Giliarevskii's Languages identification guide (full citation in 3.0 at page 1005). To me they look like U+014A and U+014B (called 'eng'). Am I missing something?

Regards,
Jim Agenbroad ( [EMAIL PROTECTED] )
The above are purely personal opinions, not necessarily the official views of any government or any agency of any.
Phone: 202 707-9612; Fax: 202 707-0955
US mail: I.T.S. Dev.Gp.4, Library of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.
Re: Java and Unicode
One thing I'm very curious about going forward: Right now character values greater than 65535 are purely theoretical. However this will change. It seems to me that handling these characters properly is going to require redefining the char data type from two bytes to four. This is a major incompatible change with existing Java.

There are a number of possibilities that don't break backwards compatibility (making trans-BMP characters require two chars rather than one, defining a new wchar primitive data type that is 4-bytes long as well as the old 2-byte char type, etc.) but they all make the language a lot less clean and obvious. In fact, they all more or less make Java feel like C and C++ feel when working with Unicode: like something new has been bolted on after the fact, and it doesn't really fit the old design.

Are there any plans for handling this?

--
Elliotte Rusty Harold | [EMAIL PROTECTED] | Writer/Programmer
The XML Bible (IDG Books, 1999)
http://metalab.unc.edu/xml/books/bible/
http://www.amazon.com/exec/obidos/ISBN=0764532367/cafeaulaitA/
Read Cafe au Lait for Java News: http://metalab.unc.edu/javafaq/
Read Cafe con Leche for XML News: http://metalab.unc.edu/xml/
(no subject)
Hi,

Is there any text editor by which data can be entered in Hindi?

Rgds,
Nikita K

__________________________________________________
Do You Yahoo!?
Yahoo! Calendar - Get organized for the holidays!
http://calendar.yahoo.com/
Javascript code charts, unicode converter, show-characters
I just made some fixes in my Javascript Unicode pages (insomnia again) that may be of interest.

http://www.macchiato.com/unicode/convert.html has UTF, RACE and LACE conversions, with a bit better error checking.
http://www.macchiato.com/unicode/charts.html has Unicode charts, plus a new "filter" on the left.
http://www.macchiato.com/unicode/show.html lets you type or paste in Unicode text, and see GIFs (in case fonts are missing).

Feedback is welcome, though I make no apologies for the simple GUI.

Mark

(Paul noted that someone had registered "BQ--3AADEABQAAYAAMQAMYAGSADGABQ4NVFU3THPLTTUW" with VGRS. My program says it's an error -- it appears to have an extra W at the end. The source text appears to be hangul: 퀀퀲퀀퀰퀀퀰퀀퀲퀀큦퀀큩퀀큦퀀큡킶탔킴탔탎탵탎클)