Re: Forwarded question....
Barry Caplan <[EMAIL PROTECTED]> wrote: >>I have a Japanese text file in Shift JIS and I need >>to convert it to escaped Unicode. >By "escaped Unicode", she means "\u" format. This type of conversion can also be done with UniPad (http://www.unipad.org). Import file as "Shift-JIS", Save As "ASCII + UCN", or Copy As "ASCII + UCN" via clipboard. UCN means Universal Character Name (i.e. "\u" sequences). --Torsten
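For readers without UniPad at hand, the same Shift-JIS to "\u" conversion can be sketched in a few lines of Python. This is only an illustration and not what UniPad does internally. The function names to_ucn and shift_jis_to_ucn are made up for this example, and writing characters beyond the BMP as \UXXXXXXXX is an assumption that follows the C convention (a Java consumer would expect a surrogate pair instead).

```python
def to_ucn(text: str) -> str:
    """Replace every non-ASCII character by a Universal Character Name."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                  # plain ASCII is kept as-is
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)      # BMP character: \uXXXX
        else:
            out.append("\\U%08X" % cp)      # assumption: C-style \UXXXXXXXX
    return "".join(out)


def shift_jis_to_ucn(data: bytes) -> str:
    """Decode Shift-JIS bytes and return ASCII text with \\u escapes."""
    return to_ucn(data.decode("shift_jis"))


if __name__ == "__main__":
    sample = "日本語 ABC".encode("shift_jis")
    print(shift_jis_to_ucn(sample))
```

Note that a literal backslash in the input is passed through unchanged, so the output is not unambiguous if the source text itself contains "\u".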
SC UniPad 0.99 (correct URLs)
The URLs to the screenshots of SC UniPad in my last message are pointing to an internal server. I'm sorry for this mistake. The correct URLs are: http://www.unipad.org/techinfo/screenshots/editor.html http://www.unipad.org/techinfo/screenshots/keyboard_layout.html http://www.unipad.org/techinfo/screenshots/character_map.html -- Torsten Mohrin, UniPad Team Sharmahd Computing http://www.unipad.org
SC UniPad 0.99
Dear Unicoders,

Because of the recent thread about UniPad, I think that not everybody on this list knows about UniPad yet. Therefore I'm posting this release note especially for the list members and the community of Unicoders, summarizing the important features. I will also outline what cannot be done with UniPad yet.

UniPad by Sharmahd Computing is a plain text editor for Unicode, running on the Microsoft operating systems Windows 95/98/ME/NT4/2000/XP. It comes with a built-in bitmap font available in two styles: variable width and fixed width. This font includes glyphs for almost 52000 characters covering the character repertoire of Unicode 3.2, except Plane 2 ideographs. Gathering and installing fonts is not necessary.

Not all scripts are fully supported; by default a nominal glyph will be used to depict a character, which is sufficient in many cases. All scripts that do not require special text processing are supported. Additionally, Arabic contextual form shaping is supported along with bidirectional text (bidi). Arabic shaping and bidi can be turned on and off easily, which can be quite useful.

Supported formats (i.e. encoding schemes) are: UTF-8, UTF-16, UTF-32, UTF-7, ASCII + Universal Character Names (i.e. \u sequences), Standard Compression Scheme for Unicode, ASCII + XML Character References. Files can be imported from and exported to several single-byte and multi-byte encodings: ISO 8859, Windows codepages, DOS codepages, Macintosh, KOI-8, VNI, VIQR, TCVN, VPS, VISCII, ISIRI-3342, Shift-JIS, KS X 1001 (EUC-KR), Big Five, CNS 11643 (EUC-TW), GB 2312 (EUC-CN), JIS X 0208 (EUC-JP), ARMSCII-8, GEOSTD8, TIS-620. Conversions can also be done through the clipboard using "Copy As" and "Paste As" commands (a feature I use quite often myself).
Possible input methods are: clicking on a character map, direct hex input, system keyboard (including installable Windows keyboards and East Asian IMEs), built-in virtual keyboards, user-defined loadable keyboards and certain third-party keyboard tools. About 60 built-in keyboards are available. A virtual keyboard window allows visual control of the selected keyboard and "typing" with the mouse. User-defined keyboards may be created by dragging characters from the character map to the keyboard window.

Individual display modes for certain character categories, such as spaces, formatting characters, unassigned codepoints and unpaired surrogates, can be changed separately for each document. A statusbar shows all relevant information about the character under the cursor: name, block, category, bidi category, encoded byte sequence, etc.

More: multilevel undo/redo, search and replace, printing, sending documents via email, several text conversions (uppercase, lowercase, resolving \u sequences, combining, etc.), configurable BOM handling, auto-detection and several common editor features.

The following things are not supported yet: shaping of Indic scripts (like Devanagari), vertical editing (for CJK, Mongolian), built-in keyboards with complex input methods (e.g. Tibetan or Ethiopian), conjoining Hangul Jamo behaviour, visual combination of non-spacing characters with base characters (however, pre-composed characters can be typed using the dead-key input method, and explicit composing/decomposing can be done), shaping of Syriac and Mongolian, variation selectors, Plane 2 ideographs. I hope to provide a road-map soon, showing our schedule for implementing these missing features.

I have probably forgotten to mention something, so please check it out yourself.
UniPad Home: http://www.unipad.org Download: http://www.unipad.org/download Screenshots: http://www.unipad.org.cold/techinfo/screenshots/editor.html http://www.unipad.org.cold/techinfo/screenshots/keyboard_layout.html http://www.unipad.org.cold/techinfo/screenshots/character_map.html Thank you for your interest. -- Torsten Mohrin, UniPad Team Sharmahd Computing http://www.unipad.org
Re: SC UniPad 0.99 released.
Jungshik Shin <[EMAIL PROTECTED]> wrote:

>> http://www.unipad.org
>
> On several occasions, I heard about it on this mailing list and finally
>my curiosity drove me to try it. Unfortunately, I was mightily
>disappointed. At first, I was intrigued by their claim that it
>supports Hangul Jamos. I've seen some false claims that Hangul
>Jamos is supported and wanted to see if it really supports them. Well,
>it does not do any better than most other fonts/software that made that
>claim. It just treats them as 'spacing characters' instead of combining
>characters. Basically, it's useless except for making Unicode code chart
>(so is Arial MS Unicode.)

Well... :)

1. I confess that it has to be made clearer what "support" actually means. We will explain this more precisely. However, displaying Jamo as separated characters is actually a certain level of support, while non-support would be to display hollow boxes. Therefore the Jamo support in UniPad is currently on a very basic level. But at least you can see something.

2. Please keep in mind that software improves gradually. This is version 0.99/1.0. Better support of certain scripts will be realised in future versions. This is planned for Indic scripts and also for Hangul.

3. If your definition of "support" is that strict, then I doubt that you will be able to find any software that can claim to support Unicode at all.

4. You have the chance to evaluate the software, as you did. You are free to decide not to use UniPad. I am sorry if it does not meet your requirements. But I wouldn't say that it is useless. This depends on your needs. For example, a hex editor is surely useless for the purpose of writing a 200 page essay. Nevertheless, a hex editor is without doubt a very useful tool.

>Then, I found its claim that it supports 300 languages(scripts). Wow !
>Does it properly support various South and Southeast Asian scripts?

Okay, okay :) We will define "support" more precisely.

>Again, it does not. It treats combining characters as spacing characters.
>I don't think users of those scripts would regard SC Unipad as supporting
>their scripts/languages.

You are right. I wouldn't write a letter to somebody in German where the diaeresis of an umlaut is displayed on the right side of the base character. If I want to write a letter there are many word processors out there which I can use. However, if I have (for instance) the need to distinguish between 'u with diaeresis' and 'u with double acute' I may need an editor that is able to display those characters separated and unambiguously. It's your decision whether you need such an editor or a word processor or some other Unicode editor.

I invite everybody to evaluate UniPad. If it's useful for your work, fine. If not, please consider re-evaluating it in a couple of months. Maybe version 1.1 will provide what you need.

With best regards

-- Torsten Mohrin, UniPad Team Sharmahd Computing http://www.unipad.org
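The 'u with diaeresis' versus 'u with double acute' case can also be checked programmatically, which is roughly what a nominal-glyph display does visually. The following Python sketch is only an illustration and has nothing to do with UniPad's implementation; the helper name describe is made up. It decomposes both letters and lists the base character and the combining mark separately.

```python
import unicodedata


def describe(text):
    """Return one (code point, character name) pair per character."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
            for ch in text]


# Canonical decomposition (NFD) splits each letter into base + combining mark.
u_diaeresis = unicodedata.normalize("NFD", "\u00FC")      # u with diaeresis
u_double_acute = unicodedata.normalize("NFD", "\u0171")   # u with double acute

if __name__ == "__main__":
    for s in (u_diaeresis, u_double_acute):
        for code, name in describe(s):
            print(code, name)
        print()
```

In a small font the two marks are easy to confuse when rendered on top of the base letter, while the separated listing leaves no doubt.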
Re: Radicals in CNS 11643-1992, Plane 1, Rows 7,8,9
"John H. Jenkins" <[EMAIL PROTECTED]> wrote: >Use the KangXi radicals in the KangXi radical block (U+2Fxx). Hmm, that is pretty obvious. I should have noted that myself. Thanks! --Torsten
Radicals in CNS 11643-1992, Plane 1, Rows 7,8,9
I need help from the CJK gurus: I found that only 3 Han radicals from plane 1 rows 7, 8, 9 of CNS 11643-1992 are mapped to Unicode (UniHan.txt 3.2.0). What should I do with these characters when converting CNS to Unicode? Map them to regular Han characters? Are there compatibility ideographs for round-trip conversion? (If this is documented somewhere, I obviously missed it. Please point me in the right direction. Thanks.) --Torsten
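The reply in this thread suggests the KangXi Radicals block (U+2Fxx). As a rough Python sketch of why that helps with round-tripping: each KangXi radical is a separate code point that carries a compatibility decomposition to a unified ideograph, so a converter can keep the radical distinct and fold it to a regular Han character only on request. The function name radical_to_unified is made up, and the table from CNS row/cell to radical number is not shown here; it is assumed to be supplied by the converter.

```python
import unicodedata

KANGXI_FIRST, KANGXI_LAST = 0x2F00, 0x2FD5   # KangXi Radicals block, 214 radicals


def radical_to_unified(ch):
    """Fold a KangXi radical to the unified ideograph it is compatible with."""
    if not (KANGXI_FIRST <= ord(ch) <= KANGXI_LAST):
        raise ValueError("not a KangXi radical: U+%04X" % ord(ch))
    # NFKC applies the <compat> decomposition of the radical.
    return unicodedata.normalize("NFKC", ch)


if __name__ == "__main__":
    radical = "\u2F00"
    unified = radical_to_unified(radical)
    print(unicodedata.name(radical), "-> U+%04X" % ord(unified))
```

Folding is lossy: once U+2F00 has become U+4E00, the information that the CNS source character was a radical is gone, so the fold should stay optional if round-trip conversion matters.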
Re: FON fonts i18n
Roozbeh Pournader <[EMAIL PROTECTED]> wrote: >Does anybody know the mechanism for adding i18n info to FON windows fonts? There is no i18n info in FON (bitmap) fonts, except the charset info (dfCharSet of the FONTDIRENTRY struct). The number of glyphs is limited to 256, with the restriction that a character code is mapped directly to the glyph index. If you want a bitmap font to work on all Windows platforms (and with all GDI drivers) you also have to choose the Windows 2.0 format (as a 16-bit executable), which has a limit of 64 kB file size per FNT file. If you really _must_ use this font format you should make multiple FNT files, treat them as dumb glyph collections and perform your own character-to-glyph mapping. We did this in UniPad, but we will use another technique for upcoming versions because of certain problems with Win2K and the restrictions of this format. Anyway, I can give some advice if you need it. --Torsten
Re: codepages on Windows
[EMAIL PROTECTED] wrote: >Anybody happen to know: Is there no Win32 API that allows you to determine >a codepage given a LANGID or a charset value (i.e. one of the two >parameters provided by WM_INPUTLANGCHANGE)? wParam of WM_INPUTLANGCHANGE is the character set of the new input locale. TranslateCharsetInfo() with TCI_SRCCHARSET converts that charset value to a codepage ID (which you can pass to MultiByteToWideChar(), for example). --Torsten
Re: Bytes and Unicode
"john" <[EMAIL PROTECTED]> wrote: >I much prefer the convention of >SInt8, SInt16, SInt32, SInt64, SInt128... >UInt8, UInt16, UInt32, UInt64, UInt128... >SChar8, SChar16, SChar32... >UChar8, UChar16, UChar32... >so that whether the thing is signed or unsigned is explicit and >tightly bound, as it were. Whether they are named "SInt8", "S_INT_8", "sint8_t" depends on personal taste, coding style and conventions. ISO C provides "uintXX_t" for unsigned integers. I agree that it would be better also to denote the signedness explicitly. But I have to deal with it. Redefining (renaming) all identifiers that do not conform to my taste is a fight I can't win. These data types unambigously define the size of the integer in bits. But for data interchange between different systems the byte order is also an issue. So "int16_t" should have two variants "int16be_t" and "int16le_t" and maybe "int16_t" is only the default of the actual processor architecture. That would require special compiler support. Does Java specify the byte order of the primitive data types? I don't know. But I would guess no, for performance reasons. --Torsten
Re: Abnormal Bytes and Unicode: (was Re: Unicode FAQ addendum)
Kenneth Whistler <[EMAIL PROTECTED]> wrote: >So the first step to interoperability in big, interconnected system >software using C is to set up fundamental header files containing >well-defined datatypes of fixed sizes, to make up for the lack of same >in the definition of C itself. The lack of fixed-size datatypes in C >is now a *defect* in the language, and not an *asset* of the language. The latest revision of ISO C has introduced exact-width integer types (like "int8_t", "int16_t" and so on). These are also straightforward names rather than "short", "BYTE" or "DWORD". --Torsten
Re: C # character model
Antoine Leca <[EMAIL PROTECTED]> wrote: >Torsten Mohrin wrote: >> Antoine Leca <[EMAIL PROTECTED]> wrote: >> [...] >> >> > APIs use and return single 16-bit values. >> > >> >Ah, that may be a problem (what is the ToUpper return value of ß?) >> >> I don't know the mentioned API, but it could return 0x00DF or (to >> indicate it as an error) 0x. I don't see a problem. > >The problem is that the "correct" answer is a two letter string, "SS". You are right. Sorry for being so ignorant. Obviously I'm working in ASCII mode today ;-) --Torsten
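The point is easy to reproduce in any environment whose string uppercasing applies the full Unicode case mappings. Python's str.upper() is used here purely as an illustration; it says nothing about the API under discussion, and the helper name upper_changes_length is made up.

```python
def upper_changes_length(s):
    """True if uppercasing s yields a string of a different length."""
    return len(s.upper()) != len(s)


if __name__ == "__main__":
    print("ß".upper())                    # one character in, two out
    print("Maße".upper())
    print(upper_changes_length("ß"))
```

An interface that maps one 16-bit value to one 16-bit value has no way to return such a result, which is why a string-based uppercase function is needed.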
Re: C # character model
Antoine Leca <[EMAIL PROTECTED]> wrote: [...] >> > APIs use and return single 16-bit values. > >Ah, that may be a problem (what is the ToUpper return value of ß?) I don't know the mentioned API, but it could return 0x00DF or (to indicate it as an error) 0x. I don't see a problem. --Torsten
Symbol for hermaphrodite (was: Gender symbols)
Herman Ranes <[EMAIL PROTECTED]> wrote: >In biology U+2640 is used as a 'female' symbol, and U+2642 as a 'male' >symbol, as reflected by their UNICODE names... In botany and zoology in particular, a symbol for hermaphroditic animals (e.g. snails) and plants is also used. It is a combination of U+2640 and U+2642 (with only one circle). Maybe this is a candidate for inclusion in Unicode. --Torsten
Re: German Sharp-S, again (was: The mother of all collation schemes)
The Duden also allows "ß" to be uppercased as "SZ" in ambiguous cases (e.g. "MASSE" (Masse) vs. "MASZE" (Maße)). Moreover, in the German Federal Armed Forces it is common to always uppercase "ß" as "SZ". --Torsten