Re: Forwarded question....

2002-08-29 Thread Torsten Mohrin

Barry Caplan <[EMAIL PROTECTED]> wrote:

>>I have a Japanese text file in Shift JIS and I need
>>to convert it to escaped Unicode. 
>By "escaped Unicode", she means "\u" format.

This type of conversion can also be done with UniPad
(http://www.unipad.org). Import file as "Shift-JIS", Save As "ASCII +
UCN", or Copy As "ASCII + UCN" via clipboard. UCN means Universal
Character Name (i.e. "\u" sequences). 
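
If you would rather script the conversion, the same idea can be
sketched in a few lines of Python. This is only an illustration, not
how UniPad does it: the function name to_ucn is made up, the input is
assumed to be valid Shift-JIS, and characters above U+FFFF are written
in the C-style \UXXXXXXXX form.

```python
def to_ucn(data: bytes, encoding: str = "shift_jis") -> str:
    # Decode the legacy bytes, then escape everything outside ASCII.
    text = data.decode(encoding)
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                 # plain ASCII passes through
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04X}")     # BMP character: \uXXXX
        else:
            out.append(f"\\U{cp:08X}")     # supplementary: \UXXXXXXXX
    return "".join(out)


sample = "日本語 text".encode("shift_jis")
print(to_ucn(sample))  # \u65E5\u672C\u8A9E text
```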

--Torsten





SC UniPad 0.99 (correct URLs)

2002-08-28 Thread Torsten Mohrin


The URLs to the screenshots of SC UniPad in my last message are
pointing to an internal server. 

I'm sorry for this mistake.

The correct URLs are:

http://www.unipad.org/techinfo/screenshots/editor.html
http://www.unipad.org/techinfo/screenshots/keyboard_layout.html
http://www.unipad.org/techinfo/screenshots/character_map.html


--
Torsten Mohrin, UniPad Team
Sharmahd Computing
http://www.unipad.org





SC UniPad 0.99

2002-08-28 Thread Torsten Mohrin

Dear Unicoders,

Because of the recent thread about UniPad, I think that not everybody
on this list knows about UniPad yet. Therefore I'm posting this
release note especially for the list members and the community of
Unicoders, summarizing the important features. I will also outline
what cannot be done with UniPad yet.

UniPad by Sharmahd Computing is a plain text editor for Unicode,
running on the Microsoft operating systems Windows
95/98/ME/NT4/2000/XP.

It comes with a built-in bitmap font available in two styles: variable
width and fixed width. This font includes glyphs for almost 52000
characters covering the character repertoire of Unicode 3.2, except
Plane 2 ideographs. Gathering and installing fonts is not necessary.
Not all scripts are fully supported; by default a nominal glyph will
be used to depict a character, which is sufficient in many cases.

All scripts that do not require special text processing are supported.
Additionally, Arabic contextual form shaping is supported along with
bidirectional text (bidi). Arabic shaping and bidi can be turned on
and off easily, which can be quite useful.

Supported formats (i.e. encoding schemes) are: UTF-8, UTF-16, UTF-32,
UTF-7, ASCII + Universal Character Names (i.e. \u sequences),
Standard Compression Scheme for Unicode, ASCII + XML Character
References. Files can be imported from and exported to several
single-byte and multi-byte encodings: ISO 8859, Windows codepages, DOS
codepages, Macintosh, KOI-8, VNI, VIQR, TCVN, VPS, VISCII, ISIRI-3342,
Shift-JIS, KS X 1001 (EUC-KR), Big Five, CNS 11643 (EUC-TW), GB 2312
(EUC-CN), JIS X 0208 (EUC-JP), ARMSCII-8, GEOSTD8, TIS-620.
Conversions can also be done through the clipboard using "Copy As" and
"Paste As" commands (a feature I use quite often myself).

Possible input methods are: clicking on a character map, direct hex
input, system keyboard (including installable Windows keyboards and
East Asian IMEs), built-in virtual keyboards, user-defined loadable
keyboards and certain third-party keyboard tools. About 60 built-in
keyboards are available. A virtual keyboard window allows visual
control of the selected keyboard and "typing" with the mouse.
User-defined keyboards may be created by dragging characters from the
character map to the keyboard window.

Individual display modes for certain character categories like spaces,
formatting characters, unassigned codepoints, unpaired surrogates and
such, can be changed separately for each document. A statusbar shows
all relevant information about the character under the cursor: name,
block, category, bidi category, encoded byte sequence, etc.
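
To give an idea of the kind of per-character information meant here,
the following Python sketch collects similar properties from the
standard unicodedata module. It is not UniPad code; the function name
describe is made up, and the block name is left out because
unicodedata does not provide it.

```python
import unicodedata


def describe(ch: str) -> dict:
    # Gather a few properties of a single character.
    return {
        "code point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),
        "bidi category": unicodedata.bidirectional(ch),
        "UTF-8 bytes": " ".join(f"{b:02X}" for b in ch.encode("utf-8")),
    }


for key, value in describe("\u0627").items():
    print(f"{key}: {value}")
```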

More: multilevel undo/redo, search and replace, printing, sending
documents via email, several text conversions (uppercase, lowercase,
resolving \u sequences, combining, etc.), configurable BOM handling,
auto-detection and several common editor features.

The following things are not supported yet: shaping of Indic scripts
(like Devanagari), vertical editing (for CJK, Mongolian), built-in
keyboards with complex input methods (e.g. Tibetan or Ethiopian),
conjoining Hangul Jamo behaviour, visual combination of non-spacing
characters with base characters (however, pre-composed characters can
be typed using dead-key input method and explicit composing/
decomposing can be done), shaping of Syriac and Mongolian, variation
selectors, Plane 2 ideographs. I hope to soon provide a road-map
showing our schedule for implementing these missing features.

I guess I forgot to mention something, so please check it out
yourself.

UniPad Home: http://www.unipad.org
Download: http://www.unipad.org/download
Screenshots:
http://www.unipad.org.cold/techinfo/screenshots/editor.html
http://www.unipad.org.cold/techinfo/screenshots/keyboard_layout.html
http://www.unipad.org.cold/techinfo/screenshots/character_map.html

Thank you for your interest.

--
Torsten Mohrin, UniPad Team
Sharmahd Computing
http://www.unipad.org





Re: SC UniPad 0.99 released.

2002-08-27 Thread Torsten Mohrin

Jungshik Shin <[EMAIL PROTECTED]> wrote:

>> http://www.unipad.org
>
> On several occasions, I heard  about it on this mailing list and finally
>my curiosity drove me to try it. Unfortunately, I was mightly
>disappointed.  At first, I was intrigued by their claim that it
>supports Hangul Jamos.  I've seen some false claims that Hangul
>Jamos is supported and wanted to see if it really support them. Well,
>it does not do any better than most other fonts/software that made that
>claim. It just treats them as 'spacing characters' instead of combining
>characters. Basically, it's useless except for making Unicode code chart
>(so is Arial MS Unicode.)

Well... :)

1. I confess that it has to be made clearer what "support" actually
means. We will explain this more precisely. However, displaying Jamo
as separated characters is actually a certain level of support, while
non-support would be to display hollow boxes. Therefore the Jamo
support in UniPad is on a very basic level currently. But at least you
can see something.

2. Please keep in mind that software improves gradually. This is
version 0.99/1.0. Better support of certain scripts will be realised
in future versions. This is planned for Indic scripts and also for
Hangul.

3. If your definition of "support" is that strict, then I doubt that
you will be able to find any software that can claim to support
Unicode at all. 

4. You have the chance to evaluate the software, as you did. You are
free to decide not to use UniPad. I am sorry if it does not meet
your requirements. But I wouldn't say that it is useless. This depends
on your needs. For example, a hex editor is useless for the purpose of
writing a 200-page essay, surely. Nevertheless, a hex editor is
without doubt a very useful tool.


>Then, I found its claim that it supports 300 languages(scripts). Wow !
>Does it properly support various South and Southeast Asian scripts?

Okay, okay :) We will define "support" more precisely.


>Again, it does not. It treats combining characters as spacing characters.
>I don't think users of those scripts would regard SC Unipad as supporting
>their scripts/languages.

You are right. I wouldn't write a letter to somebody in German where
the diaeresis of an umlaut is displayed on the right side of the base
character. If I want to write a letter there are many word processors
out there which I can use. However, if I have (for instance) the need
to distinguish between 'u with diaeresis' and 'u with double acute' I
may need an editor that is able to display those characters separated
and unambiguously. It's your decision whether you need such an editor,
a word processor or some other Unicode editor.
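
The distinction can be made visible programmatically as well. The
following Python sketch (illustrative only; spell_out is a made-up
helper name) decomposes each character and lists the base letter and
the combining mark separately:

```python
import unicodedata


def spell_out(text: str) -> list:
    # Canonically decompose, then name every resulting code point.
    decomposed = unicodedata.normalize("NFD", text)
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in decomposed]


for ch in ("\u00fc", "\u0171"):   # u with diaeresis, u with double acute
    print(spell_out(ch))
```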

I invite everybody to evaluate UniPad. If it's useful for your work,
fine. If not, please consider re-evaluating it in a couple of months.
Maybe version 1.1 will provide what you need.

With best regards
--
Torsten Mohrin, UniPad Team
Sharmahd Computing
http://www.unipad.org





Re: Radicals in CNS 11643-1992, Plane 1, Rows 7,8,9

2002-07-02 Thread Torsten Mohrin

"John H. Jenkins" <[EMAIL PROTECTED]> wrote:

>Use the KangXi radicals in the KangXi radical block (U+2Fxx).

Hmm, that is pretty obvious. I should have noted that myself. Thanks!
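
A quick check in Python (a sketch using the standard unicodedata
module, with the first radical picked as an arbitrary example) shows
why this keeps the round trip intact: the radical is a character of
its own and only collapses into the unified ideograph under
compatibility normalization.

```python
import unicodedata

radical = "\u2f00"                          # first character of the block
print(unicodedata.name(radical))            # KANGXI RADICAL ONE

# Canonical normalization leaves the radical alone ...
assert unicodedata.normalize("NFC", radical) == radical

# ... while compatibility normalization folds it into the regular Han
# character, so NFKC/NFKD must be avoided if round-tripping matters.
unified = unicodedata.normalize("NFKC", radical)
print(f"U+{ord(unified):04X}")              # U+4E00
```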

--Torsten





Radicals in CNS 11643-1992, Plane 1, Rows 7,8,9

2002-07-01 Thread Torsten Mohrin


I need help from the CJK gurus:

I found that only 3 Han radicals from plane 1 rows 7, 8, 9 of CNS
11643-1992 are mapped to Unicode (UniHan.txt 3.2.0).

What should I do with these characters when converting CNS to Unicode?
Mapping to regular Han? Are there compatibility ideographs for
round-trip conversion?

(If this is documented somewhere, I obviously missed it. Please point
me in the right direction. Thanks.)

--Torsten





Re: FON fonts i18n

2000-10-06 Thread Torsten Mohrin

Roozbeh Pournader <[EMAIL PROTECTED]> wrote:

>Does anybody know the mechanism for adding i18n info to FON windows fonts? 

There is no i18n info in FON (bitmap) fonts, except the charset info
(dfCharSet of FONTDIRENTRY struct). The number of glyphs is limited to
256 with the restriction that a character code is directly mapped to
the glyph index. If you want a bitmap font to work on all Windows
platforms (and with all GDI drivers) you also have to choose Windows
2.0 format (as 16 bit executable) which has a limit of 64kB file size
per FNT file.

If you really _must_ use this font format you should make multiple FNT
files, treat them as stupid glyph collections and perform your own
character to glyph mapping. We did this in UniPad but we will use
another technique for upcoming versions, because of certain problems
with Win2K and the restrictions of this format. Anyway, I can give
some advice if you need it.
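
Roughly, "your own character to glyph mapping" could look like the
following Python sketch. This is not the UniPad implementation; the
names build_glyph_map and locate_glyph are hypothetical, and the
scheme (sort the supported code points and fill one FNT file after
another) is just one simple possibility.

```python
GLYPHS_PER_FNT = 256  # glyph limit per FNT file, as described above


def build_glyph_map(code_points):
    # Assign every supported code point a (file index, glyph index) slot.
    mapping = {}
    for position, cp in enumerate(sorted(set(code_points))):
        mapping[cp] = divmod(position, GLYPHS_PER_FNT)
    return mapping


def locate_glyph(mapping, ch, fallback=(0, 0)):
    # Look up the slot for a character; unknown ones get a fallback glyph.
    return mapping.get(ord(ch), fallback)


glyph_map = build_glyph_map(range(0x20, 0x220))   # 512 code points
print(locate_glyph(glyph_map, "\u0120"))          # (1, 0)
```

The application then selects the font made from the given file and
draws the glyph index as if it were a character code.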

--Torsten




Re: codepages on Windows

2000-08-11 Thread Torsten Mohrin

[EMAIL PROTECTED] wrote:

>Anybody happen to know: Is there no Win32 API that allows you to determine
>a codepage given a LANGID or a charset value (i.e. one of the two
>parameters provided by WM_INPUTLANGCHANGE)?

wParam of WM_INPUTLANGCHANGE *is* the codepage ID (that you can pass
to MultiByteToWideChar(), for example).

--Torsten




Re: Bytes and Unicode

2000-07-25 Thread Torsten Mohrin

"john" <[EMAIL PROTECTED]> wrote:

>I much prefer the convention of
>SInt8, SInt16, SInt32, SInt64, SInt128...
>UInt8, UInt16, UInt32, UInt64, UInt128...
>SChar8, SChar16, SChar32...
>UChar8, UChar16, UChar32...
>so that whether the thing is signed or unsigned is explicit and
>tightly bound, as it were.

Whether they are named "SInt8", "S_INT_8", "sint8_t" depends on
personal taste, coding style and conventions. ISO C provides
"uintXX_t" for unsigned integers. I agree that it would be better also
to denote the signedness explicitly. But I have to deal with it.
Redefining (renaming) all identifiers that do not conform to my taste
is a fight I can't win.

These data types unambiguously define the size of the integer in bits.
But for data interchange between different systems the byte order is
also an issue. So "int16_t" should have two variants "int16be_t" and
"int16le_t" and maybe "int16_t" is only the default of the actual
processor architecture. That would require special compiler support.
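
To illustrate the byte order issue outside of C: Python's struct
module makes the byte order part of the format string, which is close
to the "int16be_t"/"int16le_t" idea. This is only a sketch of the
concept, not a proposal for C.

```python
import struct
import sys

value = 0x1234

big = struct.pack(">h", value)      # 16-bit signed, big-endian: 12 34
little = struct.pack("<h", value)   # 16-bit signed, little-endian: 34 12
native = struct.pack("=h", value)   # byte order of the running machine

print(big.hex(), little.hex(), native.hex(), sys.byteorder)

# Reading interchange data with the wrong byte order gives another number.
wrong = struct.unpack("<h", big)[0]
print(hex(wrong))                   # 0x3412
```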

Does Java specify the byte order of the primitive data types? I don't
know. But I would guess no, for performance reasons.

--Torsten




Re: Abnormal Bytes and Unicode: (was Re: Unicode FAQ addendum)

2000-07-24 Thread Torsten Mohrin

Kenneth Whistler <[EMAIL PROTECTED]> wrote:

>So the first step to interoperability in big, interconnected system
>software using C is to set up fundamental header files containing
>well-defined datatypes of fixed sizes, to make up for the lack of same
>in the definition of C itself. The lack of fixed-size datatypes in C
>is now a *defect* in the language, and not an *asset* of the language.

The latest revision of ISO C has introduced exact-width integer types
(like "int8_t", "int16_t" and so on). These are also straightforward
names rather than "short", "BYTE" or "DWORD". 

--Torsten




Re: C # character model

2000-06-28 Thread Torsten Mohrin

Antoine Leca <[EMAIL PROTECTED]> wrote:

>Torsten Mohrin wrote:
>> Antoine Leca <[EMAIL PROTECTED]> wrote:
>> [...]
>> >> > APIs use and return single 16-bit values.
>> >
>> >Ah, that may be a problem (what is the ToUpper return value of ß?)
>> 
>> I don't know the mentioned API, but it could return 0x00DF or (to
>> indicate it as an error) 0x. I don't see a problem.
>
>The problem is that the "correct" answer is a two letter string, "SS".

You are right. Sorry for being so ignorant. Obviously I'm working in
ASCII mode today ;-)
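
The point is easy to see once case mapping works on strings instead
of single 16-bit values. A small Python illustration (Python is just
a convenient example here, not the API under discussion):

```python
lower = "stra\u00dfe"          # "straße"
upper = lower.upper()          # full case mapping, may change the length

print(upper)                   # STRASSE
print(len(lower), len(upper))  # 6 7
```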

--Torsten




Re: C # character model

2000-06-28 Thread Torsten Mohrin

Antoine Leca <[EMAIL PROTECTED]> wrote:

[...]
>> > APIs use and return single 16-bit values.
>
>Ah, that may be a problem (what is the ToUpper return value of ß?)

I don't know the mentioned API, but it could return 0x00DF or (to
indicate it as an error) 0x. I don't see a problem.

--Torsten




Symbol for hermaphrodite (was: Gender symbols)

2000-06-27 Thread Torsten Mohrin

Herman Ranes <[EMAIL PROTECTED]> wrote:

>In biology U+2640 is used as a 'female' symbol, and U+2642 as a 'male'
>symbol, as reflected by their UNICODE names...

In botany and zoology in particular, a symbol for hermaphroditic
animals (e.g. snails) and plants is also used. It is a combination of
U+2640 and U+2642 (with only one circle). Maybe this is a candidate
for inclusion in Unicode.

--Torsten




Re: German Sharp-S, again (was: The mother of all collation schemes)

2000-06-16 Thread Torsten Mohrin


The Duden also allows uppercasing "ß" as "SZ" in ambiguous cases
(e.g. "MASSE" (Masse) vs. "MASZE" (Maße)). Moreover, in the German
Federal Armed Forces it is common to always uppercase "ß" as "SZ".
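
In code, this amounts to making the replacement for "ß" a parameter
of the uppercasing routine. A minimal Python sketch (upper_german is
a made-up name, and deciding when a case is "ambiguous" is left to
the caller):

```python
def upper_german(text: str, sharp_s: str = "SS") -> str:
    # Replace the sharp s first, then uppercase the rest as usual.
    return text.replace("\u00df", sharp_s).upper()


print(upper_german("Masse"))        # MASSE
print(upper_german("Maße"))         # MASSE (ambiguous with the above)
print(upper_german("Maße", "SZ"))   # MASZE
```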

--Torsten