Hi Christian,

>> Should the translation be "accurate" or should it be "useful"?

That depends a lot on which languages we are talking about.

For DISPLAYING already existing strings, such as file
names on some USB stick made by somebody using Linux, MacOS
or Windows: if your language is "something Latin", you can
get reasonable results with a simplified display which just
drops the accent from a character whenever your current
codepage lacks the accented char but has a similar plain one.
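
Just to illustrate the idea, a minimal sketch (the fold table
and the fold_to_ascii helper are made up for this mail, they
are not from any real driver):

  /* Fold a few accented Unicode codepoints to plain ASCII
     when the active codepage has no glyph for them. */
  char fold_to_ascii(unsigned short cp)
  {
      static const unsigned short folds[][2] = {
          { 0x00E9, 'e' },  /* e acute   -> e */
          { 0x00E8, 'e' },  /* e grave   -> e */
          { 0x00FC, 'u' },  /* u umlaut  -> u */
          { 0x00E7, 'c' },  /* c cedilla -> c */
      };
      unsigned i;
      for (i = 0; i < sizeof(folds) / sizeof(folds[0]); i++)
          if (folds[i][0] == cp)
              return (char)folds[i][1];
      return (cp < 0x80) ? (char)cp : '?';  /* nothing similar */
  }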

If you try the same with Russian, you will at least have to
switch to a Cyrillic codepage, or maybe have both active at
the same time (VGA can keep two fonts loaded at once, which
gives you 512 chars). But if our imaginary USB stick contains
the Anime collection of your Japanese friend, any attempt to
display the file names in a western or Cyrillic codepage will
look really bad.

In the other direction, you may want to GENERATE strings in
Unicode. Of course KEYB, MKEYB and similar support switched
and local codepages. I assume that DOSLFN, KEYB and DISPLAY
can signal each other to let you use a suitable layout and
codepage to give your files Cyrillic names, display them in
the right way and read/write file names as UTF8 on your USB
stick... Somebody should check the documentation for more
details ;-). Yet again, try the same with ASIAN languages:

You would need an Input Method driver which lets you enter
complex key sequences or combinations to type in a language
whose alphabet has far more than the usual few dozen chars.
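
Just to give a flavor of what such a driver does internally,
a purely hypothetical sketch (the table and the names are
invented for this mail):

  #include <string.h>

  /* Map a typed key sequence to one Unicode codepoint, the
     way a romaji-to-kana input method might. Real IME tables
     are of course far larger and stateful. */
  struct compose { const char *keys; unsigned short codepoint; };

  static const struct compose table[] = {
      { "ka", 0x304B },  /* HIRAGANA LETTER KA */
      { "ki", 0x304D },  /* HIRAGANA LETTER KI */
      { "n",  0x3093 },  /* HIRAGANA LETTER N  */
  };

  int ime_lookup(const char *keys, unsigned short *codepoint)
  {
      unsigned i;
      for (i = 0; i < sizeof(table) / sizeof(table[0]); i++)
          if (strcmp(table[i].keys, keys) == 0) {
              *codepoint = table[i].codepoint;
              return 1;  /* found */
          }
      return 0;  /* no such sequence */
  }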

For CJK languages, you typically also need a wide font: the
usual 8 or 9 pixels of width will not be enough. So you
probably end up using a graphics mode CON driver or some
similar system, probably with a relatively big font with
hundreds of different character shapes in RAM, maybe in XMS.
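
Some rough arithmetic (my own estimate, no specific driver in
mind) shows why: a 16x16 monochrome glyph takes 16 rows times
2 bytes, so 32 bytes per glyph, and with about 7000 common
CJK glyphs you already need 7000 * 32 = 224000 bytes, over
200 KiB - clearly more than you want to keep in conventional
memory.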

> UTF-8 is independent of byte-order. The exact encoding (and byte-order)
> should always either be implicit (in the interface's or format's
> definition) or be marked in some way. The definition of a string's length
> (possibly number of bytes/words/dwords, number of code-points, number of
> "characters") need not be addressed by such an interface. If there is a
> need for a buffer or string length (see below) a new interface should just
> define that all "length" fields/parameters give the length in bytes.

I would also vote for UTF8: it keeps ASCII strings unchanged,
and strings with only a few non-ASCII chars, e.g. accented
chars, only get a few bytes longer.

In addition, you get a sort of graceful degradation: tools
which are not Unicode-aware would treat the strings as if
they used some unknown codepage. Such a tool would think
that "Andre" with an accented e (Andr plus the two UTF8
bytes 0xC3 0xA9) is 6 characters long, but at least you can
still see the "Andr" in it.
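
A quick sketch of what that looks like from C (the string is
just "Andre" with an accented e, UTF8 encoded by hand here):

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      const char *name = "Andr\xC3\xA9";  /* Andre, accented e */
      /* A codepage-only tool counts bytes, not characters: */
      printf("%u bytes\n", (unsigned)strlen(name));  /* 6 */
      return 0;
  }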

In the other direction, if you accidentally feed in text
with Latin1 or codepage 850 / 858 encoding, you get Andr
plus a single byte for the accented e (0xE9 in Latin1), and
that byte and possibly a char or two after it will be shown
in a broken way by a CON driver which expects UTF8 instead.
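
The reason it breaks: the lead byte tells a UTF8 decoder how
many bytes to expect, and the Latin1 byte for the accented e
(0xE9) happens to announce a 3 byte sequence. A small sketch:

  /* How many bytes a UTF8 sequence starting with lead byte b
     claims to have (0 = not a valid lead byte). */
  int utf8_seq_len(unsigned char b)
  {
      if (b < 0x80) return 1;  /* plain ASCII */
      if (b < 0xC0) return 0;  /* stray continuation byte */
      if (b < 0xE0) return 2;
      if (b < 0xF0) return 3;
      if (b < 0xF8) return 4;
      return 0;
  }

So a UTF8 CON driver seeing Latin1 "Andr" plus 0xE9 will try
to eat the two bytes AFTER the accented e as well, which is
why the following chars get garbled too.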



As you already say, for BETTER compatibility, you always
have to be aware of whether your string uses UTF8 or
codepage encoding. In theory you could also support DBCS
or UTF16-LE or similar, but I would vote against those.

This awareness means that you know how to RENDER the
string (e.g. switch fonts or the mode of the CON driver, or
use built-in rendering as in Blocek), how many CHARACTERS
and how many BYTES long the string is, and what counts as
ONE CHARACTER, for example for sorting or when you
replace/edit a char.
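
Counting CHARACTERS instead of BYTES is luckily cheap with
UTF8, because all continuation bytes look like 10xxxxxx. A
minimal sketch:

  /* Count codepoints in a UTF8 string by skipping the
     continuation bytes (bit pattern 10xxxxxx). */
  unsigned utf8_char_count(const char *s)
  {
      unsigned n = 0;
      for (; *s; s++)
          if (((unsigned char)*s & 0xC0) != 0x80)
              n++;
      return n;
  }

For "Andr" plus the two bytes 0xC3 0xA9 this returns 5,
while a plain byte count returns 6.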

As said, UTF8 degrades relatively gracefully, but you
still want explicit support for heavier uses like text
editors, playlists, file managers and similar :-)

I do not understand the "codepoints are 24 bit numbers"
issue. Unicode chars with numbers above 65535 are very
exotic in everyday languages, so I would not even start
to support them in DOS. If you mean UTF8, then what you
get is 2 bytes for characters from U+0080 to U+07ff and
3 bytes for characters from U+0800 to U+ffff - only for
chars with numbers above 65535 would you need 4 bytes to
UTF8 encode one character :-)
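
For reference, the encoding rules just described as a small
sketch (covering only codepoints up to U+ffff):

  /* Encode one codepoint below 0x10000 as UTF8.
     Returns the number of bytes written (1 to 3). */
  int utf8_encode(unsigned long cp, unsigned char *out)
  {
      if (cp < 0x80) {            /* 1 byte: plain ASCII */
          out[0] = (unsigned char)cp;
          return 1;
      }
      if (cp < 0x800) {           /* 2 bytes */
          out[0] = 0xC0 | (unsigned char)(cp >> 6);
          out[1] = 0x80 | (unsigned char)(cp & 0x3F);
          return 2;
      }
      /* 3 bytes, up to U+ffff */
      out[0] = 0xE0 | (unsigned char)(cp >> 12);
      out[1] = 0x80 | (unsigned char)((cp >> 6) & 0x3F);
      out[2] = 0x80 | (unsigned char)(cp & 0x3F);
      return 3;
  }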

> define what Unicode encoding to use (UTF-8, -16BE, -16LE, -32BE, -32LE)

Luckily UTF8 is quite common, compact and byte order
independent. I think Mac / Office sometimes use one of the
UTF16 encodings, but otherwise those are not so widespread.
The UTF32 encodings are rarer still.

> apps have to figure out on their own what encoding their data uses.

That hopefully only affects text editors ;-)
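
For those editors, checking for a byte order mark is a cheap
first guess. A sketch (the BOM byte patterns are defined by
the Unicode standard, the function itself is made up here):

  /* Guess the encoding from a BOM at the start of a buffer.
     Returns a name, or NULL if there is no BOM. The UTF-32LE
     check must come before the UTF-16LE check. */
  const char *bom_encoding(const unsigned char *b, unsigned len)
  {
      if (len >= 4 && b[0] == 0x00 && b[1] == 0x00 &&
          b[2] == 0xFE && b[3] == 0xFF) return "UTF-32BE";
      if (len >= 4 && b[0] == 0xFF && b[1] == 0xFE &&
          b[2] == 0x00 && b[3] == 0x00) return "UTF-32LE";
      if (len >= 3 && b[0] == 0xEF && b[1] == 0xBB &&
          b[2] == 0xBF) return "UTF-8";
      if (len >= 2 && b[0] == 0xFE && b[1] == 0xFF)
          return "UTF-16BE";
      if (len >= 2 && b[0] == 0xFF && b[1] == 0xFE)
          return "UTF-16LE";
      return NULL;  /* no BOM: you will have to guess */
  }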

Regards, Eric

