Hi all!
Saluton amiko!

Before I forget, I noticed that you do use ISO codepages.
I'll work on distinct packs of codepages and keyboard layouts for ISO 
8859-1 ~ 16.
>> While Unicode is huge, DOS keyboard layouts tend to be limited to
>> Latin and Cyrillic and some other symbols, which is a tiny subset.
Nowadays, FreeDOS is able to work with the latin, cyrillic, greek, 
armenian and georgian alphabets, the cherokee syllabary and japanese.
>> If you do not count CJK and right-to-left languages and REALLY
>> exotic languages and symbols (maths, dingbats), Braille etc etc
>> then the number of Unicode characters that people are likely to
>> type on their keyboard in DOS is quite manageable. Of course it
>> is still fine to have a somewhat more complete font in DISPLAY.
> Right-to-left might be hard to do (I guess?), but technically as long
> as they can see and enter what they want, I'm sure they can get used
> to left-to-right. BTW, there was an old Forth for DOS with Korean font (...)
Excuse me? How can anyone type the arabic, syriac or hebrew abjads from 
left to right? *That* would be really exotic, if ever possible! :-)
Visually speaking, a reader who doesn't know hebrew (or yiddish, or 
ladino, etc.) might not be able to tell whether a text is typed 
correctly (right-to-left) or incorrectly (left-to-right), because the 
letters don't connect to each other. Abjads like arabic and syriac, on 
the other hand, have most of their letters shaped so that they connect 
to each other - always from right to left.
>>> that). And then I (erroneously?) thought BMP ("basic multilingual
>>> plane") was the easy, two-byte Western portion, but apparently that's
>>> not true.
Well - that might be true after all. Under Unicode, if you use the UCS-2 
encoding, every character in the BMP is represented by 2 bytes. Period. 
UCS-2 works very well for CJK text because even when regular 
(non-accented) latin letters and digits are needed, they are encoded as 
"fullwidth" (double-byte) characters in a distinct block of the BMP. All 
CJK glyphs in the BMP, if stored under UTF-8, take 3 bytes.
UCS-2 is also good for all abugidas (devanagari, bengali, etc.), because 
those scripts would likewise need 3 bytes per glyph under UTF-8.
UTF-8 is best suited for languages written with the latin alphabet, 
because such a text oscillates between 1 and 3 bytes per char. Yes, 3 
bytes, because many punctuation marks, currency signs, etc. lie above 
codepoint 07FFh - the point where UTF-8 starts needing 3 bytes per 
glyph. Medieval texts also rely heavily on the Latin Extended-D block, 
which is way above the 07FFh boundary.
The downside of UCS-2 is that it is limited to the BMP, while UTF-8 has 
(in practice) no such limitation.
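
Just to make those byte counts concrete, here is a tiny C sketch (purely 
my own illustration, nothing FreeDOS-specific) that tells how many bytes 
a given codepoint takes under UTF-8:

#include <stdio.h>

/* How many bytes a single Unicode codepoint needs under UTF-8. */
static int utf8_bytes(unsigned long cp)
{
    if (cp < 0x80UL)    return 1; /* "C0 Controls and Basic Latin"            */
    if (cp < 0x800UL)   return 2; /* Latin-1 Supplement, Greek, Cyrillic, ... */
    if (cp < 0x10000UL) return 3; /* rest of the BMP: Devanagari, CJK, ...    */
    return 4;                     /* supplementary planes beyond the BMP      */
}

int main(void)
{
    printf("U+0041 latin A        : %d byte(s)\n", utf8_bytes(0x0041UL));
    printf("U+00E3 a-tilde        : %d byte(s)\n", utf8_bytes(0x00E3UL));
    printf("U+0915 devanagari KA  : %d byte(s)\n", utf8_bytes(0x0915UL));
    printf("U+20AC euro sign      : %d byte(s)\n", utf8_bytes(0x20ACUL));
    return 0;
}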
>>> 1). Chinese (hard)
>> See above.
> We'd have to ask someone "in the know", e.g. Johnson Lam. I think he
> had some primitive workaround for PG.
>>> 4). Arabic (easy??)
>> Unicode lists maybe 300 chars for that, at most.
If we restrict ourselves to the arabic language, I can tell you that it 
is much less.
If we mean the arabic abjad - which brings in around 100 languages that 
use it (persian, urdu, sindhi, uyghur), that used it in the middle ages 
as a result of the moorish invasion (portuguese, spanish) or otherwise 
used it historically (croatian, belarusian), or that used it in Africa 
(hausa) and Asia (turkish, azeri, etc.) - then I can tell you that we're 
talking about much more than 300 chars.
> Really? Wikipedia lists 28 char alphabet (single case), IIRC.
Yes - but there's a catch here. Let's think about the glyphs. Letters in 
the latin alphabet have two distinct shapes (upper- and lowercase) and, 
considering that, the regular latin alphabet comprises "52" chars. 
The arabic abjad, by its nature, provides up to 4 distinct shapes per 
letter. If we consider the uyghur language, which uses the arabic abjad 
as a regular alphabet (i.e. full representation of vowels), there are up 
to eight shapes per letter, because uyghur is unique among languages 
which use the arabic abjad in that it has digraphs as part of its 
alphabet, like hungarian has "ZS" (ZS, Zs, zs) or czech has "CH" (CH, 
Ch, ch).
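
To make the "4 shapes per letter" point concrete, here is a small C 
sketch (purely illustrative) with the letter beh (U+0628) and its four 
contextual glyphs from Unicode's "Arabic Presentation Forms-B" block; a 
shaping engine or font picks one of them depending on the neighbours:

#include <stdio.h>

/* One abstract arabic letter, BEH (U+0628), and its four contextual
 * glyphs from the "Arabic Presentation Forms-B" block. */
struct arabic_letter {
    unsigned base, isolated, final_form, initial, medial;
};

static const struct arabic_letter beh =
    { 0x0628, 0xFE8F, 0xFE90, 0xFE91, 0xFE92 };

int main(void)
{
    printf("U+%04X BEH: isolated U+%04X, final U+%04X, initial U+%04X, medial U+%04X\n",
           beh.base, beh.isolated, beh.final_form, beh.initial, beh.medial);
    return 0;
}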
>>> 5). Hindi
>> The writing system is "Devanagari", case insensitive,
>> has ligatures, not many characters, like Bengali?
> Apparently the Sanskrit alphabet, aka Deva-nagari or just Nagari. Has
> some interesting workarounds (e.g. ISCII, I think).
>> Similar to what happens with Cyrillic, there is ISCII
>> which puts ASCII and Devanagari together in 256 chars,
>> even with Bengali and some other scripts (approx?).
> There you go, you saw Wikipedia too!   ;-)
>>> 6). Bengali
>> Apparently has ligatures and is case-insensitive?
> Aka, Bangla (from Bangladesh), uses Eastern Nagari (similar but not
> same). Looks like it could fit in a code page! Interesting workarounds
> include IAST and ITRANS.
ISCII apparently relies on subfonts and probably only worked in graphics 
mode. I imagine that is because of the complex shapes of the letters of 
abugidas like tamil, malayalam or telugu. There's absolutely no way of 
drawing them in a tiny 16x8 dot matrix, as can be done for latin or 
cyrillic letters. MS/IBM DOS provided codepage 806 in those days, but it 
only provided aksharas of the devanagari script and, even then, only 
their regular forms. All scripts of the indian subcontinent make heavy 
use of conjuncts. Think of them as if there were a distinct glyph for 
hungarian "zs" or czech "ch", but visually so different that, unless you 
were used to it, you could never tell that it referred to "z+s" or 
"c+h". It goes further: there are conjuncts which represent the merging 
of 3, 4 or even 5 aksharas (the characters of the abugidas). 
Interestingly, if we check every single codepoint of Unicode, we will 
never find a conjunct. They're all internally encoded into fonts like 
"Mangal", "Vrinda" and others (I personally recommend "Chandas", as I 
find it far more beautiful).
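
To show how a conjunct actually travels through a file - as a sequence 
of aksharas joined by the virama (halant) sign, never as a codepoint of 
its own - here is a tiny C illustration (my sketch, with the UTF-8 bytes 
written out by hand):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* The devanagari conjunct "kSa": KA + VIRAMA + SSA.  Unicode only
     * stores this sequence; the font substitutes the single conjunct
     * glyph when the text is rendered. */
    const char ksha[] =
        "\xE0\xA4\x95"   /* U+0915 DEVANAGARI LETTER KA   */
        "\xE0\xA5\x8D"   /* U+094D DEVANAGARI SIGN VIRAMA */
        "\xE0\xA4\xB7";  /* U+0937 DEVANAGARI LETTER SSA  */

    printf("one conjunct glyph on screen, %u bytes in the file\n",
           (unsigned)strlen(ksha));
    return 0;
}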

ISCII codepage 806 had a control character in its upper half called 
"ATR" (codepoint EFh). That was the catch. If the user wanted to type 
assamese or gujarati or bengali instead of hindi (after all, the default 
character set of ISCII is devanagari), he had to press some <Ctrl> + 
<key> combination so that DOS accessed an internal font. From that 
moment on, if the user typed, let's say, whatever the key was for the 
akshara (semisyllable) "ka", he would not see a devanagari "ka", but a 
gujarati "ka", a telugu "ka", a malayalam "ka", etc. I still wander, 
though, how ISCII dealt with conjuncts (in what comes to codepoints). If 
we multiply the number of conjuncts by the number of abugidas in the 
indian subcontinent, we easily have thousands of distinct glyphs.

Another aspect of the nature of abugidas is that they use (what we would 
call) diacritics but, unlike "á", "ë", etc., there are no precomposed 
accented aksharas, either in Unicode or internally encoded into fonts. 
Unless ISCII also accessed subfonts with all possible precomposed 
aksharas and conjuncts of all abugidas (which would easily amount to 
tens of thousands), that means all diacritics were standalone chars, 
working like the diacritics found in the combining diacritics block.

My conclusion: either there was a wholly tailored MS/IBM-DOS for India 
in those days, or there were particular COM/EXE programs that would put 
any regular DOS into graphics mode so as to handle ISCII.
>>> 7). Portuguese (easy)
>> Indeed.
> Henrique!!!
Yes? :)

Easy indeed, naturally. Like any other language written with the latin 
alphabet, it requires the 1-byte chars of the first BMP block ("C0 
Controls and Basic Latin", 00h-7Fh), then a particular set of 
precomposed 2-byte chars from the following blocks (portuguese actually 
needs only a few precomposed chars from the very next block, "C1 
Controls and Latin-1 Supplement", 80h-FFh). Finally, a brazilian user 
who needs to type the glyph of our formerly used Cruzeiro (in some 
historical context), or a portuguese user who needs to type the Euro, 
will find those 3-byte glyphs in the block called "Currency Symbols" 
(20A0h-20CFh). Brazilian or portuguese, if he is to follow strict 
typography rules, he will also need a few 3-byte glyphs from "General 
Punctuation" (2000h-206Fh). (Emphasizing: references to 1 ~ 3-byte chars 
throughout this paragraph are all according to the UTF-8 encoding.)
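
Just as an illustrative check of those byte counts (UTF-8 bytes written 
out by hand, nothing more than a sketch):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* UTF-8 bytes spelled out explicitly, so the 1/2/3-byte mix is visible. */
    const char *a       = "a";              /* U+0061, Basic Latin          */
    const char *a_tilde = "\xC3\xA3";       /* U+00E3, Latin-1 Supplement   */
    const char *euro    = "\xE2\x82\xAC";   /* U+20AC, Currency Symbols     */
    const char *en_dash = "\xE2\x80\x93";   /* U+2013, General Punctuation  */

    printf("a: %u, a-tilde: %u, euro: %u, en dash: %u byte(s)\n",
           (unsigned)strlen(a), (unsigned)strlen(a_tilde),
           (unsigned)strlen(euro), (unsigned)strlen(en_dash));
    return 0;
}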

It is worth mentioning that english is generally regarded as "pure 
ASCII", but we must consider the fair amount of foreign words (like 
"café") and the accented/special chars used in middle and old english; 
therefore the english language (just like german, french or any other 
latin-alphabet-based language) falls into the same situation as 
portuguese.
>>> 8). Russian (easy??)
>> The well-known cyrillic codepages squeeze ASCII and Cyrillic
>> (probably not all theoretically possible accents) in 256 chars.
> Probably like others only includes the "important" stuff.
"Important" is a complicated word to use here... For a russian? Yes. He 
can even choose. All cyrillic letters for the russian language are 
available on the well-known codepage 866, along with 808, 855, 872, 771 
and many others. If we think on macedonian or serbian users though, they 
would call "important" codepages 855 and 872 only (never 866), since 
they're the only ones released by the industry on those days to provide 
the distinct cyrillic serbian and macedonian letters. There is not a 
particular cyrillic codepage which gathers all letters for russian, 
belarusian, ukrainian, macedonian and serbian letters and I didn't even 
mention the needs of the official languages of Russia's internal 21 
republics.

When it comes to storage (and UTF-8), russian needs the regular latin 
digits (1 byte each) and the cyrillic letters (2 bytes per char); if we 
think of cyrillic needs in general, then we also have the ukrainian 
hryvnia currency sign, a 3-byte char (again in "Currency Symbols", 
20A0h-20CFh). And if we think of text written in old church slavonic, 
then we break the 07FFh boundary and use 3-byte chars from "Cyrillic 
Extended-A" (2DE0h-2DFFh) and "Cyrillic Extended-B" (A640h-A69Fh) or, if 
it is written with the glagolitic alphabet, the "Glagolitic" block 
(2C00h-2C5Fh) is needed.

In the end, cyrillic (as well as georgian, armenian, greek, coptic) ends 
up falling into the same category as (and is therefore as easy to handle 
as) portuguese, english, swedish, etc.: one to three bytes per char, 
under UTF-8.
>>> 9). Japanese (hard)
>> See above.
> I didn't even look this one up, but I vaguely remember reading once
> that they use two or three scripts (ugh): hiragana, kanji, etc. (EDIT:
> Seems I forgot katakana.)
Strictly speaking, they also use "romaji" (our good old latin alphabet) 
occasionally. It is not uncommon, particularly in technical literature, 
to find acronyms or digits or certain words in latin script in a 
japanese text (well, it also applies to chinese and korean, so I should 
say "CJK text"). As I said before, when it comes to computers, they 
generally use the regular latin letters and digits found in the 
"Halfwidth and Fullwidth Forms" block (FF00h-FFEFh) and not the chars 
found at 00h-7Fh.
>>> own scripts are a problem, not to mention those like CJK that have
>>> thousands of special characters. (e.g. Vietnamese won't fit into a
>>> single code page, even.)
Actually, it does. There was a standard called VISCII in the old days, 
and it has been available for FreeDOS for a while already. The catch is: 
due to the huge number of necessary precomposed chars (134), there are 
no linedraw, shade, block or any other strictly non-vietnamese chars in 
the upper half of VISCII, and 6 less-used control chars in the lower 
half had their glyphs traded for the remaining 6 precomposed vietnamese 
accented latin letters.
>> When you have Unicode, you do not need codepages.
> Right. And when you have a 286 or 386, you don't need to limit to 1 MB
> of RAM.   ;-))
Furthermore, due to the number of glyphs (and the shape complexity of 
many of them), I can only imagine Unicode working in graphics mode, and 
that will certainly complicate matters for very old computers... unless 
some sort of "sub-Unicode" support is considered for them, focusing only 
on the latin, cyrillic, greek, armenian and georgian alphabets, because 
their letters easily fit in regular codepages and they cover the needs 
of the majority of the world's languages. That could be the best 
possible workaround.

I'm also working on the arabic and hebrew abjads - particularly for use 
under Mined. Codepages 856 and 862 (hebrew) and 864 (arabic) have been 
ready for a long time already, but I had never seen a way to use them 
until I found out about Mined. So far, on request, I have only prepared 
a phonetic spanish/arabic keyboard layout. Since it addressed a 
particular need (rather than a regular standard), it will not be 
released in the keyboard layout pack for FreeDOS - unless, naturally, 
I'm told that many users would need it.
>> you either have to encode Unicode (or similar encodings) as
>> 16 bit characters with DBCS, possibly even 2 per character in
>> the surrogate case for 20 bit encodings (Linear B or old Sumer
>> Cuneiform from 3000 BC anybody? :-D) or as sequence of single
>> bytes in UTF-8. The latter is convenient because frequent DOS
>> charsets like ASCII or Latin need only 1-2 bytes while you can
>> still encode up to 31 bits: U+07FF still fits 2 bytes and all
>> 16 bit chars need only 3 bytes, the rest is very rare...
> I think the real (proposed) advantage is that it doesn't waste space
> if your main language(s) are Western. Also the byte stream is
> recoverable if interrupted (so you can tell and resume at next valid
> char). I think.  :-/
>> Any software which tries to do layout (say, line wrapping or
>> tables) has to understand how UTF-8 encodes 1 character as 1-
>> or-more bytes, otherwise the layout gets messy. Still, if you
>> have a DISPLAY with UTF-8 support, all ASCII (0-127) will be
>> as normal and compatible with any ancient software :-)
> Ugh, such a pain. But we do have some Unicode-aware tools (e.g. JED
> 0.99.16+ or VILE or GNU Emacs and of course Mined). I also know that
> OpenWatcom's vi became 8-bit friendly not too long ago.
>
> Well, old computer languages like Ada83 were 7-bit only, but later
> Ada95 was 8-bit friendly (and even Modula-3 defaulted CHAR to
> Latin-1). But some (like Java) default to UTF-16 (or maybe UCS-16, is
> there a difference?). I'm not sure why I felt the need to mention it,
> just saying, "it depends" (and is complicated). Perhaps my point is
> that it wasn't urgent to support "everything" then and probably isn't
> now either.  :-P
>
>>> Nevertheless, perhaps some way of combining would make the most sense
>>> to me, at least for Latin / Roman alphabets. 'a' + macron or 'a' +
>>> circumflex or whatever. Then you wouldn't have to store ten million
>>> redundant letters that only differ in accents...
>>
>> On one hand, it saves time with font design.
>
> That's what I was thinking. And yet I was despairing more and more, I
> even wondered if just supporting IPA directly would save time / space
> somehow.   o_O     Doubt it, approx. 157 chars needed (too big for a
> code page unless you cut out part of the ASCII compatibility). I'm
> probably way off base with reality here, just thinking outloud.
>
>> On the other, now
>> that you mention it, Unicode also has COMBINING characters, in
>> particular of course diacritics. You put those after any char,
>> yet you see them in the same column as the char.
>
> Right, but most Unicode-aware software isn't combining friendly (last I 
> heard).
>
>> Some chars can
>> even have multiple diacritics. Yet if your font cannot combine,
>> or if the combination does not make sense, software tends to
>> display the accent AFTER the character as separate char. Also,
>> you can "normalize" the combinations together. In particular in
>> Latin codepage languages, the combination of char plus accent
>> very often already exists as ONE character so software which
>> can figure that out does not need the ability to graphically
>> combine chars with separately stored diacritics in the font.
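
A minimal C illustration of that point (my own sketch, with the bytes 
written out by hand): the same "á" can arrive either precomposed or as 
"a" plus a combining acute accent, and a normalizer just rewrites one 
spelling as the other.

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Two valid UTF-8 spellings of exactly the same text, "a-acute": */
    const char precomposed[] = "\xC3\xA1";   /* U+00E1 LATIN SMALL LETTER A WITH ACUTE */
    const char decomposed[]  = "a\xCC\x81";  /* U+0061 + U+0301 COMBINING ACUTE ACCENT */

    printf("precomposed: %u bytes, decomposed: %u bytes\n",
           (unsigned)strlen(precomposed), (unsigned)strlen(decomposed));
    /* A normalizer rewrites one spelling as the other, so software (or a
     * font) that cannot stack diacritics still ends up with one glyph.  */
    return 0;
}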
>
> Well, I was just thinking how to save space. We don't need a 10 MB
> file to lug around, do we? (Well, probably .... And not that anybody
> would complain, as long as it worked.)
>
>> Coincidentally, I wrote a little program which does such a
>> normalization in Java (but hey, that is almost C)
>
> <off-topic>
>
> Somebody recently ported DOSBox to Java, BTW, so it's not "that"
> different to standard C/C++.    http://jdosbox.sf.net
>
> We used to have Kaffe for DOS, but I never tried it (old!).
>
> BTW, even OS/2 (eCS) got recent Java port, so if they can get one,
> anything's possible!!
>
> </off-topic>
>
>>> Probably easier to just tell them, "Use ICONV.EXE" (or Mined, Blocek,
>>> Foxtype, etc).   :-)
>>
>> Of course - conversion and graphical Unicode text editors like Blocek
>> will work fine and without limitation to 256 chars per codepage :-)
>
> But do most people even view or edit multiple languages (of different
> families) concurrently???
>
>> Leads to the question what else you want to do with Unicode, and one
>> such thing will be file names.
>
> (BARF!) "Modern" software still can't even handle spaces, dollar
> signs, periods, tildes, exclamation points, and other "weird"
> characters, much less Unicode.
>
>>> Well, we'd have to rebuild those programs. But that means C with
>>> widechar support, and I'm not sure which compilers support that.
>>
>> That depends a lot on which programs we really want to recompile.
>> And I would not be surprised if OpenWatcom or DJGPP had wide chars.
>
> I don't know, but I'm pretty sure DJGPP doesn't (or not well, at
> least). Not sure about OW since it might (old Japanese compiler texts
> ??).
>
>> Even if not, not many methods will break if you treat UTF-8 as if
>> it were 1 byte per char. Of course doing a substring or similar and
>> cutting 2 or more bytes which are part of the same char apart will
>> mean that your result will be invalid UTF-8 and will look trashy.
>>
>> Still, a few carefully chosen macros could be enough to make some
>> sort of UTF-8 support toolkit even for non-Unicode compilers, so
>> you could more easily port your software with help of the macros.
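
For what it's worth, such a toolkit really can be tiny. A rough sketch 
(mine, plain C, just to illustrate the idea) of the helper that covers 
most layout needs - counting chars instead of bytes:

#include <stdio.h>

/* A continuation byte in UTF-8 always looks like 10xxxxxx. */
#define UTF8_IS_CONT(b)  (((unsigned char)(b) & 0xC0) == 0x80)

/* Count characters (not bytes) in a NUL-terminated UTF-8 string by
 * skipping the continuation bytes.  Good enough for column counting
 * with latin/cyrillic text; CJK double-width cells need more work.  */
static unsigned utf8_strlen(const char *s)
{
    unsigned n = 0;
    for (; *s; s++)
        if (!UTF8_IS_CONT(*s))
            n++;
    return n;
}

int main(void)
{
    const char sample[] = "caf\xC3\xA9";   /* "cafe" with e-acute: 4 chars, 5 bytes */
    printf("%u chars in %u bytes\n", utf8_strlen(sample),
           (unsigned)(sizeof(sample) - 1));
    return 0;
}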
>
> But developers can't even be bothered to do simple things already, so
> it's unlikely they want more "workarounds" (sadly). But hey, that's
> their problem.
>
>>>> PS: That DISPLAY could store a big Unicode font in XMS and cache a
>>>> number of recently used chars, or run entirely in protected mode.
>>>
>>> XMS already assumes 286, so jumping to 386 pmode wouldn't be a far
>>> stretch. (I would be surprised if anybody besides Japheth understands
>>> 286 pmode these days. It's certainly 1000x less popular than 386
>>
>> Correct, but one would have to check the performance of that. Yet
>> both XMS and software EMS have overhead and given the very coarse
>> granularity of EMS (4k or 16k) it might not be that cool for other
>> people apart from Jim Leonard with his 8088 with EMS ISA card ;-)
>
> I forgot that the DPMI standard supports 286 and 386, but writing a
> TSR for DPMI is pretty much hard to (not quite) impossible (and ugly).
> I know we're not necessarily saying TSR here, and 286 pmode tools are
> fairly rare, but still .... At least most DOS extenders support
> various kinds of memory schemes.
>