[NTG-context] towards some more consistency in regimes & unicode support

Mojca Miklavec Tue, 13 Sep 2005 08:12:27 -0700


Hello,

Sorry for a slightly longer mail. I wanted to send it to context-dev, but there is probably someone else besides Adam out there who could contribute (for example, to re-check the Greek or Cyrillic sections of Unicode, or even to add some missing Hebrew definitions). If someone thinks that it's more appropriate, please feel free to continue the discussion on context-dev.



I. In regi-utf it would be good to add:

\defineregimesynonym[utf-8][utf]
\defineregimesynonym[utf8][utf]

II. After a long time I finally decided to write my first Ruby script. I took UnicodeData.txt, the Adobe Glyph List and enco-uc.tex, collected everything together, removed characters above FFFF (in case someone needs them they can trivially be added again, but I don't think that anyone is planning to name them shortly), did some manual corrections ... and here are the results:

    http://pub.mojca.org/tex/enco/contextlist/
    http://pub.mojca.org/tex/enco/contextbase/regi-temp.tex

The idea behind this is that there is no "definitive reference" for the ConTeXt glyph names, which means that every new regime that should be supported needs a lot of manual work and leads to many inconsistencies.

The file contextnames.txt contains the Unicode hexadecimal number, the pdf name (from the Adobe Glyph List), the ConTeXt name and the Unicode name. This could then be a source of information when adding new regimes, writing unicode vectors (unic-*) or mapping to font encodings. Uppercasing/lowercasing information for font encodings and other files can now be derived directly from Unicode and this list (Unicode already contains information about upper/lowercase variants of the letters) ...
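To illustrate how case information could be derived from UnicodeData.txt together with such a name list, here is a minimal Python sketch (not the Ruby script mentioned above). The two sample records follow the semicolon-separated layout of UnicodeData.txt. The CONTEXT_NAMES table and the function names are assumptions made for the example, not the actual contents or format of contextnames.txt.

```python
# Sketch only: extract code point, Unicode name and simple case mappings
# from UnicodeData.txt-style records. Fields 12 and 13 of each record hold
# the simple uppercase and lowercase mappings.

SAMPLE = """\
00C9;LATIN CAPITAL LETTER E WITH ACUTE;Lu;0;L;0045 0301;;;;N;LATIN CAPITAL LETTER E ACUTE;;;00E9;
00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;LATIN SMALL LETTER E ACUTE;;00C9;;00C9
"""

# Hypothetical stand-in for the ConTeXt name column of the list.
CONTEXT_NAMES = {0x00C9: "Eacute", 0x00E9: "eacute"}


def parse_unicode_data(text, limit=0xFFFF):
    """Return {code point: record}, skipping code points above `limit`."""
    records = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        code = int(fields[0], 16)
        if code > limit:
            continue
        records[code] = {
            "name": fields[1],
            "upper": int(fields[12], 16) if fields[12] else None,
            "lower": int(fields[13], 16) if fields[13] else None,
        }
    return records


def case_pairs(records, names):
    """List (uppercase name, lowercase name) pairs for named characters."""
    pairs = []
    for code, rec in sorted(records.items()):
        lower = rec["lower"]
        if lower is not None and code in names and lower in names:
            pairs.append((names[code], names[lower]))
    return pairs


RECORDS = parse_unicode_data(SAMPLE)
PAIRS = case_pairs(RECORDS, CONTEXT_NAMES)
```

With the two sample records, PAIRS holds the single pair Eacute/eacute. A real run would read the full UnicodeData.txt and the complete name list instead of the inline samples.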

There is some more info missing, which should be either packed within the same file or in separate files:

- ConTeXt synonyms (like \Dcroat -> \Dstroke, ...)

- pdf synonyms (dbar -> dcroat), to help recognize the glyphs in .enc or .afm files and automate support for them

- faking the characters (\ccaron -> \buildtextaccent\textcaron{c})
- unaccented version of the characters (\Aacute -> A, ...)

- other characters not present in Unicode (Caron, Acute - these are accents for uppercase letters, ...)

- (I'm sure that I wanted to add some more points, but I don't remember any others right now)

When I wanted to add the names from unic-34.tex, I realized that we don't really need to have a command for "every single Unicode character" (we certainly don't need to map math characters into that region), but if someone already has a file with Unicode integrals, it costs nothing to give them those characters in the output. (In short: 0x2211, "N-ARY SUMMATION", should expand into $\sum$, but not the other way round.)

I have to slightly change the syntax in the ConTeXt glyph names file to note this difference and to be able to define math (and other) signs properly.


------------------------------------------------------------------------

III. Now I need some help - someone should help me revise the file contextname.txt (I prepared an HTML version of it): correct mistakes (if any are spotted), add new definitions, help to prepare a list of synonyms, a list of expansions (\buildtextaccent), ...

------------------------------------------------------------------------

Here are some points that I spotted but can't fix alone:

1. Characters missing (needed by some regimes):

0020-007F section

037A GREEK YPOGEGRAMMENI
0384 GREEK TONOS
0385 GREEK DIALYTIKA TONOS
2015 HORIZONTAL BAR
2017 DOUBLE LOW LINE
20AA NEW SHEQEL SIGN
20AB DONG SIGN
20AF DRACHMA SIGN
2116 NUMERO SIGN
200E LEFT-TO-RIGHT MARK
200F RIGHT-TO-LEFT MARK

1Exx section

2. Greek - there are some name inconsistencies when compared to the unic-031 vector, but I don't know anything about old Greek. I didn't check Cyrillic at all.

3. Punctuation and accents - mostly names for quotes and language dependency (lowerleftuppersixquote in comparison to lftdblquote ... or whatever they are called) (+ tricks; I already asked about quotes & hyphenation approximately a week ago).

I have problems understanding the difference between letter modifiers (U+02Cx) and the usual accents (U+00Bx). "Combining Diacritical Marks" (U+03xx) should be supported somehow as well. I have no idea how to make U+0065 U+0301 (e + combining acute accent) into eacute.

4. Should hungarumlaut be doubleacute and hungarumlaut only its synonym, or the other way round?


5. tbar vs. tstroke: compare 0166 and 023E

6. The cedilla/commaaccent dilemma: there's a huge problem with "t with cedilla" (0162). "t with comma below" (021A) should be used instead (at least this is what the Unicode reference states), but most regimes map a character to "t with cedilla" (0162), which seems stupid to me. The Adobe Glyph List therefore uses tcommaaccent for "t with cedilla", which looks like "t with comma accent" but is in the wrong place. lmr has both tcommaaccent and tcedilla. In my opinion \tcedilla should be "t with cedilla" and \tcommaaccent "t with comma accent". That currently isn't the case in ConTeXt, unless something has changed recently.

There are many other letters wrongly named in Unicode ("with cedilla") although they have a comma. I would suggest naming them \[gklnr]commaaccent and using \[gklnr]cedilla as a synonym (if needed at all for backward compatibility; otherwise it would be better to leave them out: there is no such letter with a cedilla in Unicode, and if someone needs one, it can be constructed trivially with \buildtextaccent).

7. There's "a-kind-of-bug-but-not-really-one" in enco-ans.tex. textcedilla maps to 184, which isn't defined in Antykwa, for example (there it's in position 24). It's more of a "bug" in the texnansi encoding, which has the cedilla in two places, which is pretty stupid. But anyway:

\definecharacter textcedilla 24
would solve some problems (and hopefully not introduce new ones).

8. Most letters are named like
"c with cedilla" -> ccedilla

What about the names for "open o", "turned e", "long s", "turned r with hook"?

\openo or \oopen? \rturnedhook or \turnedrhook?

9. Can Latin letters and numbers be accessed somehow by name?

10. Adam prepared some dingbats support I think, this could be added here.

11. There's a showunicode pdf document on pragma-ade.com (at least I saw it once), but it's not listed on overview.htm.

12. I don't know if anyone would ever need to switch from the viscii regime to some other one, but what would happen to the characters under 128 (some of them are redefined in viscii)? I'm afraid that Vietnamese leftovers would remain in the lower part of the table.

13. If there are any other comments on the table and/or the script(s), please let me know.
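On point 3 (turning U+0065 U+0301 into eacute): at the Unicode level this is canonical composition, and the needed pairs are already recorded in the decomposition field of UnicodeData.txt. The Python sketch below uses the standard unicodedata module to show the mapping and to build a (base, combining mark) lookup table. The function names and the chosen code point range are assumptions for illustration, and how a regime would consume such a table inside ConTeXt is left open here.

```python
# Sketch only: canonical composition of base letter + combining mark,
# using Python's standard unicodedata module.
import unicodedata


def compose(text):
    """Apply canonical composition (normalization form NFC)."""
    return unicodedata.normalize("NFC", text)


def composition_table(first=0x00C0, last=0x017F):
    """Map (base, combining mark) pairs to their precomposed code point.

    Only canonical two-part decompositions are kept; compatibility
    decompositions (those tagged like "<compat>") are skipped.
    """
    table = {}
    for code in range(first, last + 1):
        decomp = unicodedata.decomposition(chr(code))
        if not decomp or decomp.startswith("<"):
            continue
        parts = tuple(int(p, 16) for p in decomp.split())
        if len(parts) == 2:
            table[parts] = code
    return table


TABLE = composition_table()
```

Here compose("e\u0301") gives the single character U+00E9, and TABLE[(0x0065, 0x0301)] is 0x00E9, which the name list can then translate to eacute.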

IV. With the help of the prepared names list I processed definitions for regimes (taken from the Unicode webpage) for ISO-8859-* and cp125* (others should be trivial). They are only preliminary, and some (Hebrew, Thai, Arabic) probably don't make any sense yet, but could the rest be added to ConTeXt after someone checks that everything is OK? (The iso88595, cp1251, il1, il2, il9, windows and viscii regimes already exist and should be compared for differences.)

If possible, this should be done in such a way that it isn't necessary to include the regime definition file manually: just as \usemodule[pre-polish] finds and processes the proper file, \enableregime[xxx] should find the proper file and load it.
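The regime processing described in IV can be sketched roughly as follows, in Python rather than the original Ruby. The sample uses the tab-separated layout of the Unicode mapping tables, with a few ISO 8859-2 entries. GLYPH_NAMES is a hypothetical excerpt of the name list, and the function names are made up for the example. The sketch deliberately stops at (slot, name) pairs and does not guess at the exact macro syntax of a regime file.

```python
# Sketch only: turn a Unicode mapping table into (slot, glyph name) pairs
# and report slots whose character has no name yet.

SAMPLE_MAPPING = """\
# Excerpt in the layout of the Unicode ISO 8859-2 mapping table
0xA1\t0x0104\t#\tLATIN CAPITAL LETTER A WITH OGONEK
0xA2\t0x02D8\t#\tBREVE
0xA3\t0x0141\t#\tLATIN CAPITAL LETTER L WITH STROKE
0xB1\t0x0105\t#\tLATIN SMALL LETTER A WITH OGONEK
"""

# Hypothetical excerpt of the glyph name list (U+02D8 left out on purpose).
GLYPH_NAMES = {0x0104: "Aogonek", 0x0141: "Lstroke", 0x0105: "aogonek"}


def parse_mapping(text):
    """Return {slot: code point} for the upper half (slots >= 0x80)."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        parts = line.split()
        if len(parts) < 2:  # slot without a Unicode assignment
            continue
        slot, code = int(parts[0], 16), int(parts[1], 16)
        if slot >= 0x80:
            mapping[slot] = code
    return mapping


def regime_entries(mapping, names):
    """Split the mapping into named entries and still-unnamed characters."""
    named, missing = [], []
    for slot, code in sorted(mapping.items()):
        if code in names:
            named.append((slot, names[code]))
        else:
            missing.append((slot, code))
    return named, missing


NAMED, MISSING = regime_entries(parse_mapping(SAMPLE_MAPPING), GLYPH_NAMES)
```

The MISSING list corresponds to the kind of "characters missing (needed by some regimes)" collected under point 1: here it contains slot 0xA2 with U+02D8, because the sample name table has no entry for it.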


(And for those who made it till here - sorry again for that gigantic mail.)
        Mojca
_______________________________________________
ntg-context mailing list
ntg-context@ntg.nl
http://www.ntg.nl/mailman/listinfo/ntg-context
