Re: Unifont Re: Everson Mono
S The people working on XFree86 have plans to convert the BDF's and
S PCF's that come with X to TTF fonts with one blank scalable glyph
S and the actual data stored in a bitmap data in the font. (It's actually
S a better format for the problem in many ways. Go figure.) I don't
S know if Microsoft Windows will like such a font, though.

That's my cue. For your greatest crashing pleasure, I've made some bitmap-only TTFs from the GNU Unifont using an early alpha of the conversion tool that might end up being used by XFree86.

THESE FONTS ARE PROBABLY NOT VALID TTF FONTS. THEY WILL DO BAD THINGS. THEY WILL CAUSE YOUR WIFE TO RUN AWAY. YOUR HUSBAND TO DRINK. YOUR DAUGHTER WILL SEARCH EMPLOYMENT WITH MICROSOFT. YOUR SON WILL THINK THAT APPLE'S IMPLEMENTATION OF UNICODE IS A GOOD IDEA. I WILL NOT BE HELD RESPONSIBLE FOR ANY OF THE CONSEQUENCES.

However, if you are courageous enough to try them out, I've put an archive of the generated TTFs on

  http://www.pps.jussieu.fr/~jch/private/unifont-ttf.zip

Please note that this is 1.7 megabytes (three versions of the font), so do not download the archive unless you actually intend to try the fonts out. They will disappear in a couple of days.

Please drop me a note with your results.

Regards,

Juliusz
Re: how to display japanese on english linux
YT what does English Linux mean ? I don't think Linux itself is locale
YT specific. There are such thing call English Linux exist.

I believe that the original poster meant that his Linux distribution comes with functional DVD-playing software. American Linux cannot include CSS decryption for legal reasons.

(Seriously: you are correct, of course. The original poster, coming from the marketing-dominated background of commercial software, wrongly assumed that Free software comes in locale-specific versions. The only reason some people like locale-specific software is that it allows them to price it according to region, or to schedule releases in a culturally profitable manner -- e.g. before Christmas in the West.)

Juliusz
Re: how to display japanese on english linux
MB I installed unicode fonts and changed the locale using 'export
MB LANG=ja_JP'. I executed 'date' which returned some garbage ascii
MB characters but not japanese. I feel somewhere there is a problem
MB of selecting the right character set. The fonts are there on the
MB system, but they aint being picked up for display.

You need to run a terminal emulator with the right font set.

For the Unicode Japanese locale (LC_ALL=ja_JP.UTF8), any UTF-8 terminal should be fine; recent versions of XFree86 come with a Unicode version of xterm, usually installed under the name uxterm (it's actually the same binary as xterm, but run with different options). The Gnome and KDE terminal emulators also support UTF-8.

For the EUC-JP locale (LC_ALL=ja_JP), you need an... EUC-JP terminal emulator. Feel free to experiment with kterm or with the uxterm + luit combo:

  LC_ALL=ja_JP uxterm -e luit

(In future versions of XFree86, xterm will do the Right Thing for the current locale and selected font, i.e. run in eight-bit mode, run in Unicode mode, or run in Unicode mode and invoke luit.)

Juliusz
Re: Normalisation and font technology
JH Apple recently started applying normalisation to file names in Mac
JH OS X, with the result that the content of folders can now only be
JH correctly displayed with fonts that contain the necessary AAT
JH table information

That's very surprising, especially considering the excellent job they did with Openstep 4.0. Even if you work with fully decomposed characters internally, mapping to precomposed glyphs at display time is a triviality. And even if you can't find a suitable precomposed glyph or a suitable entry in the smart font, for a large number of combining classes you can provide legible, albeit not necessarily typographically satisfying, output by semi-randomly positioning the components.

JH Do you really want word processing applications or web browsers
JH that can only correctly display text in a handful of fonts on a
JH user's system?

No.

  http://www.pps.jussieu.fr/~jch/software/cedilla/

Please note that this is not software meant for actual use; it is just an experiment to show that we don't need heavy artillery in order to implement reasonable typesetting for the GLC subset of Unicode.

JH This in turn suggests that if text is going to be decomposed in
JH normalisation, it should be recomposed in a buffered character
JH string prior to rendering.

The approach taken in Cedilla is different. The text is typeset as a sequence of Combining Character Sequences (CCS). Given a (normalised) CCS ``b c1 c2 ... cn'', Cedilla first attempts to find a precomposed glyph; if that fails, it attempts to find a precomposed glyph for ``b c1 ... c(n-1)'' and compose it with the glyph for ``cn''. All of that happens on the fly; there is never any need to do buffering. With suitable memoisation (caching), only a tiny fraction of the execution time is spent searching for the right glyphs.

Cedilla implements a number of other techniques for conjuring suitable glyphs; the main difficulty was finding the right ordering of the various fallbacks. It turns out that it is more important to avoid the ransom-note effect than to find the best glyph.

Juliusz
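The recursive fallback described above can be sketched in a few lines of Python. The glyph table and glyph names below are purely hypothetical stand-ins for real font data; this is not Cedilla's actual code, just an illustration of the search order, with memoisation playing the caching role mentioned in the message.

```python
from functools import lru_cache

# Hypothetical glyph table: maps a tuple of codepoints to a glyph name.
# In a real renderer, a font's cmap and lookup tables would be consulted.
GLYPHS = {
    ("e",): "e",
    ("e", "\u0301"): "eacute",       # precomposed e + acute
    ("\u0301",): "acutecomb",
    ("\u0308",): "dieresiscomb",
}

@lru_cache(maxsize=None)
def find_glyphs(ccs):
    """Given a normalised CCS (base, c1, ..., cn), return a tuple of
    glyph names to be overlaid.  Try a precomposed glyph for the whole
    sequence first; failing that, recurse on (base, c1, ..., c(n-1))
    and compose the result with the glyph for cn."""
    if ccs in GLYPHS:
        return (GLYPHS[ccs],)
    if len(ccs) > 1:
        prefix = find_glyphs(ccs[:-1])
        if prefix is not None and ccs[-1:] in GLYPHS:
            return prefix + (GLYPHS[ccs[-1:]],)
    return None  # further fallbacks (naive stacking, etc.) would go here

print(find_glyphs(("e", "\u0301")))           # precomposed glyph found
print(find_glyphs(("e", "\u0301", "\u0308"))) # precomposed prefix + diaeresis
```

Thanks to the lru_cache memoisation, repeated sequences are looked up only once, which is why the search cost all but disappears in practice.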
Re: Normalisation and font technology
JJ and that AAT data in the fonts is respected by the Finder, even
JJ for PUA characters. I can name a file in Pollard if I like, so
JJ long as an appropriate font is present.

A Unicode string is a finite sequence of 16-bit values the interpretation of which is determined by the font currently in use?

Juliusz

P.S. Don't extrapolate: I think MacOS X is a very nice system indeed. But the news given by John Hudson is depressing.
Re: Unicode and end users
MK What we are trying to establish is the exact meaning that UNICODE
MK ought to have - that is, if it can have one at all.

In the Unix-like world, the term ``UTF-8'' has been used quite consistently, and most documentation avoids using ``Unicode'' for an on-disk format (using it for the consortium, er, the Consortium, the character repertoire and, when useful, the coded character set). The Unix-like public is used to thinking of UTF-8 as the format in which Unicode text is saved on disk, and ``UTF-8 (Unicode)'' or perhaps ``Unicode (UTF-8)'' should be the preferred user-interface item.

MK Are there, in fact, many circumstances in which it is necessary
MK for an end user to create files that do *not* have a BOM at the
MK beginning?

You should never use either BOMs or UTF-16 on Unix-like systems; using either will break too much of the system.

Juliusz
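Two quick illustrations (in Python, purely for demonstration) of why BOMs and UTF-16 break Unix-like systems: UTF-16 sprinkles NUL bytes through even pure-ASCII text, which NUL-terminated C strings and byte-oriented tools cannot survive, and a BOM displaces the `#!` magic from the start of a script, so the kernel no longer recognises the interpreter line.

```python
text = "#!/bin/sh\necho hello\n"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

# UTF-16 embeds NUL bytes in plain ASCII text...
print(b"\x00" in utf8)    # False
print(b"\x00" in utf16)   # True

# ...and a BOM (added here via Python's "utf-8-sig" codec, standing in
# for a BOM-inserting editor) shifts "#!" away from byte offset 0.
with_bom = text.encode("utf-8-sig")
print(utf8[:2])       # b'#!'
print(with_bom[:3])   # b'\xef\xbb\xbf' -- the shebang is no longer first
```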
Re: A few questions about decomposition, equivalence and rendering
JC It's pretty much a given that a normalization form that meddles with
JC plain ASCII text isn't going to get used.

I had to think about it, but it does make sense.

JC The U+1Fxx ones are the spacing compatibility equivalents,

Compatibility with what?

Juliusz
Re: A few questions about decomposition, equivalence and rendering
Thanks a lot for the explanations.

KW There is no good reason to invent composite combining marks
KW involving two accents together. (In fact, there are good reasons
KW *not* to do so.) The few that exist, e.g. U+0344, cause
KW implementation problems and are discouraged from use.

What are those problems? As long as they have canonical decompositions, won't such precomposed characters be discarded at normalisation time, hopefully during I/O? (I'm not arguing in favour of precomposed characters; I'm just saying that my gut instinct is that we have to deal with normalisation anyway, and hence they don't complicate anything further; I'd be curious to hear why you think otherwise.)

  As far as I can tell, there is nothing in the Unicode database that
  relates a ``modifier letter'' to the associated punctuation mark.

KW Correct. They are viewed as distinct classes.

  does anyone [have] a map from mathematical characters to the
  Geometric Shapes, Misc. Symbols and Dingbats that would be useful
  for rendering?

KW As opposed to the characters themselves? I'm not sure what you
KW are getting at here.

Two examples. The user invokes a search for ``f o g'' (the composite of g with f), and she entered U+25CB WHITE CIRCLE. The document does contain the required formula, but encoded with U+2218 RING OPERATOR. The user's input was arguably incorrect, but I hope you'll agree that the search should match.

Or: I'm rendering a document that contains U+2218. The current font doesn't contain a glyph associated with this codepoint, but it has a perfectly good glyph for U+25CB. The rendering software should silently use the latter.

Analogous examples can be made for the ``modifier letters''. I'll mention that I do understand why these are encoded separately[1], and I do understand why and how they will behave differently in a number of situations. I am merely noting that there are applications (useful-in-practice search, rendering) where they may be identified or at least related, and I am wondering whether people have already compiled the data necessary to do so.

Thanks again,

Juliusz

[1] Offtopic: I have mixed feelings about the inclusion of STICS. On the one hand it's great to at last have a standardised encoding for math characters; on the other, I feel it is based on very different encoding principles than the rest of Unicode.
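The search scenario above amounts to folding look-alike characters to a common representative before comparing. The table below is a hypothetical, hand-made one (it is exactly the kind of data the message is asking for; Unicode itself does not supply it):

```python
import unicodedata

# Hypothetical folding table relating mathematical operators to the
# look-alike Geometric Shapes; a real table would be far larger.
FOLD = {
    "\u25CB": "\u2218",   # WHITE CIRCLE -> RING OPERATOR
    "\u2212": "-",        # MINUS SIGN  -> HYPHEN-MINUS
}

def fold(s):
    """Normalise, then map confusable characters to one representative."""
    s = unicodedata.normalize("NFC", s)
    return "".join(FOLD.get(c, c) for c in s)

document = "f \u2218 g"   # encoded with RING OPERATOR
query = "f \u25CB g"      # the user typed WHITE CIRCLE

print(fold(query) == fold(document))   # True: the search matches
```

The same table, read in the other direction, gives a renderer the U+25CB glyph to fall back on when the font has nothing for U+2218.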
A few questions about decomposition, equivalence and rendering
Dear all,

Sorry if these questions have been answered before.

Spacing diacritical marks (e.g. U+00A8) have compatibility decompositions of the form 0020 <combining mark>. Why are these not canonical decompositions? Under what circumstances would you expect the spacing marks to behave differently from their decompositions? The two that are in ASCII don't decompose. Is that because they're overloaded?

A number of combining characters (e.g. U+0340, U+0341, U+0343) have canonical equivalents, i.e. canonical decompositions that are a single character. In other words, we have pairs of codepoints that are bound to behave in exactly the same manner under all circumstances. What's the deal?

Unicode contains a number of precomposed spacing diacritical marks for Greek (e.g. U+1FC1). However, and unless I've missed something, with the exception of U+0385 they do not have combining (non-spacing) versions. What's the rationale here? (Similar precomposed diacritical marks do not seem to exist for Vietnamese, which makes me think they've been included for compatibility with legacy encodings rather than for a good reason. Still, because their decompositions are not canonical, they need to be taken into account, which in my case complicates what would otherwise be somewhat cleaner code.)

When rendering stacked combining characters (i.e. sequences of combining characters with the same non-zero combining class), which sequences need to be treated specially (as opposed to being stacked on top of each other)? I already know about the pairs needed for Greek (both Mono- and Polytonic) and Vietnamese.

As far as I can tell, there is nothing in the Unicode database that relates a ``modifier letter'' to the associated punctuation mark. Is that right? Does anyone have such data that I could steal? (Hopefully with no legal strings attached.) (Aside: I would expect a search function in a text editor or a search engine to identify modifier letters with punctuation marks -- I expect the two to be confused in practice. But I couldn't find anything to this effect in the Book.)

On a related note, does anyone have a map from mathematical characters to the Geometric Shapes, Misc. Symbols and Dingbats that would be useful for rendering?

Thanks a lot,

Juliusz
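The decompositions discussed above can be inspected directly with Python's unicodedata module, which follows the Unicode Character Database:

```python
import unicodedata

# U+00A8 DIAERESIS: a *compatibility* decomposition to SPACE + U+0308,
# flagged <compat>, so canonical normalisation (NFC/NFD) leaves it alone.
print(unicodedata.decomposition("\u00A8"))     # '<compat> 0020 0308'
print(unicodedata.normalize("NFC", "\u00A8"))  # still U+00A8

# U+0340 COMBINING GRAVE TONE MARK: a *canonical* decomposition to the
# single character U+0300, so the two are interchangeable under
# normalisation.
print(unicodedata.decomposition("\u0340"))     # '0300'
print(unicodedata.normalize("NFC", "a\u0340")
      == unicodedata.normalize("NFC", "a\u0300"))   # True
```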
Re: [OT] o-circumflex
It's as weird as some Italian names for German cities: Aquisgrana for Aachen, Augusta for Augsburg, Magonza for Mainz, Monaco (di Baviera) for München.

MK Interesting that Polish names of these cities are more like Italian
MK than German: Akwizgran, Augsburg, Moguncja, Monachium.

Because they're adaptations of the mediaeval Latin names. The same is true of historically important Polish cities, by the way: Varsovie and Cracovie in French, Varsavia and Cracovia in Italian. English uses the German names instead (Warsaw, Cracow).

Juliusz
Re: Opentype support under Linux
Dear William,

The author of Pango is Owen Taylor, and you can reach him as otaylor at redhat.com. I would very strongly suggest that you do so. Owen has been doing a great job, and I personally have no doubt that Pango is the future of multilingual text display under Unix-like systems. Adding support for Burmese to Pango is the best way to ensure that future applications will support your language.

OpenType is but one of the ways to go. I believe that Owen is the best person to ask for advice.

Regards,

Juliusz

P.S. Sarasvati, it looks like I'm banned from posting to the Unicode list. Any chance you could look into it?
Compressing Unicode [was: A UTF-8 based News Service]
[sorry if you receive this twice -- wee little problem with my mailer]

D Recently I created a test file of all Unicode characters in code
D point order (excluding the surrogates, but including all the other
D non-characters). I will admit up front that this is a pathological
D test case and real-world data probably won't behave anywhere near
D the same.

This test is completely and utterly meaningless. (CP/M 100 % faster than Unix, according to Ziff-Davis.)

Flate compression (used by both the ZIP and gzip formats) is a two-step process. First, repeated strings are eliminated using a variant of LZ77. Then the resulting data are encoded using, I believe, dynamic Huffman coding.

In the case of SCSU, your data contain the very same byte sequence every window length. The LZ step will reduce every occurrence but the first of this sequence to a single token, which the Huffman coding will then reduce to a handful of bits. In the other case, the UTF-8 version of your data doesn't contain a single repeated byte sequence, which is extremely pathological indeed; thus, Flate on these data degenerates to dynamic Huffman.

A trivial differential predictor (applied to codepoints, not to UTF-8 byte values) would yield much better results on this data than SCSU (roughly 99.9% compression, I believe). Doug, are you trying to sell us a bridge?

Juliusz
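The differential-predictor claim is easy to check. The sketch below is a rough reconstruction of the experiment, not Doug's actual test file (it excludes the surrogates, as he did); exact sizes will vary with the zlib version and level, but the delta stream, being almost entirely the byte sequence 01 00 00 00 repeated, collapses to a tiny fraction of its size.

```python
import zlib

# All codepoints in order, surrogates excluded.
points = [c for c in range(0x110000) if not 0xD800 <= c < 0xE000]

utf8 = "".join(map(chr, points)).encode("utf-8")

# Trivial differential predictor: store the difference between
# consecutive codepoints (almost always 1) as 4 little-endian bytes.
deltas = b"".join(
    (b - a).to_bytes(4, "little")
    for a, b in zip([0] + points, points)
)

c_utf8 = zlib.compress(utf8)
c_delta = zlib.compress(deltas)

print(len(utf8), len(c_utf8))      # UTF-8 stream compresses comparatively poorly
print(len(deltas), len(c_delta))   # the delta stream all but vanishes
print(1 - len(c_delta) / len(deltas))   # well over 99% compression
```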
Re: The perfect solution for the UTF-8/16 discussion
CB The solution to ASCII vs. EBCDIC would go away if we got all of
CB the hardware to support Unicode natively.

Visions of the MMU performing normalisation on the fly during a DMA transfer from the paper tape reader.

Juliusz
More about UTF-8S: don't multiply UTFs
Dear all,

In the discussion about UTF-8S, there is one point that has not been mentioned (or else I missed it). Most people seem to be arguing from the point of view of users and developers on platforms on which Unicode is well established as the default encoding. On Unix-like systems, however, ISO 2022-based encodings are still alive and kicking. Hard.

One of the main arguments in favour of using Unicode on such platforms is that it leads to a world in which there is only one encoding, both for the user and the developer. The multiplication of UTFs, however, not only breaks this model, but also leads to much confusion. (Heck, many users still think that UTF-8 and Unicode are two completely unrelated encodings! Try explaining to them that UTF-16 is Unicode too!)

I tried to point this out when IANA were introducing UTF-16-BE and other monstrosities, only to be treated in a rather patronising manner by some of the respectable members of this list (``Juliusz's confusion can be explained by...''). Folks, from a user's perspective, UTF-8 and UTF-16 are two different encodings. Please don't make the situation worse than it already is. Don't create any more UTFs.

Whatever happens, we will continue to promote signature-less UTF-8 as the only user-visible encoding, and signature-less UTF-8 (mb) and BOM-less UCS-4 (wc) as the only programmer-visible ones. The more UTFs the Unicode consortium legitimises, the more explaining we'll have to do that ``this is just a version of Unicode used on some other platforms, please convert it to UTF-8 before use.''

Regards,

Juliusz Chroboczek
Re: Support for UTF-8 in ISO-2022/6429 terminals
Darren,

DM Now, we added UTF-8 support to the ANSI task following the
DM ISO-IR 196 specification.

This is great to hear.

DM Does anyone know of any examples of host computers or operating
DM systems that actually use UTF-8 on an ISO 6429 implementation?

Currently, the main application that can make good use of a UTF-8 terminal is the ``lynx'' text-mode web browser. It will automatically convert web pages from a variety of encodings into whatever the terminal's encoding is, including UTF-8.

Perhaps more importantly, a number of Unix-like systems already have, or will soon have, support for Unicode locales. Properly internationalised applications running under such locales assume UTF-8 for terminal I/O.

To summarise: vendors of terminal emulators are going to have to provide UTF-8 support in the near future. It is great to hear that you've started working on this now, rather than when your customers start complaining.

Regards,

Juliusz
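One implementation detail worth noting for anyone adding UTF-8 to an ISO 6429 parser, sketched here in Python as a byte-level demonstration: UTF-8 was designed so that no byte of a multibyte sequence falls in the C0 range, so 7-bit control sequences interleave with UTF-8 text unambiguously; but continuation bytes do land in the C1 range 0x80-0x9F, so a UTF-8 terminal must not interpret 8-bit controls such as the single-byte CSI (0x9B).

```python
CSI = b"\x1b["   # 7-bit Control Sequence Introducer (ESC [)

payload = "héllo, 世界".encode("utf-8")
line = CSI + b"1;31m" + payload + CSI + b"0m\n"   # bold red text, then SGR reset

# No byte of the UTF-8 payload is a C0 control or DEL, so the escape
# parser can never be confused by text...
print(any(b <= 0x1F or b == 0x7F for b in payload))   # False

# ...but continuation bytes DO fall in the C1 range, so 8-bit C1
# controls must be disabled in UTF-8 mode.
print(any(0x80 <= b <= 0x9F for b in payload))        # True

print(line.decode("utf-8"))   # the mixed stream round-trips cleanly
```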
Re: Latin w/ diacritics (was Re: benefits of unicode)
MC Well, I am not saying that it would be easy, or that it would be worth
MC doing, but would it really take *millions* of dollars for implementing
MC Unicode on DOS or Windows 3.1?

MC BTW, I don't know in detail the current status of Unicode support
MC on Linux, but I know that projects are ongoing.

Okay, I'll byte, although I prefer to speak of ``free Unix-like systems'' rather than Linux only.

The easiest way of browsing the multilingual web on a 386 with 4 MB of memory and a 10 MB hard disk is probably to use the text-mode ``lynx'' browser in a terminal emulator that supports (a sufficiently large subset of) Unicode. One such terminal emulator is the Linux console, which only supports the very basics of Unicode. An alternative is the XFree86 version of XTerm, which also supports single combining characters and double-width glyphs. (Enough, for example, for Chinese or Thai, but not for Arabic.) In order to use that on a machine such as the one outlined above, you'll probably need to build a custom X server to save space, but it's definitely doable. (Drop me a note if you need a hand.)

I know of the existence of fairly lightweight and fully internationalised graphical browsers for Unix-like systems (Konqueror comes to mind), but I doubt you'll get away with much less than a fast 486 with 12 MB of memory and 100 MB of disk.

Regards,

Juliusz
Re: Displaying unicode.....
DG What is the best "way" to display unicode charatcers on an intel
DG platform running redhat Linux???

This is an interesting question, and one that is currently the subject of much debate.

One possible answer is that you need to use version 2.2 or later of the C library, version 4.0.3 or later of the XFree86 libraries and fonts, and run in a UTF-8 locale. Properly internationalised applications should then be able to do some primitive processing of Unicode text.

The other answer is that a number of recent applications use Unicode internally in all locales, and only use the locale's encoding on I/O. This is the case with the XFree86 version of XTerm when run with the `-u8' flag, with Mozilla, with KDE 2, and I believe also with development versions of Gnome. Such applications are likely to have better support for Unicode rendering (combining characters, contextual glyph substitution, etc.).

A suitable forum for this sort of discussion is the XFree86 i18n list, which you should feel welcome to join:

  http://www.xfree86.org/mailman/listinfo/i18n

Regards,

Juliusz Chroboczek
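The second model -- Unicode inside, the locale's encoding only at the I/O boundary -- looks roughly like the following Python sketch. (In a real application the encoding would come from the locale, e.g. via nl_langinfo(CODESET); the two encodings here are just examples.)

```python
# Internally, text is always Unicode, regardless of locale:
text = "\u65E5\u672C\u8A9E"   # 日本語

# Only at the I/O boundary is it converted to the locale's encoding,
# and converted back on input.
for encoding in ("UTF-8", "EUC-JP"):
    wire = text.encode(encoding)      # output path
    back = wire.decode(encoding)      # input path
    print(encoding, wire, back == text)
```

The payoff is that all internal processing (searching, line-breaking, rendering) is written once, against Unicode, instead of once per locale encoding.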
Re: press release
MB Output goes to PDF, PostScript, line printers, PCL as well as
MB HTML/XML. It would sure be nice if all those technologies handled
MB context sensitive glyph placement...but this is only the year
MB 2000.

PostScript and, to a certain extent, PDF do not manipulate characters; all they ever see is glyphs. The application generating PS or PDF is supposed to do the glyph selection and placement.

J.
Re: UTF-8N?
(I've allowed myself to quote from a number of distinct posts.)

DE On the contrary, I thought Peter's point was that the OS (or the
DE split/merge programs) should *not* make any special assumptions
DE about text files.

Sorry if I wasn't clear. I was taking for granted that OSes will not reliably keep track of file types (we all know the problems that this creates for VMS and Apple Mac users). I was pointing out that, without a clear notion of file type, the BOM is a bad idea.

PC Without rules, users will generate UTF-8 files that both do and
PC don't start with a BOM. If there is software out there that's going
PC to blow up in one or the other case, that's not a satisfactory
PC state of affairs.

The problem is not one of broken software. The problem is that, as John Cowan explained in detail, with the addition of the BOM, UTF-8 and UTF-16 become ambiguous. (In what follows, I use ``a Unicode file'' for ``a file containing Unicode data in one of UTF-8 or UTF-16''.)

It all stems from the fact that U+FEFF is not only what is used for the BOM, but also a valid Unicode/ISO 10646 codepoint. The issue would be solved by deprecating the use of U+FEFF as a Unicode character (for example by defining a new codepoint for ZWNBSP), and using U+FEFF for the BOM only. The standard could then say that applications should discard all occurrences of U+FEFF when reading a Unicode file, and allow applications to insert U+FEFF at arbitrary points when writing one. I suspect that deprecating U+FEFF is not politically acceptable for Unicode and ISO 10646, though.

PC Doesn't that simply indicate that, in a protocol that dissects a
PC long file into parts to be transmitted separately, it is
PC inappropriate to add a BOM to the beginnings of the parts, whether
PC they use UTF-8 or UTF-16?

Appropriate or not, users (you know, those people who don't read the documentation that the programmers don't write) will use text editors to split files. They will then concatenate the files using a non-Unicode-aware tool. And they will complain that the checksums mismatch. (What do *you* use to split files on a Windows machine that doesn't have your favourite utilities installed?)

PC I think that the variations in BOM are just as "uninteresting" as
PC the variations in line ending:

Just as uninteresting and just as annoying. The difference is that we've had over twenty years to learn to deal with CR/LF mismatches (and fixed-length records, and Fortran carriage control). The BOM issue opens a whole new area to make new mistakes in. (Whom should I contact to register ``UCS-4PDP11'', the mixed-endian form of UCS-4?)

Regards,

Juliusz Chroboczek
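The split-and-concatenate failure is easy to reproduce. In the Python sketch below, the "utf-8-sig" codec stands in for a BOM-inserting editor, and plain byte concatenation stands in for cat or copy /b:

```python
# Two halves of a document, each saved by a BOM-inserting editor:
part1 = "first half\n".encode("utf-8-sig")
part2 = "second half\n".encode("utf-8-sig")

# A non-Unicode-aware tool just glues the bytes together:
merged = part1 + part2

# Read back as UTF-8, a stray U+FEFF now sits in the MIDDLE of the
# text, and the bytes no longer match a straight UTF-8 encoding of
# the original document, so any checksum comparison fails.
text = merged.decode("utf-8")
print("\ufeff" in text[1:])   # True: an embedded ``BOM''

original = "first half\nsecond half\n".encode("utf-8")
print(merged == original)     # False: checksums will mismatch
```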