Re: Unifont Re: Everson Mono

2002-08-28 Thread Juliusz Chroboczek

S The people working on XFree86 have plans to convert the BDFs and
S PCFs that come with X to TTF fonts with one blank scalable glyph
S and the actual data stored as bitmaps in the font.  (It's actually
S a better format for the problem in many ways. Go figure.) I don't
S know if Microsoft Windows will like such a font, though.

That's my cue.

For your greatest crashing pleasure, I've made some bitmap-only TTFs
from the GNU Unifont using an early alpha of the conversion tool that
might end up being used by XFree86.

THESE FONTS ARE PROBABLY NOT VALID TTF FONTS.  THEY WILL DO BAD
THINGS.  THEY WILL CAUSE YOUR WIFE TO RUN AWAY AND YOUR HUSBAND TO
DRINK.  YOUR DAUGHTER WILL SEEK EMPLOYMENT WITH MICROSOFT.  YOUR SON
WILL THINK THAT APPLE'S IMPLEMENTATION OF UNICODE IS A GOOD IDEA.


I WILL NOT BE HELD RESPONSIBLE FOR ANY OF THE CONSEQUENCES.

However, if you are courageous enough to try them out, I've put an
archive of the generated TTFs on

  http://www.pps.jussieu.fr/~jch/private/unifont-ttf.zip

Please note that this archive is 1.7 megabytes (three versions of the
font), so do not download it unless you actually intend to try the
fonts out.  It will disappear in a couple of days.

Please drop me a note with your results.

Regards,

Juliusz




Re: how to display japanese on english linux

2002-06-12 Thread Juliusz Chroboczek

YT What does English Linux mean?  I don't think Linux itself is
YT locale-specific.  I don't think such a thing as an English Linux exists.

I believe that the original poster meant that his Linux distribution
comes with functional DVD-playing software.  American Linux cannot
include CSS decryption for legal reasons.

(Seriously: you are correct, of course.  The original poster, coming
from the marketing-dominated background of commercial software,
wrongly assumed that Free software comes in locale-specific versions.
The only reason some people like locale-specific software is that it
allows them to price it according to region, or to schedule releases
in a culturally profitable manner -- e.g. before Christmas in the
West.)

Juliusz




Re: how to display japanese on english linux

2002-06-10 Thread Juliusz Chroboczek

MB I installed Unicode fonts and changed the locale using 'export
MB LANG=ja_JP'.  I executed 'date', which returned some garbage ASCII
MB characters but not Japanese.  I feel there is a problem somewhere
MB with selecting the right character set.  The fonts are there on the
MB system, but they aren't being picked up for display.

You need to run a terminal emulator with the right font set.

For the Unicode Japanese locale (LC_ALL=ja_JP.UTF8), any UTF-8
terminal should be fine; recent versions of XFree86 come with a
Unicode version of xterm, usually installed under the name uxterm
(it's actually the same binary as xterm, but run with different
options).  The Gnome and KDE terminal emulators also support UTF-8.

For the EUC-JP locale (LC_ALL=ja_JP), you need an... EUC-JP terminal
emulator.  Feel free to experiment with kterm or with the uxterm +
luit combo:

  LC_ALL=ja_JP uxterm -e luit

(In future versions of XFree86, xterm will do the Right Thing for the
current locale and selected font, i.e. run in eight-bit mode, run in
Unicode mode, or run in Unicode mode and invoke luit.)

Juliusz




Re: Normalisation and font technology

2002-05-29 Thread Juliusz Chroboczek

JH Apple recently started applying normalisation to file names in Mac
JH OS X, with the result that the content of folders can now only be
JH correctly displayed with fonts that contain the necessary AAT
JH table information

That's very surprising.  Especially considering the excellent job they
did with OpenStep 4.0.

Even if you work with fully decomposed characters internally, mapping
to precomposed glyphs at display time is a triviality.  

And even if you don't find a suitable precomposed glyph or a suitable
entry in the smart font, for a large number of combining classes you
can provide legible, albeit not necessarily typographically
satisfying, output by semi-randomly positioning the components.
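
To make the first point concrete, here is a minimal sketch of the
precomposition step in Python; the has_glyph() predicate is a
placeholder for whatever coverage query the rendering engine provides:

  import unicodedata

  def glyphs_for(ccs, has_glyph):
      # ccs is one base character followed by its combining marks,
      # fully decomposed; has_glyph(c) asks the current font whether
      # it covers the codepoint c.
      precomposed = unicodedata.normalize("NFC", ccs)
      if len(precomposed) == 1 and has_glyph(precomposed):
          # e.g. "e" + U+0301 maps to the single glyph for U+00E9.
          return [precomposed]
      # Fall back to the base glyph plus individually positioned marks.
      return [c for c in ccs if has_glyph(c)]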

JH Do you really want word processing applications or web browsers
JH that can only correctly display text in a handful of fonts on a
JH user's system?

No.  Have a look at Cedilla:

  http://www.pps.jussieu.fr/~jch/software/cedilla/

Please note that this is not software meant for actual use; it is just
an experiment to show that we don't need heavy artillery in order to
implement reasonable typesetting for the GLC subset of Unicode.

JH This in turn suggests that if text is going to be decomposed in
JH normalisation, it should be recomposed in a buffered character
JH string prior to rendering.

The approach taken in Cedilla is different.  The text is typeset as a
sequence of Combining Character Sequences (CCS).  Given a (normalised)
CCS ``b c1 c2 ... cn'', Cedilla first attempts to find a precomposed
glyph; if that fails, it attempts to find a precomposed glyph for
``b c1 ... c(n-1)'', and compose it with the glyph for ``cn''.

All of that happens on the fly, there's never any need to do
buffering.  With suitable memoisation (caching), only a tiny fraction
of the execution time is spent on searching for the right glyphs.
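
For the curious, here is a rough Python sketch of that search; the
find_glyph() lookup is hypothetical (it stands in for Cedilla's actual
font query), and the cache plays the role of the memoisation above:

  import functools

  @functools.lru_cache(maxsize=None)
  def glyphs_for_ccs(ccs, find_glyph):
      # ccs is a normalised CCS ``b c1 c2 ... cn'' as a string.
      g = find_glyph(ccs)           # precomposed glyph for the whole CCS?
      if g is not None:
          return (g,)
      if len(ccs) == 1:
          return ()                 # no glyph at all; a later fallback takes over
      # Otherwise recurse on ``b c1 ... c(n-1)'' and compose with ``cn''.
      head = glyphs_for_ccs(ccs[:-1], find_glyph)
      last = find_glyph(ccs[-1])
      return head + ((last,) if last is not None else ())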

Cedilla implements a number of other techniques for conjuring suitable
glyphs; the main difficulty was finding the right ordering of the
various fallbacks.  It turns out that it is more important to avoid
the ransom-note effect than to find the best glyph.

Juliusz





Re: Normalisation and font technology

2002-05-29 Thread Juliusz Chroboczek

JJ and that AAT data in the fonts is respected by the Finder, even
JJ for PUA characters.  I can name a file in Pollard if I like, so
JJ long as an appropriate font is present.

A Unicode string is a finite sequence of 16-bit values, the
interpretation of which is determined by the font currently in use?

Juliusz

P.S. Don't extrapolate: I think Mac OS X is a very nice system indeed.
But the news given by John Hudson is depressing.




Re: Unicode and end users

2002-02-14 Thread Juliusz Chroboczek

MK What we are trying to establish is the exact meaning that UNICODE
MK ought to have - that is, if it can have one at all.

In the Unix-like world, the term ``UTF-8'' has been used quite
consistently, and most documentation avoids using Unicode for a disk
format (using it for the consortium, er, the Consortium, the
character repertoire and, when useful, for the coded character set).

The Unix-like public is used to thinking of UTF-8 as the format in
which Unicode text is saved on disk, and ``UTF-8 (Unicode)'' or
perhaps ``Unicode (UTF-8)'' should be the preferred user-interface
item.

MK Are there, in fact, many circumstances in which it is necessary
MK for an end user to create files that do *not* have a BOM at the
MK beginning?

You should never use either BOMs or UTF-16 on Unix-like systems; using
either will break too much of the system.

Juliusz




Re: A few questions about decomposition, equivalence and rendering

2002-02-06 Thread Juliusz Chroboczek

JC It's pretty much a given that a normalization form that meddles with
JC plain ASCII text isn't going to get used.

I had to think about it, but it does make sense.

JC The U+1Fxx ones are the spacing compatibility equivalents,

Compatibility with what?

Juliusz




Re: A few questions about decomposition, equivalence and rendering

2002-02-06 Thread Juliusz Chroboczek

Thanks a lot for the explanations.

KW There is no good reason to invent composite combining marks
KW involving two accents together. (In fact, there are good reasons
KW *not* to do so.) The few that exist, e.g. U+0344, cause
KW implementation problems and are discouraged from use.

What are those problems?  As long as they have canonical
decompositions, won't such precomposed characters be discarded at
normalisation time, hopefully during I/O?

(I'm not arguing in favour of precomposed characters; I'm just saying
that my gut instinct is that we have to deal with normalisation
anyway, and hence they don't complicate anything further; I'd be
curious to hear why you think otherwise.)

 As far as I can tell, there is nothing in the Unicode database that
 relates a ``modifier letter'' to the associated punctuation mark.

KW Correct. They are viewed as distinct classes.

 does anyone [have] a map from mathematical characters to the
 Geometric Shapes, Misc. symbols and Dingbats that would be useful
 for rendering?

KW As opposed to the characters themselves? I'm not sure what you
KW are getting at here.

The user invokes a search for ``f o g'' (the composite of g with f),
entering U+25CB WHITE CIRCLE for the ring.  The document does contain the
required formula, but encoded with U+2218 RING OPERATOR.  The user's
input was arguably incorrect, but I hope you'll agree that the search
should match.

I'm rendering a document that contains U+2218.  The current font
doesn't contain a glyph associated with this codepoint, but it has a
perfectly good glyph for U+25CB.  The rendering software should
silently use the latter.
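
The data I am after is essentially a many-to-one folding table.  A
hand-made Python sketch of how it would be used (the single entry
below is just the example above; a real table is what I am asking for):

  # Codepoints that a lenient search or a rendering fallback may
  # treat as interchangeable.
  FOLDING_CLASSES = [
      {"\u2218", "\u25CB"},        # RING OPERATOR ~ WHITE CIRCLE
  ]
  FOLD = {c: min(cls) for cls in FOLDING_CLASSES for c in cls}

  def fold(s):
      # Map every character to a representative of its class.
      return "".join(FOLD.get(c, c) for c in s)

  def lenient_find(needle, haystack):
      return fold(haystack).find(fold(needle))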

Analogous examples can be made for the ``modifier letters''.

I'll mention that I do understand why these are encoded separately[1],
and I do understand why and how they will behave differently in a
number of situations.  I am merely noting that there are applications
(useful-in-practice search, rendering) where they may be identified or
at least related, and I am wondering whether people have already
compiled the data necessary to do so.

Thanks again,

Juliusz

[1] Offtopic: I have mixed feelings on the inclusion of STICS.  On the
one hand it's great to at last have a standardised encoding for math
characters; on the other, I feel it is based on encoding principles
very different from those of the rest of Unicode.




A few questions about decomposition, equivalence and rendering

2002-02-05 Thread Juliusz Chroboczek

Dear all,

Sorry if these questions have been answered before.

Spacing diacritical marks (e.g. U+00A8) have compatibility
decompositions of the form 0020 followed by the corresponding
combining character (e.g. 0020 0308 for U+00A8).  Why are these not canonical
decompositions?  Under what circumstances would you expect the spacing
marks to behave differently from their decompositions?

The two that are in ASCII don't decompose.  Is that because they're
overloaded?

A number of combining characters (e.g. U+0340, U+0341, U+0343) have
canonical equivalents, i.e. canonical decompositions that are a single
character.  In other words, we have pairs of codepoints that are bound
to behave in exactly the same manner under all circumstances.  What's
the deal?
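
(The properties in question can be read straight out of the Unicode
Character Database; in Python, for instance, the unicodedata module
reports:

  import unicodedata

  # U+00A8 DIAERESIS: compatibility, not canonical, decomposition.
  print(unicodedata.decomposition("\u00A8"))    # '<compat> 0020 0308'

  # U+0340: canonical decomposition to the single character U+0300,
  # so the two normalise identically.
  print(unicodedata.decomposition("\u0340"))    # '0300'
  print(unicodedata.normalize("NFD", "\u0340") == "\u0300")   # True

so the question is about the rationale, not about the data itself.)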

Unicode contains a number of precomposed spacing diacritical marks for
Greek (e.g. U+1FC1).  However, and unless I've missed something, with
the exception of U+0385, they do not have combining (non-spacing)
versions.  What's the rationale here?

(Similar precomposed diacritical marks do not seem to exist for
Vietnamese, which makes me think they've been included for
compatibility with legacy encodings rather than for a good reason.
Still, because their decompositions are not canonical, they need to be
taken into account, which in my case complicates what would otherwise
be somewhat cleaner code.)

When rendering stacked combining characters (i.e. sequences of
combining characters with the same non-zero combining class), which
sequences need to be treated specially (as opposed to being stacked on
top of each other)?  I already know about the pairs needed for Greek
(both Mono- and Polytonic) and Vietnamese.

As far as I can tell, there is nothing in the Unicode database that
relates a ``modifier letter'' to the associated punctuation mark.  Is
that right?  Does anyone have such data that I could steal?
(Hopefully with no legal strings attached.)

(Aside: I would expect a search function in a text editor or a search
engine to identify modifier letters with punctuation marks -- I expect
the two to be confused in practice.  But I couldn't find anything to
this effect in the Book.)

On a related note, does anyone have a map from mathematical characters
to the Geometric Shapes, Misc. symbols and Dingbats that would be
useful for rendering?

Thanks a lot,

Juliusz




Re: [OT] o-circumflex

2001-09-10 Thread Juliusz Chroboczek

 It's as weird as some Italian names for German cities: Aquisgrana
 for Aachen, Augusta for Augsburg, Magonza for Mainz, Monaco (di
 Baviera) for München.

MK Interesting that Polish names of these cities are more like Italian
MK than German: Akwizgran, Augsburg, Moguncja, Monachium.

Because they're adaptations of the mediaeval Latin names.

The same is true of historically important Polish cities, by the way:
Varsovie, Cracovie in French, Varsavia, Cracovia in Italian.  English
uses the German names instead (Warsaw, Cracow).

Juliusz




Re: OpenType support under Linux

2001-08-22 Thread Juliusz Chroboczek

Dear William,

The author of Pango is Owen Taylor, and you can reach him as otaylor
at redhat.com.

I would very strongly suggest that you do so.  Owen has been doing a
great job, and I personally have no doubt that Pango is the future of
multilingual text display under Unix-like systems.  Adding support for
Burmese to Pango is the best way to ensure that future applications
will support your language.

OpenType is but one of the ways to go.  I believe that Owen is the
best person to ask for advice.

Regards,

Juliusz

P.S. Sarasvati, it looks like I'm banned from posting to the Unicode
list.  Any chance you could look into it?




Compressing Unicode [was: A UTF-8 based News Service]

2001-07-14 Thread Juliusz Chroboczek

[sorry if you receive this twice -- wee little problem with my mailer]

D Recently I created a test file of all Unicode characters in code
D point order (excluding the surrogates, but including all the other
D non-characters).  I will admit up front that this is a pathological
D test case and real-world data probably won't behave anywhere near
D the same.

This test is completely and utterly meaningless.  (CP/M 100% faster
than Unix, according to Ziff-Davis.)

Flate compression (used by both the ZIP and gzip formats) is a
two-step process.  First, repeated strings are eliminated using a variant
of LZ.  Then, the resulting data are encoded using, I believe, dynamic
Huffman coding.

In the case of SCSU, your data repeats essentially the same byte
sequence once every window.  The LZ step will reduce every occurrence
of this sequence but the first to a single token, which the Huffman
coding will then reduce to a handful of bits.

In the other case, the UTF-8 version of your data doesn't contain a
single repeated byte sequence, which is extremely pathological indeed.
Thus, Flate on this data degenerates to dynamic Huffman.

A trivial differential predictor (applied to codepoints, not to UTF-8
byte values) would yield much better results in this case than SCSU
(roughly 99.9% compression, I believe).  Doug, are you trying to sell
us a bridge?
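
For what it's worth, here is a sketch (in Python) of the kind of
differential predictor I mean: delta-code successive codepoints, then
hand the result to a generic Flate compressor.  On a file that lists
the codepoints in order, the deltas are almost all 1, which compresses
to nearly nothing:

  import zlib

  def delta_encode(text):
      # Replace each codepoint by its difference from the previous
      # one, serialised as three little-endian bytes.
      prev, out = 0, bytearray()
      for ch in text:
          cp = ord(ch)
          out += ((cp - prev) % 0x110000).to_bytes(3, "little")
          prev = cp
      return bytes(out)

  # Every codepoint in order, skipping the surrogates.
  data = "".join(map(chr, list(range(0x20, 0xD800)) +
                          list(range(0xE000, 0x110000))))
  print(len(zlib.compress(data.encode("utf-8"), 9)))     # Flate on UTF-8
  print(len(zlib.compress(delta_encode(data), 9)))       # Flate after deltas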

Juliusz




Re: The perfect solution for the UTF-8/16 discussion

2001-06-26 Thread Juliusz Chroboczek

CB The ASCII vs. EBCDIC problem would go away if we got all of
CB the hardware to support Unicode natively.

Visions of the MMU performing normalisation on the fly during a DMA
transfer from the paper tape reader.

Juliusz





More about UTF-8S: don't multiply UTFs

2001-06-14 Thread Juliusz Chroboczek

Dear all,

In the discussion about UTF-8S, there is one point that has not been
mentioned (or else I missed it).

Most people seem to be arguing from the point of view of users and
developers on platforms on which Unicode is well-established as the
default encoding.  On Unix-like systems, however, ISO 2022-based
encodings are still alive and kicking.  Hard.

One of the main arguments in favour of using Unicode on such platforms
is that it leads to a world in which there is only one encoding, both
for the user and the developer.  The multiplication of UTFs, however,
not only breaks this model, but also leads to much confusion.  (Heck,
many users still think that UTF-8 and Unicode are two completely
unrelated encodings!  Try explaining to them that UTF-16 is Unicode
too!)

I tried to point this out when IANA was introducing UTF-16BE and
other monstrosities, only to be treated in a rather patronising
manner by some of the respectable members of this list (``Juliusz's
confusion can be explained by...'').  Folks, from a user's
perspective, UTF-8 and UTF-16 are two different encodings.  Please don't make
the situation worse than it already is.  Don't create any more UTFs.

Whatever happens, we will continue to promote signature-less UTF-8 as
the only user-visible encoding, and signature-less UTF-8 (mb) and
BOM-less UCS-4 (wc) as the only programmer-visible ones.  The more
UTFs the Unicode Consortium legitimises, the more explaining we'll
have to do that ``this is just a version of Unicode used on some other
platforms, please convert it to UTF-8 before use.''

Regards,

Juliusz Chroboczek




Re: Support for UTF-8 in ISO-2022/6429 terminals

2001-05-11 Thread Juliusz Chroboczek

Darren,

DM Now, we added UTF-8 support to the ANSI task following the 
DM ISO-IR 196 specification.

This is great to hear.

DM Does anyone know of any examples of host computers or operating
DM systems that actually use UTF-8 on an ISO 6429 implementation?

Currently, the main application that can make good use of a UTF-8
terminal is the ``lynx'' text-mode web browser.  It will automatically
convert web pages from a variety of encodings into whatever the
terminal's encoding is, including UTF-8.

Perhaps more importantly, a number of Unix-like systems already have
or will soon have support for Unicode locales.  Properly
internationalised applications running under such locales assume UTF-8
for terminal I/O.

To summarise: vendors of terminal emulators are going to have to
provide UTF-8 support in the near future.  It is great to hear that
you've started working on this now, rather than when your customers
start complaining.

Regards,

Juliusz




Re: Latin w/ diacritics (was Re: benefits of unicode)

2001-04-19 Thread Juliusz Chroboczek

MC Well, I am not saying that it would be easy, or that it would be worth
MC doing, but would it really take *millions* of dollars to implement
MC Unicode on DOS or Windows 3.1?

MC BTW, I don't know in detail the current status of Unicode support
MC on Linux, but I know that projects are ongoing.

Okay, I'll byte, although I prefer to speak of ``free Unix-like
systems'' rather than Linux only.

The easiest way of browsing the multilingual web on a 386 with 4 MB of
memory and a 10 MB hard disk is probably to use the text-mode ``lynx''
browser in a terminal emulator that supports (a sufficiently large
subset of) Unicode.

One such terminal emulator is the Linux console, which only supports
the very basics of Unicode.  An alternative is the XFree86 version of
XTerm, which also supports single combining characters and
double-width glyphs.  (Enough, for example, for Chinese or Thai, but
not for Arabic.)  In order to use that on a machine such as the one
outlined above, you'll probably need to build a custom X server to
save space, but it's definitely doable.  (Drop me a note if you need a
hand.)

I know of the existence of fairly lightweight and fully
internationalised graphical browsers for Unix-like systems (Konqueror
comes to mind), but I doubt you'll get away with much less than a fast
486 with 12 MB of memory and 100 MB of disk.

Regards,

Juliusz




Re: Displaying unicode.....

2001-04-04 Thread Juliusz Chroboczek

DG What is the best "way" to display Unicode characters on an Intel
DG platform running Red Hat Linux?

This is an interesting question, and one that is currently the subject of
much debate.

One possible answer is that you need to use version 2.2 or later of
the C library, version 4.0.3 or later of the XFree86 libraries and
fonts, and run in a UTF-8 locale.  Properly internationalised
applications should then be able to do some primitive processing of
Unicode text.

The other answer is that a number of recent applications use Unicode
internally in all locales, and only use the locale's encoding on I/O.
This is the case with the XFree86 version of XTerm when run with the
`-u8' flag, with Mozilla, with KDE 2, and I believe also with
development versions of Gnome.  Such applications are likely to have
better support for Unicode rendering (combining characters, contextual
glyph substitution, etc.).
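
A sketch of that ``Unicode inside, locale encoding at the edges''
scheme, in Python and assuming a Unix-style locale environment
(locale.nl_langinfo is only available on Unix):

  import locale, sys

  locale.setlocale(locale.LC_ALL, "")              # honour the user's locale
  encoding = locale.nl_langinfo(locale.CODESET)    # e.g. 'UTF-8' or 'EUC-JP'

  def read_text(stream=sys.stdin.buffer):
      # Decode locale-encoded input into Unicode for internal use.
      return stream.read().decode(encoding, errors="replace")

  def write_text(s, stream=sys.stdout.buffer):
      # Encode internal Unicode text back to the locale's encoding.
      stream.write(s.encode(encoding, errors="replace"))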

A suitable forum for this sort of discussion is the XFree86 i18n list,
which you should feel welcome to join.

  http://www.xfree86.org/mailman/listinfo/i18n

Regards,

        Juliusz Chroboczek




Re: press release

2000-08-02 Thread Juliusz Chroboczek

MB Output goes to PDF, PostScript, line printers, PCL as well as
MB HTML/XML. It would sure be nice if all those technologies handled
MB context sensitive glyph placement...but this is only the year
MB 2000.

PostScript and, to a certain extent, PDF do not manipulate characters;
all they ever see is glyphs.  The application generating PS or PDF is
supposed to do the glyph selection and placement.

J.



Re: UTF-8N?

2000-06-21 Thread Juliusz Chroboczek

(I've allowed myself to quote from a number of distinct posts.)

DE On the contrary, I thought Peter's point was that the OS (or the
DE split/ merge programs) should *not* make any special assumptions
DE about text files.

Sorry if I wasn't clear.  I was taking for granted that OSes will not
reliably keep track of file types (we all know the problems that this
creates for VMS and Apple Mac users).  I was pointing out that without
a clear notion of file type, the BOM is a bad idea.

PC Without rules, users will generate UTF-8 files that both do and
PC don't start with a BOM.  If there is software out there that's going
PC to blow up in one or the other case, that's not a satisfactory
PC state of affairs.

The problem is not one of broken software.  The problem is that, as
John Cowan explained in detail, with the addition of the BOM, UTF-8
and UTF-16 become ambiguous.  (In what follows, I use ``a Unicode
file'' for ``a file containing Unicode data in one of UTF-8 or UTF-16'').

It all stems from the fact that U+FEFF is not only what is used for
the BOM, but also a valid Unicode/ISO 10646 codepoint.  The issue
would be solved by deprecating the use of U+FEFF as a Unicode
character (for example by defining a new codepoint for ZWNBSP), and
using U+FEFF for the BOM only.  The standard could then say that
applications should discard all occurrences of U+FEFF when reading a
file, and allow applications to insert U+FEFF at arbitrary points when
writing a Unicode file.
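
The reading half of that rule is a one-liner; a sketch in Python,
purely to illustrate what ``discard all occurrences'' would mean:

  def read_unicode_text(path, encoding="utf-8"):
      # Drop every U+FEFF, whether it was a signature at the start of
      # the file or ended up in the middle after splitting and
      # re-concatenation.
      with open(path, encoding=encoding) as f:
          return f.read().replace("\ufeff", "")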

I suspect that deprecating U+FEFF is not politically acceptable for
Unicode and ISO 10646, though.

PC Doesn't that simply indicate that, in a protocol that dissects a
PC long file into parts to be transmitted separately, it is
PC inappropriate to add a BOM to the beginnings of the parts, whether
PC they use UTF-8 or UTF-16?

Appropriate or not, users (you know, those people who don't read the
documentation that the programmers don't write) will use text editors
to split files.  They will then concatenate the files using a
non-Unicode-aware tool.  And they will complain that the checksums
don't match.

(What do *you* use to split files on a Windows machine that doesn't
have your favourite utilities installed?)

PC I think that the variations in BOM are just as "uninteresting" as
PC the variations in line ending:

Just as uninteresting and just as annoying.  The difference being that
we've had over twenty years to learn to deal with CR/LF mismatches
(and fixed-length records, and Fortran carriage control).  The BOM
issue opens a whole new area to make new mistakes in.

(Who should I contact to register ``UCS-4PDP11'', the mixed-endian
form of UCS-4?)

Regards,

        Juliusz Chroboczek