On 9/10/14 13:41, Philip Taylor wrote:


Nathan Wells asked  :

I am not sure if this is the right place to ask, but I am trying
to create hyphenation rules for a UTF-8 language (Khmer). I've
tried patgen, but I can't get it to work (some have said it
doesn't support UTF-8?).

to which Werner Lemberg replied via Stack Exchange.  Since not all
subscribed to these lists will necessarily also read Stack Exchange (I
don't, for example), I have repeated some parts of his answer here, to
which I have added related questions of my own.  I have also opened up
the distribution to the XeTeX list, since it seems extremely relevant
thereto.

First of all, whatever you are going to achieve, it won't work with
‘classical’ TeX. This is due to a design decision of Knuth – today we
know that this was unfortunate, but at the time of writing TeX this
was far less obvious: Hyphenation patterns are applied to glyph
indices and not to input character codes. Since there are more than
256 Khmer ligature glyphs, the standard hyphenation algorithm can't
be applied.

Today, this design problem can be circumvented natively by luatex
only,

Does XeTeX also address this issue (open question, not one to which I
claim to already know the answer) ?

When working with Unicode and OpenType fonts, XeTeX applies hyphenation to the characters of the text, not to glyph indices in a font. So the number of *characters* (not glyphs) involved in Khmer should not be a problem for creating XeTeX-compatible Unicode patterns with patgen, perhaps by using the trick of mapping the Khmer Unicode characters to 8-bit values, generating patterns, and then mapping the result back to real Unicode.
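
A rough sketch of that mapping trick, in Python, might look like the following. The function names and the choice of byte codes are illustrative assumptions, not anything patgen or XeTeX prescribes, and the sketch covers only the remapping step; running patgen itself (and preparing its translate file) is not shown.

```python
# Bytes that patgen reserves by default: the digits plus '.', '-' and '*'.
RESERVED = set(b"0123456789.-*")


def build_mapping(words):
    """Assign each distinct character a free 8-bit code (hypothetical scheme)."""
    chars = sorted({ch for w in words for ch in w if ch != "-"})
    # Assumption: skip control characters and space, then skip reserved bytes.
    free = [b for b in range(0x21, 0x100) if b not in RESERVED]
    if len(chars) > len(free):
        raise ValueError("too many distinct characters for an 8-bit mapping")
    return dict(zip(chars, free))


def encode(word, mapping):
    """Convert a hyphenated Unicode word to 8-bit form; '-' passes through."""
    return bytes(ord("-") if ch == "-" else mapping[ch] for ch in word)


def decode(pattern, mapping):
    """Map an 8-bit pattern back to Unicode; digits and '.' are left as-is."""
    reverse = {b: ch for ch, b in mapping.items()}
    return "".join(reverse.get(b, chr(b)) for b in pattern)
```

The hyphenated word list would be passed through encode() before pattern generation, and each generated pattern through decode() afterwards, so that the final pattern file contains real Unicode characters.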


Now back to your problem. The patgen program is completely agnostic
of what it processes; the only limitation is that it cannot handle
more than 243 entities: The 8bit range of 256 characters minus the
digits 0-9 and characters ‘.’, ‘-’, and ‘*’ (which can be mapped to
different characters if necessary). Since the number of Khmer
characters is less than 128, patgen can be used to create patterns.
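
The arithmetic above can be checked with a few lines of Python. This is only a sanity check under one assumption: that "Khmer characters" means the assigned code points of the main Unicode Khmer block (U+1780 to U+17FF). The exact count depends on the Unicode version shipped with the Python interpreter.

```python
import unicodedata

# patgen's default reserved characters: ten digits plus '.', '-' and '*'.
reserved = set("0123456789") | {".", "-", "*"}
usable = 256 - len(reserved)

# Count assigned code points in the Khmer block (category "Cn" = unassigned).
assigned = sum(
    1
    for cp in range(0x1780, 0x1800)
    if unicodedata.category(chr(cp)) != "Cn"
)

print(f"usable 8-bit codes: {usable}")
print(f"assigned Khmer code points: {assigned}")
```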

OK, so let's open up the question from just Khmer :  if I were to want
to build patterns for a language that had more than 243 characters, is
there a variant of Patgen that can correctly handle such a task ?

That would presumably be opatgen, but some work may be needed to get it to compile and run on current systems. (Presumably it used to work on some system or other, at some point in the past.)

JK


