Arthur Reutenauer wrote:

In order to hyphenate a word in a given language, you need a list of
patterns for that language.  Let’s say the word is “hyphenation” and the
patterns are Knuth and Liang’s file hyphen.tex (available from CTAN:
http://mirror.ctan.org/systems/knuth/dist/lib/hyphen.tex).


I think that what Arthur has written is very helpful, but it will surely
leave the intelligent reader asking "but how were those patterns
generated, and what do the numbers mean?".  The introduction to
Patgen.web sheds some light on this:

Introduction. This program takes a list of hyphenated words and
generates a set of patterns that can be used by the TeX82 hyphenation
algorithm.

The patterns consist of strings of letters and digits, where a digit
indicates a 'hyphenation value' for some intercharacter position. For
example, the pattern "3t2ion" specifies that if the string "tion" occurs
in a word, we should assign a hyphenation value of 3 to the position
immediately before the "t", and a value of 2 to the position between the
"t" and the "i".
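
As an illustration only (this is not code from Patgen or TeX), the reading
of "3t2ion" described above can be sketched in Python. The function name
parse_pattern is hypothetical, and treating positions without a digit as
value 0 is an assumption made for the sketch.

```python
def parse_pattern(pattern):
    """Split a pattern such as "3t2ion" into letters and position values.

    values[i] is the hyphenation value for the position just before
    letters[i]; the final entry is the position after the last letter.
    Positions with no digit are given the value 0 (an assumption here).
    """
    letters = []
    values = [0]  # value for the position before the first letter
    for ch in pattern:
        if ch.isdigit():
            values[-1] = int(ch)  # digit applies to the current position
        else:
            letters.append(ch)
            values.append(0)  # open a new position after this letter
    return "".join(letters), values


letters, values = parse_pattern("3t2ion")
print(letters, values)  # tion [3, 2, 0, 0, 0]
```

The four letters of "tion" give five intercharacter positions, of which
only the first two carry a non-zero value.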

The patterns are generated in a series of sequential passes through the
dictionary. In each pass, we collect count statistics for a particular
type of pattern, taking into account the effect of patterns chosen in
previous passes. At the end of a pass, the counts are examined and new
patterns are selected.

Patterns are chosen one level at a time, in order of increasing
hyphenation value. In the sample run shown below, the parameters
"hyph_start" and "hyph_finish" specify the first and last levels
respectively to be generated.

Patterns at each level are chosen in order of increasing pattern length
(usually starting with length 2). This is controlled by the parameters
"pat_start" and "pat_finish" specified at the beginning of each level.
Furthermore patterns of the same length applying to different
intercharacter positions are chosen in separate passes through the
dictionary.  Since patterns of length n may apply to n + 1 different
positions, choosing a set of patterns of lengths 2 through n for a given
level requires (n+1)(n+2)/2 - 3 passes through the word list.
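
The pass count can be checked by summing the number of positions over the
pattern lengths: lengths 2 through n contribute 3 + 4 + ... + (n + 1)
passes, which equals (n+1)(n+2)/2 - 3. A small Python sketch, with
hypothetical function names, confirms the arithmetic:

```python
def passes_by_summation(max_len, min_len=2):
    """One pass per (length, position) pair; length k has k + 1 positions."""
    return sum(length + 1 for length in range(min_len, max_len + 1))


def passes_closed_form(n):
    """Closed form for lengths 2 through n: (n+1)(n+2)/2 - 3."""
    return (n + 1) * (n + 2) // 2 - 3


for n in range(2, 9):
    assert passes_by_summation(n) == passes_closed_form(n)

print(passes_closed_form(5))  # 18
```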

At each level, the selection of patterns is controlled by the three
parameters "good_wt", "bad_wt" and "thresh". A hyphenating pattern will
be selected if good * good_wt - bad * bad_wt >= thresh, where "good" and
"bad" are the number of times the pattern could and could not be
hyphenated respectively at a particular point. For inhibiting patterns,
"good" is the number of errors inhibited, and "bad" is the number of
previously found hyphens inhibited.
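
The selection test itself is a single inequality. The Python sketch below
is only an illustration of it; the function name is hypothetical, and the
weights and counts in the example are invented numbers, not values from
any real Patgen run.

```python
def is_selected(good, bad, good_wt, bad_wt, thresh):
    """Return True if good * good_wt - bad * bad_wt >= thresh."""
    return good * good_wt - bad * bad_wt >= thresh


# Invented example values: good_wt = 1, bad_wt = 3, thresh = 5.
print(is_selected(good=20, bad=4, good_wt=1, bad_wt=3, thresh=5))  # True
print(is_selected(good=10, bad=2, good_wt=1, bad_wt=3, thresh=5))  # False
```

In the first call the score is 20 - 12 = 8, which meets the threshold; in
the second it is 10 - 6 = 4, which does not. Raising "bad_wt" makes each
bad count cost more against the same threshold.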

The interested reader is referred to (e.g.,) 
http://readytext.co.uk/files/patgen.pdf
Philip Taylor
