Hi,

I've recently added Thai hyphenation patterns to hyph-utf8. I have a few questions though (I discussed this a bit with Taco, with respect to Lao or some Ethiopic script, on a field trip a long time ago).
Description:

Words in Thai aren't separated by spaces, so one can end up with potentially infinite strings that TeX considers to be a single word. As I understand it, there are two problems to be solved:

a) splitting sentences into words
b) syllabification of words

Hyphenation patterns could in principle do both simultaneously, but at some point (at 64 or 256 characters in LuaTeX) TeX runs into a problem of "too long word to hyphenate" and simply stops. (I still believe that the hyphenation algorithm should be able to work on infinite strings as long as the hyphenation patterns are of finite length, but I'm not comfortable working with the TeX sources, and this is a bit off-topic anyway.)

I thought at first that the ICU library in XeTeX does both, but I was told that it only does word splitting, so hyphenation still remains to be done. (Honestly, I don't see why it couldn't do syllabification in addition to word splitting, since determining the boundaries of syllables must be an easier problem, but I might be wrong.)

In pdfTeX the problem is solved by running a special program, "swath":

- http://linux.thai.net/projects/swath (currently broken site)
- http://www.cs.cmu.edu/~paisarn/software.html (broken link)

which reads the input file and creates an output TeX file with the command sequence \wbr inserted between words. After that TeX can easily do its job of hyphenating the separate words, but this requires an external preprocessor, and "latex" as such cannot be run out-of-the-box.

Question:

Are there any plans or visions about how the problem should be tackled in LuaTeX in the most elegant way, given the absence of the ICU library?

I could also ask differently: suppose that a motivated Thai programmer were willing to work on solving the problem properly. What would be the suggested solution?

Thank you,
Mojca
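P.S. To make the preprocessing step concrete, here is a rough Python sketch of what a swath-like tool has to do. It is only an illustration under my own assumptions, not swath's actual algorithm: the names `segment` and `insert_word_breaks` are made up, the two-word lexicon is a toy, the segmentation is naive greedy longest-match, and the trailing space after \wbr is my choice so that the control sequence is terminated.

```python
def segment(text, lexicon):
    """Greedy longest-match segmentation (illustrative, not swath's method).

    Runs of characters that match no lexicon entry are kept together
    as a single "unknown" piece.
    """
    max_len = max(len(word) for word in lexicon)
    pieces, unknown, i = [], "", 0
    while i < len(text):
        # Try the longest candidate starting at position i first.
        match = None
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in lexicon:
                match = text[i:i + size]
                break
        if match is None:
            unknown += text[i]  # no word starts here; extend the unknown run
            i += 1
            continue
        if unknown:
            pieces.append(unknown)
            unknown = ""
        pieces.append(match)
        i += len(match)
    if unknown:
        pieces.append(unknown)
    return pieces


def insert_word_breaks(text, lexicon, marker="\\wbr "):
    """Join the segmented pieces with the word-break command."""
    return marker.join(segment(text, lexicon))


if __name__ == "__main__":
    toy_lexicon = {"ภาษา", "ไทย"}  # hypothetical two-word dictionary
    print(insert_word_breaks("ภาษาไทย", toy_lexicon))
```

A real solution would need a proper dictionary and some disambiguation, since greedy matching can pick the wrong boundary; the sketch only shows where the \wbr markers end up relative to the words.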
