> It would be good to test your code on the huhyphn Hungarian hyphenation
> patterns (>60000 patterns, http://www.tipogral.hu/huhyphn.tex).

$ time perl substrings.pl huhyphn.tex  huhyphn-perl.mashed
 22.01user 0.07system 0:22.32elapsed 98%CPU (0avgtext+0avgdata 0maxres)k
 0inputs+0outputs (0major+5800minor)pagefaults 0swaps
$ time ./substrings-8bit huhyphn.tex  huhyphn-c8bit.mashed
 1.38user 0.03system 0:01.53elapsed 92%CPU (0avgtext+0avgdata 0maxres)k
 0inputs+0outputs (0major+4100minor)pagefaults 0swaps

> I think the sorting order and the input format don't matter,
> but maybe you didn't use the newer substrings.pl of OpenOffice.org 2.0.2
> with non-standard hyphenation and Unicode support (only non-standard
> hyphenation pattern processing needs special UTF-8 code, because
> the 8-bit hyphenation algorithm handles the UTF-8 patterns correctly).

You're right, except for a silly feature in my code (debugging info). The C code is now indifferent to the encoding of the input (as long as it isn't EBCDIC). The fact that the output is sorted is a side effect, not a feature. Being encoding-agnostic slightly slows down the UTF-8 case, as all invalid UTF-8 sequences are checked too.
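
To illustrate why a byte-oriented pass can ignore the encoding, here is a minimal sketch in Python. It is not the code of substrings-8bit or substrings.pl; the function name split_pattern is hypothetical, and it assumes plain TeX-style patterns in which the ASCII digits 0-9 are the only weight characters. Every byte of a multi-byte UTF-8 sequence is 0x80 or above, so it can never be mistaken for a digit, and the weights between the bytes of one character stay 0.

```python
def split_pattern(pattern: bytes):
    """Split a TeX-style pattern into letter bytes and per-gap weights.

    Illustrative only: assumes ASCII digits 0-9 are the only weight
    characters and that every other byte belongs to a letter.
    """
    letters = bytearray()
    weights = [0]  # weights[i] is the weight before letter byte i
    for b in pattern:
        if 0x30 <= b <= 0x39:        # ASCII '0'..'9'
            weights[-1] = b - 0x30   # weight for the current gap
        else:
            letters.append(b)        # any other byte is part of a letter
            weights.append(0)        # open a new gap after it
    return bytes(letters), weights


# Pure ASCII pattern: one weight between 'a' and 'b'.
print(split_pattern(b"a1b"))

# UTF-8 pattern: each letter is two bytes, yet the only non-zero
# weight sits on the character boundary.
print(split_pattern("á1ő".encode("utf-8")))
```

The same reasoning explains the EBCDIC exception: there the digits are not at 0x30-0x39, so a test like the one above would misclassify them.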

> I'm interested in your TeX hyphenation development, because I plan
> a TeX prehyphenator with non-standard hyphenation, word
> disambiguation and compound word decomposition.
> I also plan a compound word decomposition preprocessor to OpenOffice.org
> hyphenator based on Hunspell and the non-standard hyphenation extension
> of OOo 2.0.2. With compound word decomposition, we will be able
> to make small hyphenation dictionaries along with accurate compound
> word hyphenation. (Currently the Huhyphn patterns with Libhnj use 9 MB of
> memory, due to the missing compound word decomposition.)

This is for Taco's MetaTeX project, funded by Colorado State University.
