> It would be fine to check your code on the huhyphn Hungarian hyphenation
> patterns (>60000 patterns, http://www.tipogral.hu/huhyphn.tex).
$ time perl substrings.pl huhyphn.tex huhyphn-perl.mashed
22.01user 0.07system 0:22.32elapsed 98%CPU (0avgtext+0avgdata 0maxres)k
0inputs+0outputs (0major+5800minor)pagefaults 0swaps
$ time ./substrings-8bit huhyphn.tex huhyphn-c8bit.mashed
1.38user 0.03system 0:01.53elapsed 92%CPU (0avgtext+0avgdata 0maxres)k
0inputs+0outputs (0major+4100minor)pagefaults 0swaps
> I think, the sorting order and the input format don't matter,
> but maybe you didn't use the newer substrings.pl of OpenOffice.org 2.0.2
> with non-standard hyphenation and Unicode support (only non-standard
> hyphenation pattern processing needs special UTF-8 code, because
> the 8-bit hyphenation algorithm handles the UTF-8 patterns correctly).
You're right, except for a silly feature in my code (debugging info). The C
code is now indifferent to the encoding of the input (as long as it isn't
EBCDIC). The fact that the output is sorted is a side effect, not a feature. It
slightly slows down the UTF-8 case, as all invalid UTF-8 sequences are checked too.
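
To illustrate why a byte-oriented reader can ignore the encoding, here is a
minimal sketch of splitting a standard (Liang-style) pattern into letter bytes
and inter-letter values. It is not taken from the actual C program;
split_pattern and its buffer conventions are hypothetical names chosen for this
example. The only assumption is that the digits are ASCII '0'..'9'. Every byte
of a multi-byte UTF-8 sequence is >= 0x80, so it can never be mistaken for a
digit, whereas EBCDIC stores the digits at 0xF0..0xF9 and would break the test.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the real substrings code): split one
 * pattern such as "a1bc3d" into letter bytes and inter-letter values.
 * 'letters' must hold strlen(pat) bytes, 'values' strlen(pat) + 1 bytes.
 * values[i] is the hyphenation value in front of letters[i]; the last
 * entry is the value after the final letter.
 * Returns the number of letter bytes written. */
size_t split_pattern(const char *pat, unsigned char *letters,
                     unsigned char *values)
{
    size_t n = 0;
    const unsigned char *p;

    assert(pat != NULL && letters != NULL && values != NULL);

    values[0] = 0;
    for (p = (const unsigned char *)pat; *p != '\0'; p++) {
        if (*p >= '0' && *p <= '9') {
            /* ASCII digit: value for the gap before the next letter. */
            values[n] = (unsigned char)(*p - '0');
        } else {
            /* Anything else is an opaque letter byte, including '.'
             * and each byte of a multi-byte UTF-8 sequence. */
            letters[n++] = *p;
            values[n] = 0;
        }
    }
    return n;
}
```

A UTF-8 letter simply occupies two or more letter slots with zero values
between its bytes, so no hyphenation point can fall inside a character. This
covers standard patterns only; non-standard hyphenation patterns still need
encoding-aware code, as you noted.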
> I'm interested in your TeX hyphenation development, because I plan
> a TeX prehyphenator with non-standard hyphenation, word
> disambiguation and compound word decomposition.
> I also plan a compound word decomposition preprocessor to OpenOffice.org
> hyphenator based on Hunspell and the non-standard hyphenation extension
> of OOo 2.0.2. With compound word decomposition, we will be able
> to make small hyphenation dictionaries along with accurate compound
> word hyphenation. (Currently the Huhyphn patterns with Libhnj use 9 MB of
> memory, because of the missing compound word decomposition.)
This is for Taco's MetaTeX project, funded by Colorado State University.