Hi,

Many thanks for your work and the bug report! Could you send me
a fixed substrings.pl? (If not, I will fix it myself.)
It would be good to test your code on the huhyphn Hungarian hyphenation
patterns (more than 60000 patterns, http://www.tipogral.hu/huhyphn.tex).
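
To make sure I understand the combine() problem from your mail: as I
read it, the difference is between rewriting every occurrence of a
sub-pattern and rewriting only the last one. A minimal sketch in Python
(this is not the actual substrings.pl code; the function names and the
"t4a" replacement value are made up for illustration, only the word
"tanta" comes from your example):

```python
def replace_all(main, sub, repl):
    """Rewrite every occurrence of sub in main (the behaviour reported as buggy)."""
    return main.replace(sub, repl)


def replace_last(main, sub, repl):
    """Rewrite only the last occurrence of sub in main."""
    i = main.rfind(sub)
    if i == -1:
        return main  # sub does not occur, nothing to change
    return main[:i] + repl + main[i + len(sub):]


# "ta" occurs twice in "tanta"; "t4a" is a hypothetical replacement value.
print(replace_all("tanta", "ta", "t4a"))   # t4ant4a
print(replace_last("tanta", "ta", "t4a"))  # tant4a
```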

I think the sorting order and the input format don't matter,
but maybe you didn't use the newer substrings.pl of OpenOffice.org 2.0.2,
which has non-standard hyphenation and Unicode support (only the processing
of non-standard hyphenation patterns needs special UTF-8 code, because
the 8-bit hyphenation algorithm handles UTF-8 patterns correctly).
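
The remark about the 8-bit algorithm can be checked with a small
experiment. The sketch below (Python, with a made-up Hungarian word and
pattern, and helper names of my own) searches for a pattern once on
characters and once on the UTF-8 bytes. Because the UTF-8 byte sequence
of one character never occurs inside the sequence of another, the
byte-level hits should fall on the same letters:

```python
def find_all(haystack, needle):
    """Return every start offset of needle in haystack (works for str and bytes)."""
    hits = []
    i = haystack.find(needle)
    while i != -1:
        hits.append(i)
        i = haystack.find(needle, i + 1)
    return hits


def byte_hits_as_char_offsets(word, pattern):
    """Match on UTF-8 bytes, then convert each byte offset to a character offset."""
    data = word.encode("utf-8")
    byte_hits = find_all(data, pattern.encode("utf-8"))
    # Decoding the prefix succeeds only if the hit starts on a character boundary.
    return [len(data[:b].decode("utf-8")) for b in byte_hits]


word = "szőlőtő"   # hypothetical test word
pattern = "lő"     # hypothetical pattern letters (digits left out)

# Character-level and byte-level matching agree on where the pattern occurs.
assert find_all(word, pattern) == byte_hits_as_char_offsets(word, pattern)
```

Positions or lengths that are counted in characters are a different
matter, which is presumably where the non-standard patterns need the
extra UTF-8 handling.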

Standalone hyphenator and extended substrings.pl:
http://www.openoffice.org/nonav/issues/showattachment.cgi/33618/altlinuxHyph2.tar.gz
See also Issue 58558: http://www.openoffice.org/issues/show_bug.cgi?id=58558.

I'm interested in your TeX hyphenation development, because I plan
a TeX prehyphenator with non-standard hyphenation, word
disambiguation and compound word decomposition.
I also plan a compound word decomposition preprocessor for the OpenOffice.org
hyphenator, based on Hunspell and the non-standard hyphenation extension
of OOo 2.0.2. With compound word decomposition, we will be able
to make small hyphenation dictionaries that still hyphenate compound
words accurately. (At present the huhyphn patterns use 9 MB of memory
with Libhnj, because compound word decomposition is missing.)

Best regards,

Laci



Quoting Nanning Buitenhuis <[EMAIL PROTECTED]>:

> Hi,
>
> I wrote a C replacement for substrings.pl.
> Although it uses an identical algorithm it is quite a bit faster:
>
> $ time ./substrings hyphen.us hyphen.new
> 0.04user 0.00system 0:00.05elapsed 84%CPU (0avgtext+0avgdata 0maxres)k
> 0inputs+0outputs (0major+381minor)pagefaults 0swaps
>
> $ time perl substrings.pl hyphen.us hyphen.mashed
> 1.09user 0.00system 0:01.13elapsed 97%CPU (0avgtext+0avgdata 0maxres)k
> 0inputs+0outputs (0major+832minor)pagefaults 0swaps
>
> It also fixes a minor bug in combine(): if a sub-pattern is found twice
> (or more) in the main pattern, then all occurrences were changed instead
> of only the last occurrence (which is correct). The only example in
> hyphen.us is 'tanta3'.
>
> Other caveats are:
> - the output of the C version is sorted in Unicode order
> - the input should be UTF-8
>
> Anybody interested?
>    NaNning.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>




