Re: Hebrew spell-checking in OpenOffice
On Tue, Nov 2, 2010 at 3:55 PM, Lior Kaplan kaplanl...@gmail.com wrote:
> On Tue, Nov 2, 2010 at 2:44 PM, Nadav Har'El n...@math.technion.ac.il wrote:
>> On Tue, Nov 02, 2010, Lior Kaplan wrote about Re: Hebrew spell-checking in OpenOffice:
>>> I've double checked this, and Debian doesn't include a tool needed for building the hunspell target. See http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=602189
>>
>> I see :( However, if we're talking about the OpenOffice package, not the Debian package, you're not really constrained by what is available on Debian.
>
> I'm constrained as I work (and package) on Debian. But I'll take the dictionary files from the Fedora RPM. I'll update you when a new extension is ready and we'll test it.

New version is available. Dictionary taken from Fedora (hunspell format with fastverb).
http://extensions.services.openoffice.org/en/project/dict-he

Kaplan
___
Linux-il mailing list
Linux-il@cs.huji.ac.il
http://mailman.cs.huji.ac.il/mailman/listinfo/linux-il
Re: Hebrew spell-checking in OpenOffice
> The first issue is acronyms (rashei tevot) and abbreviations. In Hebrew, these use the geresh and gershaim (or single or double quotes), which is part of the word. OpenOffice does not understand that these quotes are part of the Hebrew word, and splits the word on them. As a result all acronyms are marked as spelling mistakes. This is really annoying, especially for certain types of documents where acronyms are common.

I filed this bug somewhere, but I cannot find it. While looking for it I found this other bug that might interest you, though:
https://bugs.kde.org/show_bug.cgi?id=169403

If I find the hspell bug I'll post back.

--
Dotan Cohen
http://gibberish.co.il
http://what-is-what.com
Re: Hebrew spell-checking in OpenOffice
2010/11/2 Nadav Har'El n...@math.technion.ac.il
> Recently I noticed that (thanks to Lior Kaplan, it seems) it is now trivial to get Hebrew spellchecking (based on Hspell 1.1) in OpenOffice. The Hebrew localized version (now available on the official OpenOffice site!) comes with Hebrew spell-checking pre-bundled, and there's an extension [1] for those who use the English version of OpenOffice.

My pleasure (: For now it's available only in the 3.3 RC releases, and it will be included in the final release.
http://download.openoffice.org/all_rc.html

> The first issue is acronyms (rashei tevot) and abbreviations. In Hebrew, these use the geresh and gershaim (or single or double quotes), which is part of the word. OpenOffice does not understand that these quotes are part of the Hebrew word, and splits the word on them. As a result all acronyms are marked as spelling mistakes. This is really annoying, especially for certain types of documents where acronyms are common.

Known issue, and reported at http://www.openoffice.org/issues/show_bug.cgi?id=99796
It is marked for work during the 3.4 release.

> The second issue is the correction suggestions for spelling errors. All the suggestions indeed appear to be valid words, but their order is terrible - it appears little or no attention was paid to trying to provide the most likely suggestions first. The screenshot on the extension page [1] provides an excellent example: When given the mis-spelling עיברי, rather than provide the most likely suggestion first - עברי, it is given as the 8th suggestion, and the first suggestions are highly unlikely.
> [..]
> I believe that hunspell's dictionary in fact has a way to give such correction rules, but I don't know how to correctly write them, or how to make OpenOffice use them.

The word list in the extension is created in myspell's format. Hunspell should be similar, but I couldn't build that format at the time. The builds were done as part of the Debian hspell package, which I maintain.
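For illustration, here is a minimal Python sketch of the acronym problem described in this message. It is not OpenOffice's word-breaking code; the function names and the regular expressions are assumptions made for the example. It contrasts a splitter that treats quote marks as word boundaries with one that accepts a geresh, gershayim, or ASCII quote between two Hebrew letters as part of the word.

```python
import re

# Hebrew letters alef..tav (U+05D0..U+05EA).
HEBREW = "\u05d0-\u05ea"
# Geresh (U+05F3), gershayim (U+05F4) and the ASCII quotes often typed instead.
MARKS = "'\"\u05f3\u05f4"

# Naive rule: a word is a run of Hebrew letters, so any quote mark splits it.
NAIVE_RE = re.compile(rf"[{HEBREW}]+")

# Acronym-aware rule: a quote mark is kept when it sits between Hebrew letters.
# A trailing geresh (as in some abbreviations) is deliberately not handled here,
# because it is ambiguous with a closing quotation mark.
AWARE_RE = re.compile(rf"[{HEBREW}]+(?:[{MARKS}][{HEBREW}]+)*")


def naive_tokens(text):
    """Split the way the bug report describes: quotes break the word."""
    return NAIVE_RE.findall(text)


def hebrew_aware_tokens(text):
    """Keep acronyms such as צה"ל as a single token."""
    return AWARE_RE.findall(text)


sample = 'צה"ל הודיע'
print(naive_tokens(sample))         # the acronym is cut into two fragments
print(hebrew_aware_tokens(sample))  # the acronym survives as one token
```

With the naive rule each fragment of the acronym is passed to the spell checker separately, which is why every acronym ends up flagged.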
Re: Hebrew spell-checking in OpenOffice
On Tue, Nov 02, 2010, Lior Kaplan wrote about Re: Hebrew spell-checking in OpenOffice:
>> I believe that hunspell's dictionary in fact has a way to give such correction rules, but I don't know how to correctly write them, or how to make OpenOffice use them.
>
> The word list in the extension is created with myspell's format. Hunspell should be similar but I couldn't build that format at the time. The builds were done as part of the debian hspell package which I maintain.

Please let me know if you need help creating a hunspell-format dictionary from Hspell (it shouldn't be difficult - basically "make hunspell" should do it).

OpenOffice loads the hunspell-format dictionary (with so-called double affix compression) *much* faster than it does the old myspell format, which fixes the old lockup-for-many-seconds-while-loading-the-hebrew-dictionary bug (see http://qa.openoffice.org/issues/show_bug.cgi?id=66939). So it is actually important that you use the hunspell target, not the myspell target, in your packages.

--
Nadav Har'El                        | Tuesday, Nov 2 2010, 25 Heshvan 5771
n...@math.technion.ac.il            |-
Phone +972-523-790466, ICQ 13349191 | A facility for quotation covers the
http://nadav.harel.org.il           | absence of original thought.
Re: Hebrew spell-checking in OpenOffice
On Tue, Nov 02, 2010, Lior Kaplan wrote about Re: Hebrew spell-checking in OpenOffice:
> Known issue, and reported at http://www.openoffice.org/issues/show_bug.cgi?id=99796

Thanks for the pointer. I'll vote for the issue (if I can be of any other help, please let me know).

>> The second issue is the correction suggestions for spelling errors. All the suggestions indeed appear to be valid words, but their order is terrible - it appears little or no attention was paid to trying to provide

Dan checked, and it appears that the suboptimal (to be gentle) corrections indeed are not specific to OpenOffice, and happen already in hunspell. E.g., try

    $ echo עברי | hunspell -d he_IL

Looking at the hunspell documentation, I see that there are TRY, REP and MAP keywords in the dictionary which can be used to specify letters that sound the same, and so on. We already used TRY, but not any of the others - and I guess we need to. Does anyone on this list have any experience with those? In particular, can one of these keywords be used to say that inserting or deleting waw or yod is more likely than inserting or deleting a gimel?

--
Nadav Har'El                        | Tuesday, Nov 2 2010, 25 Heshvan 5771
n...@math.technion.ac.il            |-
Phone +972-523-790466, ICQ 13349191 | We don't see things as they are, we see
http://nadav.harel.org.il           | them as we are. -- Anais Nin
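To make the last question concrete, here is a small Python sketch of the ranking being asked for: an edit distance in which inserting or deleting waw or yod costs less than any other edit. This is independent of hunspell and says nothing about what its keywords can express; the cost values (0.5 and 1.0) and the candidate list are assumptions chosen for the example, not hunspell output.

```python
# Waw (U+05D5) and yod (U+05D9) are cheap to insert or delete in this sketch.
CHEAP_LETTERS = {"\u05d5", "\u05d9"}


def indel_cost(ch):
    """Cost of inserting or deleting one character (assumed values)."""
    return 0.5 if ch in CHEAP_LETTERS else 1.0


def weighted_distance(a, b):
    """Levenshtein distance with per-letter insertion/deletion costs."""
    # prev[j] = cost of turning the processed prefix of a into b[:j]
    prev = [0.0]
    for cb in b:
        prev.append(prev[-1] + indel_cost(cb))
    for ca in a:
        cur = [prev[0] + indel_cost(ca)]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + indel_cost(ca),                 # delete ca
                cur[j - 1] + indel_cost(cb),              # insert cb
                prev[j - 1] + (0.0 if ca == cb else 1.0)  # match or substitute
            ))
        prev = cur
    return prev[-1]


misspelled = "עיברי"
# Hypothetical candidate list, in an arbitrary order.
candidates = ["איברי", "גברי", "עברי"]
ranked = sorted(candidates, key=lambda w: weighted_distance(misspelled, w))
for word in ranked:
    print(word, weighted_distance(misspelled, word))
```

With uniform costs, the one-letter substitution and the dropped yod would tie at distance 1; with the weighted costs, the candidate that differs only by a yod sorts first.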
Re: Hebrew spell-checking in OpenOffice
Actually, the lockup-for-many-seconds-bug was fixed by changing the encoding of the dictionary to UTF-8. (See http://www.openoffice.org/issues/show_bug.cgi?id=105490).

Alan

On 11/02/2010 01:09 PM, Nadav Har'El wrote:
> OpenOffice loads the hunspell-format dictionary (with so-called double affix compression) *much* faster than it does the old myspell format, which fixes the old lockup-for-many-seconds-while-loading-the-hebrew-dictionary bug (see http://qa.openoffice.org/issues/show_bug.cgi?id=66939).

--
Alan Yaniger
Tk Open Systems
0546-841-481
Re: Hebrew spell-checking in OpenOffice
On Tue, Nov 02, 2010 at 01:25:18PM +0200, Nadav Har'El wrote:
> On Tue, Nov 02, 2010, Lior Kaplan wrote about Re: Hebrew spell-checking in OpenOffice:
>> Known issue, and reported at http://www.openoffice.org/issues/show_bug.cgi?id=99796
>
> Thanks for the pointer. I'll vote for the issue (if I can be of any other help, please let me know).
>
>>> The second issue is the correction suggestions for spelling errors. All the suggestions indeed appear to be valid words, but their order is terrible - it appears little or no attention was paid to trying to provide
>
> Dan checked, and it appears that the suboptimal (to be gentle) corrections indeed are not specific to OpenOffice, and happen already in hunspell. e.g., try
>
>     $ echo עברי | hunspell -d he_IL
>
> Looking at the hunspell documentation, I see that there are TRY, REP and MAP keywords in the dictionary which can be used to specify letters that sound the same, and so on. We already used TRY, but not any of the others - and I guess we need to. Does anyone on this list have any experience with those?

It did not get into hspell 1.1 (Fedora and RHEL6 have it, though), but if you append the following lines to hunspell's .aff, you get some soundalikes. I did not find a means in hunspell(4) to convey the light weight of yod and waw. It sounds like a reasonable feature request, though.

MAP 10
MAP ךכח
MAP םמ
MAP ןנ
MAP ףפ
MAP ץצ
MAP כק
MAP אע # for English
MAP גה # for Russian
MAP צס # for Arabic
MAP חכר # for French
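As a rough illustration of what these MAP classes express, the Python sketch below parses the fragment above and lists the strings obtained by swapping one letter of a word for another letter in the same class. It is only a sketch of the idea of "related letters"; it is not hunspell's suggestion algorithm, and the function names and the handling of trailing `#` comments are assumptions of the example.

```python
AFF_FRAGMENT = """\
MAP 10
MAP ךכח
MAP םמ
MAP ןנ
MAP ףפ
MAP ץצ
MAP כק
MAP אע # for English
MAP גה # for Russian
MAP צס # for Arabic
MAP חכר # for French
"""


def parse_map(aff_text):
    """Return the MAP character classes, skipping the 'MAP <count>' header."""
    classes = []
    for line in aff_text.splitlines():
        # Drop anything after '#' (treated as a comment in this sketch).
        fields = line.split("#", 1)[0].split()
        if len(fields) == 2 and fields[0] == "MAP" and not fields[1].isdigit():
            classes.append(fields[1])
    return classes


def single_swaps(word, classes):
    """All strings made by replacing one letter with a letter from its class."""
    variants = set()
    for i, ch in enumerate(word):
        for cls in classes:
            if ch in cls:
                for alt in cls:
                    if alt != ch:
                        variants.add(word[:i] + alt + word[i + 1:])
    return variants


classes = parse_map(AFF_FRAGMENT)
print(len(classes))
print(sorted(single_swaps("כל", classes)))
```

A real checker would keep only the variants that are dictionary words; the sketch stops at generating the candidate strings.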
Re: Hebrew spell-checking in OpenOffice
On Tue, Nov 02, 2010, Lior Kaplan wrote about Re: Hebrew spell-checking in OpenOffice:
> I've double checked this, and Debian doesn't include a tool needed for building the hunspell target. See http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=602189

I see :( However, if we're talking about the OpenOffice package, not the Debian package, you're not really constrained by what is available on Debian.

> The solution to #66939 was providing the dictionary in UTF8 instead of iso-8859-8 encoding. The freeze by oo.org was actually a conversion to UTF-8. Hunspell might be faster than myspell, but the difference is minor compared to the UTF8 conversion. At the moment oo.org loads the dictionary in less than 1 sec.

Oh, sorry. I guess I remembered it wrongly. You're right. I checked on my system, and the myspell-format dictionary takes 0.3 seconds to load, while the hunspell-format takes 0.1 seconds. Not a dramatic difference. The uncompressed size of the hunspell format is half that of myspell - again, not dramatic. I think the difference in memory use is more dramatic (9 MB vs. 36 MB in a test I just did).

I never understood the UTF-8 problem, by the way. Was this bug ever fixed? No encoding conversion should have ever been this slow. Even if they piped to an external iconv process, it would still have been 100 times faster ;-)

--
Nadav Har'El                        | Tuesday, Nov 2 2010, 25 Heshvan 5771
n...@math.technion.ac.il            |-
Phone +972-523-790466, ICQ 13349191 | Classical music: music written by a
http://nadav.harel.org.il           | decomposing composer.
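For anyone who wants to check the claim about conversion speed on their own machine, here is a minimal Python sketch that re-encodes ISO-8859-8 data to UTF-8 and times it. It does not reproduce the OpenOffice bug; the sample data (one short Hebrew word repeated 300,000 times) is an invented stand-in for a dictionary file, and no particular timing is implied.

```python
import time


def convert_wordlist(raw, src="iso-8859-8", dst="utf-8"):
    """Re-encode a whole dictionary file in one pass."""
    return raw.decode(src).encode(dst)


# Hypothetical stand-in for a dictionary file: the word עברי, one per line.
sample_text = "\u05e2\u05d1\u05e8\u05d9\n" * 300_000
raw_8859_8 = sample_text.encode("iso-8859-8")

start = time.perf_counter()
utf8_bytes = convert_wordlist(raw_8859_8)
elapsed = time.perf_counter() - start

print(f"{len(raw_8859_8)} bytes in, {len(utf8_bytes)} bytes out, {elapsed:.4f} s")
```

Each Hebrew letter takes one byte in ISO-8859-8 and two bytes in UTF-8, so the output is larger than the input, but the conversion itself is a single linear pass over the data.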
Re: Hebrew spell-checking in OpenOffice
On Tue, Nov 2, 2010 at 2:44 PM, Nadav Har'El n...@math.technion.ac.il wrote:
> On Tue, Nov 02, 2010, Lior Kaplan wrote about Re: Hebrew spell-checking in OpenOffice:
>> I've double checked this, and Debian doesn't include a tool needed for building the hunspell target. See http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=602189
>
> I see :( However, if we're talking about the OpenOffice package, not the Debian package, you're not really constrained by what is available on Debian.

I'm constrained as I work (and package) on Debian. But I'll take the dictionary files from the Fedora RPM. I'll update you when a new extension is ready and we'll test it.

Kaplan