Re: Hebrew spell-checking in OpenOffice

2010-11-05 Thread Lior Kaplan
On Tue, Nov 2, 2010 at 3:55 PM, Lior Kaplan kaplanl...@gmail.com wrote:

 On Tue, Nov 2, 2010 at 2:44 PM, Nadav Har'El n...@math.technion.ac.il wrote:

 On Tue, Nov 02, 2010, Lior Kaplan wrote about Re: Hebrew spell-checking
 in OpenOffice:
  I've double checked this, and Debian doesn't include a tool needed for
  building the hunspell target. See
  http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=602189

 I see :(

 However, if we're talking about the OpenOffice package, not the Debian
 package, you're not really constrained by what is available on Debian.


 I'm constrained, as I work (and package) on Debian. But I'll take the
 dictionary files from the Fedora RPM. I'll update you when a new extension
 is ready, and we'll test it.


A new version is available. The dictionary is taken from Fedora (hunspell
format, with fatverb).

http://extensions.services.openoffice.org/en/project/dict-he

 Kaplan
___
Linux-il mailing list
Linux-il@cs.huji.ac.il
http://mailman.cs.huji.ac.il/mailman/listinfo/linux-il


Re: Hebrew spell-checking in OpenOffice

2010-11-05 Thread Dotan Cohen
 The first issue is acronyms (rashei tevot) and abbreviations. In Hebrew,
 these use the geresh and gershayim (or single or double quotes), which are
 part of the word. OpenOffice does not understand that these quotes are part
 of the Hebrew word, and splits the word on them. As a result, all acronyms are
 marked as spelling mistakes. This is really annoying, especially for certain
 types of documents where acronyms are common.


I filed this bug somewhere, but I cannot find it. While looking for it
I found this other bug that might interest you, though:
https://bugs.kde.org/show_bug.cgi?id=169403

If I find the hspell bug I'll post back.

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com



Re: Hebrew spell-checking in OpenOffice

2010-11-02 Thread Lior Kaplan
2010/11/2 Nadav Har'El n...@math.technion.ac.il

 Recently I noticed that (thanks to Lior Kaplan, it seems) it is now trivial
 to get Hebrew spellchecking (based on Hspell 1.1) in OpenOffice.
 The Hebrew localized version (now available on the official OpenOffice
 site!) comes with Hebrew spell-checking pre-bundled, and there's an
 extension [1] for those who use the English version of OpenOffice.


My pleasure (:

For now it's available only in the 3.3 RC releases, and it will be included
in the final release.
http://download.openoffice.org/all_rc.html

 The first issue is acronyms (rashei tevot) and abbreviations. In Hebrew,
 these use the geresh and gershayim (or single or double quotes), which are
 part of the word. OpenOffice does not understand that these quotes are part
 of the Hebrew word, and splits the word on them. As a result, all acronyms
 are marked as spelling mistakes. This is really annoying, especially for
 certain types of documents where acronyms are common.


Known issue, and reported at
http://www.openoffice.org/issues/show_bug.cgi?id=99796

It is marked for work during the 3.4 release.
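
To illustrate the word-splitting problem described above, here is a minimal
Python sketch of a tokenizer that treats a geresh or gershayim (or an ASCII
single or double quote) as part of the word when it sits between two Hebrew
letters. It only illustrates the idea and is not how OpenOffice or hunspell
actually segment text; the names naive_tokenize and tokenize are made up for
this example.

```python
import re

# Hebrew letters: alef (U+05D0) through tav (U+05EA).
HEB = "\u05D0-\u05EA"
# Marks that may sit inside a Hebrew word: ASCII ' and ", plus the
# dedicated geresh (U+05F3) and gershayim (U+05F4) characters.
MARKS = "'\"\u05F3\u05F4"

# Naive rule: a word is any run of Hebrew letters, so quotes split words.
NAIVE_WORD = re.compile(rf"[{HEB}]+")

# Assumed rule for this sketch: a mark belongs to the word only when it
# is directly followed by more Hebrew letters.
WORD = re.compile(rf"[{HEB}]+(?:[{MARKS}][{HEB}]+)*")


def naive_tokenize(text):
    """Split on everything that is not a Hebrew letter."""
    return NAIVE_WORD.findall(text)


def tokenize(text):
    """Keep geresh/gershayim (or quotes) that sit between Hebrew letters."""
    return WORD.findall(text)


if __name__ == "__main__":
    sample = 'צה"ל הודיע'
    print(naive_tokenize(sample))  # the acronym is cut in two
    print(tokenize(sample))        # the acronym stays in one piece
```

A mark at the very end of a word (some abbreviations end with a geresh) is
deliberately not handled here, since without more context it cannot be told
apart from a closing quote.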


 The second issue is the correction suggestions for spelling errors. All
 the suggestions indeed appear to be valid words, but their order is
 terrible - it appears little or no attention was paid to trying to provide
 the most likely suggestions first. The screenshot on the extension page [1]
 provides an excellent example: When given the mis-spelling עיברי, rather
 than provide the most likely suggestion first - עברי, it is given as the
 8th suggestion, and the first suggestions are highly unlikely.

[..]

 I believe that hunspell's dictionary in fact has a way to give such
 correction rules, but I don't know how to correctly write them, or how to
 make OpenOffice use them.


The word list in the extension is created in myspell's format. Hunspell's
format should be similar, but I couldn't build it at the time. The builds
were done as part of the Debian hspell package, which I maintain.


Re: Hebrew spell-checking in OpenOffice

2010-11-02 Thread Nadav Har'El
On Tue, Nov 02, 2010, Lior Kaplan wrote about Re: Hebrew spell-checking in 
OpenOffice:
  I believe that hunspell's dictionary in fact has a way to give such
  correction rules, but I don't know how to correctly write them, or how to
  make OpenOffice use them.
 
 
 The word list in the extension is created in myspell's format. Hunspell's
 format should be similar, but I couldn't build it at the time. The builds
 were done as part of the Debian hspell package, which I maintain.

Please let me know if you need help creating a hunspell-format dictionary
from Hspell (it shouldn't be difficult - basically, running "make hunspell"
should do it).

OpenOffice loads the hunspell-format dictionary (with so-called double
affix compression) *much* faster than it does the old myspell format,
which fixes the old lockup-for-many-seconds-while-loading-the-hebrew-
dictionary bug (see http://qa.openoffice.org/issues/show_bug.cgi?id=66939).

So it is actually important that you use the hunspell target, not the
myspell target, in your packages.

-- 
Nadav Har'El|Tuesday, Nov  2 2010, 25 Heshvan 5771
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |A facility for quotation covers the
http://nadav.harel.org.il   |absence of original thought.



Re: Hebrew spell-checking in OpenOffice

2010-11-02 Thread Nadav Har'El
On Tue, Nov 02, 2010, Lior Kaplan wrote about Re: Hebrew spell-checking in 
OpenOffice:
 Known issue, and reported at
 http://www.openoffice.org/issues/show_bug.cgi?id=99796

Thanks for the pointer.
I'll vote for the issue (if I can be of any other help, please let me know).

  The second issue is the correction suggestions for spelling errors. All
  the suggestions indeed appear to be valid words, but their order is
  terrible - it appears little or no attention was paid to trying to provide

Dan checked, and it appears that the suboptimal (to put it gently) corrections
are indeed not specific to OpenOffice, and already happen in hunspell itself.
E.g., try

$ echo עיברי | hunspell -d he_IL

Looking at the hunspell documentation, I see that there are TRY, REP and MAP
keywords in the dictionary which can be used to specify letters that sound
the same, and so on. We already used TRY, but not any of the others - and
I guess we need to. Does anyone on this list have any experience with those?

In particular, can one of these keywords be used to say that inserting
or deleting a waw or a yod is more likely than inserting or deleting a gimel?
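
To make the question concrete, here is a rough Python sketch of the kind of
ranking I have in mind: an edit distance in which inserting or deleting a waw
or a yod is cheaper than inserting or deleting any other letter. This is not
something hunspell does; the 0.3 cost is an arbitrary assumption, and the
candidate list in the example is made up for illustration.

```python
# Letters whose insertion or deletion is assumed to be a "light" error.
WEAK_LETTERS = {"\u05D5", "\u05D9"}  # waw, yod


def indel_cost(ch):
    """Cost of inserting or deleting one letter (0.3 is an arbitrary choice)."""
    return 0.3 if ch in WEAK_LETTERS else 1.0


def weighted_distance(a, b):
    """Edit distance from a to b with per-letter insert/delete costs."""
    # prev[j] holds the cost of turning the processed prefix of a into b[:j].
    prev = [0.0]
    for cb in b:
        prev.append(prev[-1] + indel_cost(cb))
    for ca in a:
        cur = [prev[0] + indel_cost(ca)]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + indel_cost(ca),                     # delete ca
                cur[j - 1] + indel_cost(cb),                  # insert cb
                prev[j - 1] + (0.0 if ca == cb else 1.0),     # keep or substitute
            ))
        prev = cur
    return prev[-1]


def rank(misspelled, candidates):
    """Order candidate corrections from most to least likely."""
    return sorted(candidates, key=lambda w: weighted_distance(misspelled, w))


if __name__ == "__main__":
    # Hypothetical candidate list, not real hunspell output.
    for word in rank("עיברי", ["גברי", "עיברו", "עברי"]):
        print(word, weighted_distance("עיברי", word))
```

With these weights, dropping the extra yod costs far less than replacing any
letter, so עברי comes out first for עיברי.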

-- 
Nadav Har'El|Tuesday, Nov  2 2010, 25 Heshvan 5771
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |We don't see things as they are, we see
http://nadav.harel.org.il   |them as we are. -- Anais Nin



Re: Hebrew spell-checking in OpenOffice

2010-11-02 Thread Alan Yaniger
Actually, the lockup-for-many-seconds bug was fixed by changing the
encoding of the dictionary to UTF-8. (See
http://www.openoffice.org/issues/show_bug.cgi?id=105490.)

Alan

On 11/02/2010 01:09 PM, Nadav Har'El wrote:

 OpenOffice loads the hunspell-format dictionary (with so-called double
 affix compression) *much* faster than it does the old myspell format,
 which fixes the old lockup-for-many-seconds-while-loading-the-hebrew-
 dictionary bug (see http://qa.openoffice.org/issues/show_bug.cgi?id=66939).

--
Alan Yaniger
Tk Open Systems
0546-841-481




Re: Hebrew spell-checking in OpenOffice

2010-11-02 Thread Dan Kenigsberg
On Tue, Nov 02, 2010 at 01:25:18PM +0200, Nadav Har'El wrote:
 On Tue, Nov 02, 2010, Lior Kaplan wrote about Re: Hebrew spell-checking in 
 OpenOffice:
  Known issue, and reported at
  http://www.openoffice.org/issues/show_bug.cgi?id=99796
 
 Thanks for the pointer.
 I'll vote for the issue (if I can be of any other help, please let me know).
 
   The second issue is the correction suggestions for spelling errors. All
   the suggestions indeed appear to be valid words, but their order is
   terrible - it appears little or no attention was paid to trying to provide
 
 Dan checked, and it appears that the suboptimal (to put it gently) corrections
 are indeed not specific to OpenOffice, and already happen in hunspell itself.
 E.g., try
 
   $ echo עיברי | hunspell -d he_IL
 
 Looking at the hunspell documentation, I see that there are TRY, REP and MAP
 keywords in the dictionary which can be used to specify letters that sound
 the same, and so on. We already used TRY, but not any of the others - and
 I guess we need to. Does anyone on this list have any experience with those?

The following did not get into hspell 1.1 (Fedora and RHEL6 have it, though),
but if you append these lines to hunspell's .aff file, you get some sound-alikes.
I did not find a way to convey the light weight of yod and waw in hunspell(4).
It sounds like a reasonable feature request, though.

MAP 10
MAP ךכח
MAP םמ
MAP ןנ
MAP ףפ
MAP ץצ
MAP כק
MAP אע # for English
MAP גה # for Russian
MAP צס # for Arabic
MAP חכר # for French
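
For anyone wondering what the MAP lines above buy you, here is a simplified
Python sketch of the idea: letters in the same MAP group are treated as
interchangeable, every such substitution is tried, and only the variants found
in the dictionary are kept. This is my own illustration, not hunspell's code
(its real suggestion logic is more involved), and the function names and the
tiny dictionary are made up for the example.

```python
from itertools import product

# The first six MAP groups from the lines above (the transliteration
# groups are left out to keep the example short).
MAP_GROUPS = ["ךכח", "םמ", "ןנ", "ףפ", "ץצ", "כק"]


def map_variants(word, groups=MAP_GROUPS):
    """All spellings reachable by swapping letters within a MAP group."""
    # related[ch] = every letter that shares at least one group with ch.
    related = {}
    for group in groups:
        for ch in group:
            related.setdefault(ch, set()).update(group)
    # Each position may keep its letter or take any related letter.
    choices = [sorted(related.get(ch, {ch})) for ch in word]
    return {"".join(combo) for combo in product(*choices)} - {word}


def map_suggestions(word, dictionary, groups=MAP_GROUPS):
    """Variants of word that are actual dictionary entries."""
    return sorted(map_variants(word, groups) & set(dictionary))


if __name__ == "__main__":
    # Hypothetical two-word dictionary; het was typed instead of kaf.
    print(map_suggestions("מחתב", {"מכתב", "שלום"}))
```

Trying every combination grows quickly with word length, so a real
implementation would need to limit the search; the sketch ignores that.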



Re: Hebrew spell-checking in OpenOffice

2010-11-02 Thread Nadav Har'El
On Tue, Nov 02, 2010, Lior Kaplan wrote about Re: Hebrew spell-checking in 
OpenOffice:
 I've double checked this, and Debian doesn't include a tool needed for
 building the hunspell target. See
 http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=602189

I see :(

However, if we're talking about the OpenOffice package, not the Debian
package, you're not really constrained by what is available on Debian.

 The solution to #66939 was providing the dictionary in UTF-8 instead of
 iso-8859-8 encoding. The freeze in oo.org was actually a conversion to
 UTF-8.
 
 Hunspell might be faster than myspell, but the difference is minor compared
 to the UTF-8 conversion. At the moment oo.org loads the dictionary in less
 than 1 sec.

Oh, sorry. I guess I remembered it wrongly.

You're right. I checked on my system, and the myspell-format dictionary takes
0.3 seconds to load, while the hunspell-format takes 0.1 seconds. Not a
dramatic difference. The uncompressed size of the hunspell format is half
that of myspell - again, not dramatic. I think the difference in memory use
is more dramatic (9 MB vs. 36 MB in a test I just did).

I never understood the UTF-8 problem, by the way. Was this bug ever fixed?
No encoding conversion should ever have been this slow. Even if they had
piped to an external iconv process, it would still have been 100 times
faster ;-)
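
For reference, the conversion itself is tiny. Here is a minimal Python sketch
of converting a dictionary file from ISO-8859-8 to UTF-8 ahead of time,
assuming the file declares its encoding with a hunspell "SET" line (as .aff
files do). The function name and the file names in the usage note are
hypothetical, and this is not how the actual extension was built.

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="iso8859_8"):
    """Re-encode a dictionary file as UTF-8 and update its SET line."""
    text = Path(src).read_bytes().decode(source_encoding)
    converted = []
    for line in text.split("\n"):
        # An .aff file names its encoding in a line starting with SET;
        # .dic files have no such line and pass through unchanged.
        if line.split()[:1] == ["SET"]:
            line = "SET UTF-8"
        converted.append(line)
    Path(dst).write_bytes("\n".join(converted).encode("utf-8"))
```

It would be called once per file, e.g. convert_to_utf8("he_IL.aff",
"he_IL.utf8.aff"), and the same call works for the matching .dic file.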

-- 
Nadav Har'El|Tuesday, Nov  2 2010, 25 Heshvan 5771
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Classical music: music written by a
http://nadav.harel.org.il   |decomposing composer.



Re: Hebrew spell-checking in OpenOffice

2010-11-02 Thread Lior Kaplan
On Tue, Nov 2, 2010 at 2:44 PM, Nadav Har'El n...@math.technion.ac.il wrote:

 On Tue, Nov 02, 2010, Lior Kaplan wrote about Re: Hebrew spell-checking in
 OpenOffice:
  I've double checked this, and Debian doesn't include a tool needed for
  building the hunspell target. See
  http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=602189

 I see :(

 However, if we're talking about the OpenOffice package, not the Debian
 package, you're not really constrained by what is available on Debian.


I'm constrained, as I work (and package) on Debian. But I'll take the
dictionary files from the Fedora RPM. I'll update you when a new extension
is ready, and we'll test it.

Kaplan