Re: [Languagetool] Hunspell spellcheck performance

Ruud Baars Fri, 22 Jun 2012 12:17:19 -0700

Dominique,

Marcin once did that trick for Dutch, allowing single quotes within a word.
You might want to check that part of the code.

Dutch even has /Jans'/ (owned by Jans) as a word, and /'s morgens/; in both cases these are not quotes but apostrophes.


Ruud

On 22-06-12 20:31, Dominique Pellé wrote:

Marcin Miłkowski <list-addr...@wp.pl> wrote:

...snip...
>> On top of that, there is also the idea of using Hunspell
>> only on words with UNKNOWN POS tag which may work
>> fine for some languages.
>
> This algorithm would be a waste of time: we can already use non-tagged
> words for displaying an error. It will be faster than any Hunspell rule.
> But most Hunspell dicts cover more words than our taggers, so it should
> not make any difference in timing, especially because we would have to
> do some string processing for every sentence to map string portions to
> tokens. In some languages it won't help, if hunspell tokenization is
> different. This is why I didn't bother with this. Moreover, checking
> time is negligible. The crucial thing is the time spent for creating
> suggestions.


Running Hunspell is not equivalent to flagging all words
with the POS tag "UNKNOWN", so using Hunspell
even without suggestions has value, in my opinion,
for 2 reasons:

- words are tokenized differently in Hunspell and in LanguageTool
- the Hunspell dictionary covers a different set of words than the LanguageTool tagger

Example: I added this simple rule to rules/fr/grammar.xml (not checked in)
to highlight all words that have the UNKNOWN POS tag:


<rule id="ORTHOGRAPHE" name="faute d'orthographe">
<pattern>
<marker><token postag="UNKNOWN"/></marker>
</pattern>
<message>Typo</message>

<example type="incorrect">Ce <marker>mott</marker> est malorthographié.</example>
<example type="correct">Ce <marker>mot</marker> est malorthographié.</example>

</rule>

This rule as-is is unusable. It highlights too many good words that should
not be highlighted (unlike Hunspell, which does a better job).

For example, with the input sentence "Le 3e ppoint", the words "3e" and
"ppoint" both have no POS tag, yet Hunspell only highlights "ppoint" as a typo
(which is better).


Also, with the sentence "Il arrive aujourd'hui.", Hunspell
indicates no typo (which is correct), but the above XML rule
flags "aujourd" and "hui" as typos (which is wrong).
That happens because LanguageTool splits the
word "aujourd'hui" (= today) at the apostrophe, whereas Hunspell does not.
By the way, it would be better to change the French tokenizer to
avoid splitting such words (I won't do that before
version 1.8 though). Similarly, it would be better if things like L'
(L apostrophe) were only one token instead of 2 (as in
"L'Afrique"). Currently LT gives 3 tokens "L + ' + Afrique",
but 2 tokens "L' + Afrique" would simplify writing XML
grammar rules.
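
To make the two tokenizations concrete, here is a minimal Python sketch. It is neither LanguageTool's nor Hunspell's actual tokenizer; the function names and the WHOLE_WORDS exception list are made up for illustration. The first function treats every apostrophe as its own token (the behaviour described above), the second keeps listed words whole and attaches an elision apostrophe to the word before it.

```python
import re

# Hypothetical exception list: words that contain an apostrophe but
# should stay a single token. Assumed for this sketch only.
WHOLE_WORDS = {"aujourd'hui"}


def split_on_apostrophe(text):
    """Every apostrophe becomes a separate token (3 tokens for L'Afrique)."""
    return re.findall(r"\w+|[^\w\s]", text)


def tokenize_keeping_elisions(text):
    """Keep listed words whole; otherwise glue the apostrophe to the left part."""
    tokens = []
    for chunk in re.findall(r"[\w']+|[^\w\s]", text):
        if chunk.lower() in WHOLE_WORDS or "'" not in chunk:
            tokens.append(chunk)
            continue
        head, _, tail = chunk.partition("'")
        tokens.append(head + "'")   # e.g. "L'"
        if tail:
            tokens.append(tail)     # e.g. "Afrique"
    return tokens


for sentence in ("Il arrive aujourd'hui.", "L'Afrique"):
    print(split_on_apostrophe(sentence))
    print(tokenize_keeping_elisions(sentence))
```

With the second variant, a grammar rule could match "L'" as a single token instead of matching "L" followed by an apostrophe token.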

I can also confirm that skipping Hunspell for
words that have a known POS will not help speed things up.
Experiments confirm that it's the misspelled
words that cause the huge slowdown with Hunspell. Correctly
spelled words add only a little overhead. This is how I
checked:

# Create a sample file of 1000 identical sentences *without* typos.
$ yes "This is a test. Does it work?" | sed 1000q > no-typo.txt

# Create a sample file of 1000 identical sentences *with* typos.
$ yes "Ths is a tesst. Dooes it wrk?" | sed 1000q > typos.txt

# Measure speed with/without Hunspell on the 2 files:

                       no-typo.txt   typos.txt
                      +------------+------------+
 without hunspell (1) | 5.569s     |   5.601s   |
 with hunspell    (2) | 6.235s     | 170.331s   |
                      +------------+------------+

So Hunspell is cheap when there are no typos (only about +12%
overhead here) but *very* expensive (about +2940% with the above numbers)
when there are typos (because it computes suggestions). This confirms that even
if we avoided feeding Hunspell words that have a known POS, it
would not noticeably help speed. Of course, this sample test
contains more typos than a regular text.
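
For reference, the overhead percentages can be recomputed from the four timings in the table with a few lines of Python. This is only a sketch of the arithmetic (relative overhead = (with - without) / without); the function name is made up, and the results may differ by rounding from figures quoted in the text.

```python
def overhead_percent(baseline_s, measured_s):
    """Relative slowdown of measured_s over baseline_s, in percent."""
    return (measured_s - baseline_s) / baseline_s * 100.0


# (without hunspell, with hunspell) in seconds, copied from the table above.
timings = {
    "no-typo.txt": (5.569, 6.235),
    "typos.txt": (5.601, 170.331),
}

for name, (without, with_hunspell) in timings.items():
    print(f"{name}: +{overhead_percent(without, with_hunspell):.0f}%")
```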

The 4 timing numbers in the table were measured as follows
(-d HUNSPELL_RULE disables the Hunspell rule, which gives the
"without hunspell" row):

time java -jar LanguageTool.jar -l en-US -d HUNSPELL_RULE no-typo.txt >/dev/null 2>&1

time java -jar LanguageTool.jar -l en-US no-typo.txt >/dev/null 2>&1

time java -jar LanguageTool.jar -l en-US -d HUNSPELL_RULE typos.txt >/dev/null 2>&1

time java -jar LanguageTool.jar -l en-US typos.txt >/dev/null 2>&1
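
As a rough intuition for why checking is cheap and suggesting is not, here is a toy Python sketch. It is NOT Hunspell's suggestion algorithm: it uses a simple "all strings at edit distance 1" candidate generator and a tiny made-up DICTIONARY, purely to show that checking a word is one lookup while suggesting means generating and testing hundreds of candidates per misspelled word.

```python
import string

# Tiny made-up dictionary, assumed for this sketch only.
DICTIONARY = {"this", "is", "a", "test", "does", "it", "work"}


def edits1(word, alphabet=string.ascii_lowercase):
    """All strings one deletion, transposition, replacement or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def check(word):
    """Spell check: a single set lookup."""
    return word.lower() in DICTIONARY


def suggest(word):
    """Suggestions: generate every candidate, keep those in the dictionary."""
    return sorted(edits1(word.lower()) & DICTIONARY)


for word in ("test", "tesst", "wrk"):
    if check(word):
        print(word, "ok")
    else:
        print(word, "->", suggest(word), f"({len(edits1(word))} candidates tried)")
```

A real spell checker does much more work per candidate (affix handling, larger edit distances, ranking), so the gap between a plain check and a suggestion run is plausibly even larger than in this toy.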

Regards
-- Dominique


------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and
threat landscape has changed and how IT managers can respond. Discussions
will include endpoint security, mobile security and the latest in malware
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/


_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
