Re: MorfologikSpeller

Marcin Miłkowski Wed, 03 Sep 2014 05:46:07 -0700

On 2014-09-03 14:26, R.J. Baars wrote:
>
> Marcin,
>
> I filtered the frequency list down to words found more than 50 times; that
> leaves 800,000 frequencies, about 4 times the size of the internet file.
> It adds about 0.4 MB to the dictionary, which is now 9.7 MB in total.
>
> The dictionary still needs some improvement (e.g. fully uppercase words
> longer than 5 chars are in there, which does not conform to the advice of
> the Dutch Language Union).
> But that is a concern for later.
>
> I added lower- and uppercased words, because I am not sure what algorithms
> are used for case. If the word found is 'Fuond', and 'found' is in the
> dictionary, I assume default behaviour is to suggest 'Found'. Accepted
> forms are 'found', 'Found' and 'FOUND'. (Is that assumption correct?)


Yes.
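
To make that assumption concrete, here is a minimal Python sketch of the case handling as described. It is not the Morfologik code: accepted_forms and recase_like are hypothetical helper names, and the real speller may treat mixed-case input differently.

```python
def accepted_forms(entry: str) -> set:
    """Case variants assumed to be accepted for a lowercase dictionary entry."""
    return {entry, entry.capitalize(), entry.upper()}


def recase_like(typed: str, suggestion: str) -> str:
    """Give a suggestion the same case pattern as the word that was typed."""
    if len(typed) > 1 and typed.isupper():
        return suggestion.upper()        # FUOND -> FOUND
    if typed[:1].isupper():
        return suggestion.capitalize()   # Fuond -> Found
    return suggestion                    # fuond -> found


print(sorted(accepted_forms("found")))
print(recase_like("Fuond", "found"))
```

A keepcase-style entry such as 'tv' would need to bypass accepted_forms, which is what the next point is about.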

>
> I need some words to be accepted only in lowercase, like 'tv', which only
> has the correct forms 'Tv' and 'tv'; 'TV' is wrong. The same goes for some
> other words. (In hunspell I used the keepcase flag on those words.)

Hm, I'm not sure. But you can easily put that into a separate file of
common simple mistakes (for SimpleReplaceRule). I found maintaining such a
file easier than trying to use the same dictionary-search method for
suggestions. That was particularly difficult for two- and three-letter
words, and with a SimpleReplaceRule it's just a matter of putting the
word into the file like this:

TV      tv

And appropriate uppercasing will be applied by the rule anyway.
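
As a rough illustration of the idea (not the SimpleReplaceRule source), the lookup could be sketched in Python as below. The whitespace-separated layout is taken from the line above and the real file format may differ; load_replacements, suggest and the sentence_start parameter are hypothetical names.

```python
def load_replacements(lines):
    """Parse 'wrong<whitespace>right' lines into a dict, skipping blank lines."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        wrong, right = line.split(None, 1)
        table[wrong] = right.strip()
    return table


def suggest(word, table, sentence_start=False):
    """Return the replacement for a listed word, or None if it is not listed."""
    fixed = table.get(word)
    if fixed is None:
        return None
    # Assumption: capitalise only when the word opens a sentence ('TV' -> 'Tv').
    return fixed[:1].upper() + fixed[1:] if sentence_start else fixed


table = load_replacements(["TV      tv"])
print(suggest("TV", table))
print(suggest("TV", table, sentence_start=True))
```

The point of the separate table is that short words never go through the dictionary search at all, so two- and three-letter entries cannot pick up unrelated suggestions.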

>
> So I have now a dictionary to test, and to tune for replacements.
> Is there a way to run a words list through this speller and get the
> suggestions out?

You could simply replace the file for one of the English variants and
run LT on the command line with only the spelling rule enabled. For example,
for British English, enable only MORFOLOGIK_RULE_EN_GB (the
command-line switch is "-e MORFOLOGIK_RULE_EN_GB"). That should be the
easiest way. You can then compare the results with a run on the same file
with the Dutch hunspell enabled (as you don't have to touch the Dutch
files yet).
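
If you want to script this, a small Python sketch could build the command for you. The jar name, its location, the "-l en-GB" language option and the file name wordlist.txt are assumptions here, so adjust them to your checkout; lt_command is a hypothetical helper.

```python
import shlex


def lt_command(text_file, language="en-GB", rule_id="MORFOLOGIK_RULE_EN_GB",
               jar="languagetool-commandline.jar"):
    """Build the argument list for a run with only the spelling rule enabled."""
    return ["java", "-jar", jar, "-l", language, "-e", rule_id, text_file]


cmd = lt_command("wordlist.txt")
# Print the command so it can be checked before it is actually run.
print(shlex.join(cmd))
```

Passing cmd to subprocess.run(cmd, capture_output=True, text=True) would then collect the suggestions from standard output, one word list at a time.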

Marcin

>
> Ruud
>
>> On 2014-09-03 12:30, R.J. Baars wrote:
>>> Marcin,
>>>
>>> For English, there are .info files in /resource/ as well as in
>>> /resource/hunspell.
>>> The first seems to be for the tagging dict, the second for the speller.
>> Ah, of course, there should be one .info file per one .dict file. I
>> thought you were asking about one dictionary file.
>>
>>>
>>> (I would prefer spell-checker for directory name.)
>>>
>>> The content of the info file for Dutch should probably be:
>>> fsa.dict.speller.ignore-numbers=false
>>> fsa.dict.speller.ignore-all-uppercase=false
>>> fsa.dict.speller.ignore-camel-case=true
>>> fsa.dict.speller.ignore-punctuation=false
>> Note: if you don't have all punctuation in your dictionary, this will
>> make the speller complain about all commas, colons, hyphens etc.
>>
>>> fsa.dict.input-conversion=ij &#307;, IJ &#306;
>>
>> You need to use normal Unicode here or Java escaping, not HTML escaping.
>>
>>> fsa.dict.output-conversion=&#307; ij, &#306; IJ
>> Do you have such characters in the dictionary file? If not, then you
>> don't need the output conversion.
>>
>>> fsa.dict.speller.runon-words=false
>>> fsa.dict.speller.locale=nl_NL
>>> fsa.dict.speller.convert-case=false
>>> fsa.dict.speller.ignore-diacritics=true
>>> fsa.dict.speller.replacement-pairs=y &#307;, ei &#307;
>>> fsa.dict.speller.equivalent-chars=
>>> fsa.dict.frequency-included=true
>>> fsa.dict.encoding=utf-8
>>> fsa.dict.separator=
>>> fsa.dict.author=R. Baars;
>>>
>>> I am not sure about separator, equivalent chars and the locale.
>> Separator is just used for internal management (usually it's a plus
>> character). Doesn't really matter unless you want to use "+" as an entry
>> (and you would have to if you have "ignore-punctuation" set to false).
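
On the escaping point for the input-conversion and replacement-pairs lines above: one way to turn HTML numeric entities into Java-style escapes is sketched below in Python. html_to_java_escapes is a hypothetical helper, and the sketch only covers characters in the Basic Multilingual Plane, which is enough for the ij ligatures.

```python
import html
import re


def html_to_java_escapes(value: str) -> str:
    """Rewrite HTML numeric entities such as '&#307;' as Java '\\uXXXX' escapes."""
    def to_java(match):
        # html.unescape turns '&#307;' into the single character U+0133.
        char = html.unescape(match.group(0))
        return "".join("\\u%04x" % ord(c) for c in char)

    return re.sub(r"&#[0-9]+;", to_java, value)


print(html_to_java_escapes("ij &#307;, IJ &#306;"))
```

Writing the characters directly in a UTF-8 file is the other option mentioned above and needs no conversion at all.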
>>
>>> I don't quite get the difference between diacritics, equivalent chars and
>>> replacement pairs. Diacritics seems to me to be part of equivalent chars
>>> and is a kind of automatic replacement.
>> Diacritics is automatic and faster than replacement pairs. Roughly the
>> same as equivalent chars.
>>
>>> ei ij is a replacement, á and a are taken care of by diacritics, and I
>>> guess Dutch does not have equivalents ...
>>>
>>> Right?
>> What about apostrophes? Do you want them normalized or not?
>>
>> Regards,
>> Marcin
>>
>>>
>>>
>>>
>>>> On 2014-09-03 10:58, R.J. Baars wrote:
>>>>> To add the word frequencies, I am directed by the wiki to an address
>>>>> where there is indeed a frequency list. But it has only 187000 words,
>>>>> while I have 1.2 million Dutch words and their frequencies myself.
>>>> Probably the probabilities of their occurrence are quite low. I tried
>>>> replacing that list with a bigger one for Polish and my results indeed
>>>> made the dictionary file bigger but nothing else changed much.
>>>>
>>>>> The frequency is just a number; what is expected there? Is this number
>>>>> a plain ratio, an occurrence count, or something else, like
>>>>> logarithmic? Will I have to convert to that format, or is a plain
>>>>> word<tab>number an option too?
>>>> Log scale, I believe. You might want to filter out some of the lower
>>>> results, as well, as they don't really help and only make files bigger.
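
A possible way to do both steps (filtering and log scaling) is sketched below in Python. The thread only says "log scale, I believe", so the bucket count of 26, the cut-off of 50 and the sample counts are assumptions for illustration, and bucket_frequencies is a hypothetical helper.

```python
import math


def bucket_frequencies(counts, min_count=50, buckets=26):
    """Drop rare words and map the remaining counts to 0..buckets-1 on a log scale."""
    kept = {w: c for w, c in counts.items() if c > min_count}
    if not kept:
        return {}
    low = math.log(min_count)
    top = math.log(max(kept.values()))
    span = (top - low) or 1.0  # avoid dividing by zero for a degenerate list
    return {
        w: min(buckets - 1, int((math.log(c) - low) / span * buckets))
        for w, c in kept.items()
    }


# Made-up counts, only to show the shape of the input.
sample = {"de": 1_000_000, "fiets": 12_000, "zeldzaam": 40}
print(bucket_frequencies(sample))
```

Because the scale is logarithmic, most of the rare words end up in the lowest buckets, which is why dropping them costs little.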
>>>>
>>>> Marcin
>>>>
>>>>> Ruud
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------------
>>>>> Slashdot TV.
>>>>> Video for Nerds.  Stuff that matters.
>>>>> http://tv.slashdot.org/
>>>>> _______________________________________________
>>>>> Languagetool-devel mailing list
>>>>> Languagetool-devel@lists.sourceforge.net
>>>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>
>
>


