Speaking of frequency lists, could we use Google n-grams? They are licensed 
under the Creative Commons Attribution 3.0 Unported License. I don't know how 
that would apply to a derivative work: a hunspell dictionary is basically 
LGPL + MPL, so LGPL + MPL plus this license = ?

Marcin

On 2013-07-16 16:32, Ruud Baars wrote:
> By the way, I could help with word frequencies for some languages,
> e.g. Portuguese, Spanish, Dutch.
>
> Ruud
>
> On 16-07-13 14:20, R.J. Baars wrote:
>> Encoding word frequencies as a single character is fine. I think they
>> should be classes, logarithmic ones as far as I am concerned.
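A minimal sketch of what one-character logarithmic frequency classes might look like. The class `LogFrequencyClass`, the method `encode`, the base-2 steps and the cap at ten classes (A to J) are all assumptions made for illustration; none of this is taken from morfologik.

```java
/** Hypothetical helper: maps a raw corpus count to a one-character logarithmic class. */
final class LogFrequencyClass {
    static final int CLASSES = 10; // assumption: ten classes, 'A' (most frequent) to 'J'

    /**
     * Returns 'A' for words more than half as frequent as the most frequent word,
     * 'B' for words more than a quarter as frequent, and so on in base-2 steps.
     * Everything rarer than the last step ends up in the final class.
     */
    static char encode(long count, long maxCount) {
        if (count <= 0 || count > maxCount) {
            throw new IllegalArgumentException("need 0 < count <= maxCount");
        }
        int cls = 0;
        long c = count;
        // Integer version of floor(log2(maxCount / count)), capped at CLASSES - 1.
        while (cls < CLASSES - 1 && c <= maxCount / 2) {
            c *= 2;
            cls++;
        }
        return (char) ('A' + cls);
    }

    public static void main(String[] args) {
        long max = 1_000_000;
        System.out.println(encode(1_000_000, max)); // A
        System.out.println(encode(300_000, max));   // B
        System.out.println(encode(1, max));         // J
    }
}
```

With logarithmic steps the few very frequent words get classes of their own, while the long tail of rare words shares the last class.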
>>
>> Ruud
>>
>>> On 2013-07-16 00:03, Jaume Ortolà i Font wrote:
>>>> 2013/7/15 Marcin Miłkowski <list-addr...@wp.pl>:
>>>>> Hi Jaume,
>>>>>
>>>>> On 2013-07-15 21:16, Jaume Ortolà i Font wrote:
>>>>>> Hi, Marcin.
>>>>>>
>>>>>> I have tested the current code (1.8.0-SNAPSHOT) and everything is OK,
>>>>>> all the changes are there. Thank you.
>>>>> Great. We'll release 1.7.1; this is just a minor bug fix.
>>>>>
>>>>> BTW, when you see something you want to fix, just make a fork on github
>>>>> to fix it, then file an issue, and then make a pull request associated
>>>>> with that issue. That way, it will be much easier to develop the
>>>>> library with your changes.
>>>> I'll try to do it.
>>>>
>>>>> Also, if you find the time, please use a proper way of removing
>>>>> duplicates (right now we lose information from CandidateData that might
>>>>> be significant for something; I know this is me being fussy, and the
>>>>> current code is quite clean).
>>>> There are different ways to do it:
>>>> - We could test for duplicates in addCandidate()...
>>>> - "candidates" could be a Set, but then it needs to be converted to a
>>>> List to be sorted...
>>> Not really. We can use a TreeSet with a custom comparator:
>>>
>>> http://stackoverflow.com/a/4165893
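A minimal sketch of the TreeSet idea, which keeps candidates sorted and free of exact duplicates in one step. `Candidate` is a hypothetical stand-in for CandidateData; its fields `word` and `distance` are assumptions, not the real class. One caveat: a TreeSet treats two elements as duplicates only when the comparator returns 0, so the comparator has to look at the word as well as the distance, and the same word reached with two different distances would still appear twice.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

final class SortedCandidates {
    /** Hypothetical stand-in for CandidateData: a suggestion and its edit distance. */
    static final class Candidate {
        final String word;
        final int distance;

        Candidate(String word, int distance) {
            this.word = word;
            this.distance = distance;
        }
    }

    // Closest suggestions first; ties are broken alphabetically.
    static final Comparator<Candidate> BY_DISTANCE_THEN_WORD =
        Comparator.<Candidate>comparingInt(c -> c.distance).thenComparing(c -> c.word);

    /** Collects candidates into a TreeSet and returns the words in sorted order. */
    static List<String> sortedWords(List<Candidate> input) {
        TreeSet<Candidate> set = new TreeSet<>(BY_DISTANCE_THEN_WORD);
        set.addAll(input); // elements that compare as equal are silently dropped
        List<String> words = new ArrayList<>();
        for (Candidate c : set) {
            words.add(c.word);
        }
        return words;
    }

    public static void main(String[] args) {
        List<Candidate> input = new ArrayList<>();
        input.add(new Candidate("horse", 2));
        input.add(new Candidate("house", 1));
        input.add(new Candidate("mouse", 1));
        input.add(new Candidate("house", 1)); // exact duplicate
        System.out.println(sortedWords(input)); // [house, mouse, horse]
    }
}
```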
>>>
>>>> If you want to keep the distance information outside Speller.java,
>>>> that's a different matter.
>>>>
>>>>
>>>> The next step for improving the suggestions would be to use a list of
>>>> frequent words. I'm thinking of just a list of manually selected words
>>>> or at most a few thousand words from a frequency dictionary.
>>> Yes. Frequency dictionaries would be very useful.
>>>
>>> I think we can represent frequency classes as ten ranges of percentages
>>> with 10 ASCII characters (A-J), as this would be in the tradition of the
>>> fsa encoding. So "A" would be the most common words (like 'the' and 'a'
>>> in English), etc. I don't think we need a finer resolution here.
>>>
>>> Or we could simply use a numerical percentage in its decimal (rounded)
>>> representation from 000 to 100. This, however, would make the dictionary
>>> slightly bigger.
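The two encodings proposed above can be compared with a short sketch. It assumes that "percentage" means a score from 0 to 100 where 100 is the most common, and that ten classes map to the letters A through J; the class and method names are hypothetical.

```java
import java.util.Locale;

final class FrequencyEncoding {
    /** One letter per ten-point range: 90-100 gives 'A', 80-89 gives 'B', ..., 0-9 gives 'J'. */
    static char asClassLetter(int percent) {
        check(percent);
        int range = Math.min(percent / 10, 9); // 0..9, with 100 folded into the top range
        return (char) ('A' + (9 - range));
    }

    /** Three zero-padded decimal digits: 7 gives "007", 100 gives "100". */
    static String asDecimal(int percent) {
        check(percent);
        return String.format(Locale.ROOT, "%03d", percent);
    }

    private static void check(int percent) {
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be in 0..100: " + percent);
        }
    }

    public static void main(String[] args) {
        for (int p : new int[] {100, 95, 42, 7}) {
            System.out.println(p + " -> " + asClassLetter(p) + " or " + asDecimal(p));
        }
    }
}
```

The letter form costs one character per entry and the decimal form three, which is the size difference mentioned above.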
>>>
>>> Regards,
>>> Marcin
>>>
>>>> Regards,
>>>> Jaume
>>>>
>>>>
>>>>> Regards,
>>>>> Marcin
>>>>>
>>>>>> Now we need a release with the changes, and we'll be able to adapt the
>>>>>> tests.
>>>>>>
>>>>>> Regards,
>>>>>> Jaume Ortolà
>>>>>> www.riuraueditors.cat
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2013/7/15 Marcin Miłkowski <list-addr...@wp.pl>:
>>>>>>> On 2013-07-15 12:41, Jaume Ortolà i Font wrote:
>>>>>>>> Thanks, Marcin.
>>>>>>>>
>>>>>>>> Some remarks. The improvements I sent to the list 15 days ago have
>>>>>>>> not been added, and moreover I have found more bugs.
>>>>>>> I'm really sorry, but there have been 200 mails from the mailing list
>>>>>>> over the last two weeks and I have been away from my e-mail. Could you
>>>>>>> please add your changes as issues on github for morfologik-stemming?
>>>>>>> That would make it much easier for us to track these things.
>>>>>>>
>>>>>>>> I attach the code I'm using now and explain briefly the reasons for
>>>>>>>> the changes.
>>>>>>>>
>>>>>>>> - In the getAllReplacements method we need to make sure that the
>>>>>>>> replacements are done from left to right. We must complete the
>>>>>>>> for-loop over the replacement pairs, choose the first possible
>>>>>>>> replacement (from left to right) and then start the two new branches
>>>>>>>> (with and without replacement). Otherwise, some replacements are not
>>>>>>>> done.
>>>>>>> OK, this sounds OK. I integrated your changes.
>>>>>>>
>>>>>>>> - If there is "ss" as a key in the replacement pairs, and somebody
>>>>>>>> uses a long string of s ("ssssssssss...") as input text, this can
>>>>>>>> cause the method to consume all the memory, as the algorithm is
>>>>>>>> exponential (2^(number of replacements)). This happened to us on an
>>>>>>>> online server, and the LT server crashed. The depth of the recursive
>>>>>>>> algorithm should be limited to 4 or 5 levels at most.
>>>>>>> Is that in getAllReplacements()?
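The two remarks above (leftmost replacement first, and a cap on the recursion depth) can be sketched together as follows. This is not the Speller.java code: the class `ReplacementSketch`, the method `allReplacements`, the constant `MAX_DEPTH` and the Map-based pair table are assumptions made for illustration, and keys are assumed to be non-empty.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

final class ReplacementSketch {
    static final int MAX_DEPTH = 4; // assumption: the "4 or 5 levels" suggested above

    /** Returns the input plus its variants, with at most 2^MAX_DEPTH branches per key. */
    static Set<String> allReplacements(String input, Map<String, String> pairs) {
        Set<String> out = new LinkedHashSet<>();
        expand(input, 0, 0, pairs, out);
        return out;
    }

    private static void expand(String s, int from, int depth,
                               Map<String, String> pairs, Set<String> out) {
        // Finish the loop over all pairs before branching, so that the
        // leftmost possible replacement is the one that gets chosen.
        int pos = -1;
        if (depth < MAX_DEPTH) {
            for (String key : pairs.keySet()) {
                int i = s.indexOf(key, from);
                if (i >= 0 && (pos < 0 || i < pos)) {
                    pos = i;
                }
            }
        }
        if (pos < 0) { // nothing left to replace, or depth limit reached
            out.add(s);
            return;
        }
        // Branch with replacement: apply every key that matches at that position.
        for (Map.Entry<String, String> e : pairs.entrySet()) {
            if (s.startsWith(e.getKey(), pos)) {
                String replaced = s.substring(0, pos) + e.getValue()
                        + s.substring(pos + e.getKey().length());
                expand(replaced, pos + e.getValue().length(), depth + 1, pairs, out);
            }
        }
        // Branch without replacement: skip this position and continue to the right.
        expand(s, pos + 1, depth + 1, pairs, out);
    }

    public static void main(String[] args) {
        Map<String, String> pairs = java.util.Collections.singletonMap("rn", "m");
        System.out.println(allReplacements("modern", pairs)); // [modem, modern]
    }
}
```

With the cap in place, a pathological input such as a long run of "s" with "ss" as a key produces a bounded number of variants instead of a number that doubles with every match.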
>>>>>>>
>>>>>>>> - It is possible that different "words to check" give the same
>>>>>>>> suggestion. So at some point we need to remove duplicates. I do this
>>>>>>>> at the end of findReplacements().
>>>>>>> You are right. We could probably write the same code in a slightly
>>>>>>> more elegant way, without converting this to a LinkedHashSet but simply
>>>>>>> by adding to a set when iterating the list.
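A minimal sketch of removing duplicates by adding to a set while iterating the list. The class and method names are hypothetical, and plain strings stand in for whatever findReplacements() actually collects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class DedupSketch {
    /** Returns the suggestions in their original order, keeping only first occurrences. */
    static List<String> withoutDuplicates(List<String> suggestions) {
        Set<String> seen = new HashSet<>();
        List<String> result = new ArrayList<>();
        for (String s : suggestions) {
            if (seen.add(s)) { // Set.add returns false when s was already present
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("casa", "cosa", "casa", "cas");
        System.out.println(withoutDuplicates(raw)); // [casa, cosa, cas]
    }
}
```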
>>>>>>>
>>>>>>>> - The conditions around line 238 (current github version 1.7) are
>>>>>>>> not correct. The first isInDictionary makes the lower case conversion
>>>>>>>> useless:
>>>>>>>>
>>>>>>>>     if (isInDictionary(wordChecked)
>>>>>>>>         && dictionaryMetadata.isConvertingCase()
>>>>>>>>         && isMixedCase(wordChecked)
>>>>>>>>         && isInDictionary(wordChecked.toLowerCase(dictionaryMetadata.getLocale())))
>>>>>>>>
>>>>>>>> I think they should be something like:
>>>>>>>>
>>>>>>>>     if (isInDictionary(wordChecked)
>>>>>>>>         || (dictionaryMetadata.convertCase
>>>>>>>>             && isMixedCase(wordChecked)
>>>>>>>>             && isInDictionary(wordChecked.toLowerCase(dictionaryMetadata.dictionaryLocale))))
>>>>>>> Fixed!
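To see why the two conditions above differ, here is a small self-contained sketch. The dictionary is a plain Set, `convertCase` is an ordinary boolean, and `isMixedCase` is only a guess at the intent (at least one upper-case and one lower-case letter); none of these are the real Speller.java or DictionaryMetadata members.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class CaseCheckSketch {
    private final Set<String> dictionary;
    private final boolean convertCase;
    private final Locale locale;

    CaseCheckSketch(Set<String> dictionary, boolean convertCase, Locale locale) {
        this.dictionary = dictionary;
        this.convertCase = convertCase;
        this.locale = locale;
    }

    boolean isInDictionary(String word) {
        return dictionary.contains(word);
    }

    /** Assumed meaning: the word has both upper-case and lower-case letters. */
    static boolean isMixedCase(String word) {
        boolean upper = false;
        boolean lower = false;
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (Character.isUpperCase(c)) {
                upper = true;
            } else if (Character.isLowerCase(c)) {
                lower = true;
            }
        }
        return upper && lower;
    }

    /** Original form: the first test must already succeed, so lower-casing adds nothing. */
    boolean originalCondition(String w) {
        return isInDictionary(w) && convertCase && isMixedCase(w)
                && isInDictionary(w.toLowerCase(locale));
    }

    /** Proposed form: accept the word as typed, or its lower-cased variant. */
    boolean proposedCondition(String w) {
        return isInDictionary(w)
                || (convertCase && isMixedCase(w) && isInDictionary(w.toLowerCase(locale)));
    }

    public static void main(String[] args) {
        CaseCheckSketch check = new CaseCheckSketch(
                new HashSet<>(Arrays.asList("house")), true, Locale.ENGLISH);
        System.out.println(check.originalCondition("House")); // false
        System.out.println(check.proposedCondition("House")); // true
    }
}
```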
>>>>>>>
>>>>>>> I tried to add your fixes but your code is now quite far away from
>>>>>>> ours, so diff does not give any meaningful output. Please review the
>>>>>>> code on github, and if needed, file an issue over changes that need to
>>>>>>> be done.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Marcin
>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Jaume Ortolà
>>>>>>>> www.riuraueditors.cat
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 2013/7/15 Marcin Miłkowski <list-addr...@wp.pl>:
>>>>>>>>> On 2013-07-15 10:56, Marcin Miłkowski wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Dawid just released morfologik 1.7 on Maven. So we can actually go
>>>>>>>>>> on and include a newer version in LT.
>>>>>>>>>>
>>>>>>>>>> The new version still does not support compounding but it has all
>>>>>>>>>> the features required for getting better diacritic suggestions.
>>>>>>>>> Here's the documentation:
>>>>>>>>>
>>>>>>>>> http://wiki.languagetool.org/hunspell-support#toc5
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Marcin
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Marcin
>>>>>>>>>>
>>>>>>>>>> On 2013-07-02 08:59, Marcin Miłkowski wrote:
>>>>>>>>>>> On 2013-07-02 01:11, Jaume Ortolà i Font wrote:
>>>>>>>>>>>> Hi Marcin,
>>>>>>>>>>>>
>>>>>>>>>>>> I have been using the still unreleased code of
>>>>>>>>>>>> morfologik-stemming and I have made improvements to
>>>>>>>>>>>> Speller.java for some previously unforeseen cases. See the
>>>>>>>>>>>> attachment.
>>>>>>>>>>>>
>>>>>>>>>>>> In order to complete the development, and test & debug with all
>>>>>>>>>>>> languages, perhaps we could temporarily include the morfologik
>>>>>>>>>>>> module inside LanguageTool. This will make things easier. What
>>>>>>>>>>>> do you think?
>>>>>>>>>>> No. I should make a release; forking morfologik makes no sense to
>>>>>>>>>>> me.
>>>>>>>>>>>
>>>>>>>>>>> The only thing that stops me is the lack of time to work on
>>>>>>>>>>> compounds.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Marcin
>>>>>>>>>>>
>>>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>> This SF.net email is sponsored by Windows:
>>>>>>>>>>>
>>>>>>>>>>> Build for Windows Store.
>>>>>>>>>>>
>>>>>>>>>>> http://p.sf.net/sfu/windows-dev2dev
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Languagetool-devel mailing list
>>>>>>>>>>> Languagetool-devel@lists.sourceforge.net
>>>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>>>>>>>>>>>
>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>> See everything from the browser to the database with AppDynamics
>>>>>>>>> Get end-to-end visibility with application monitoring from
>>>>>>>>> AppDynamics
>>>>>>>>> Isolate bottlenecks and diagnose root cause in seconds.
>>>>>>>>> Start your free trial of AppDynamics Pro today!
>>>>>>>>> http://pubads.g.doubleclick.net/gampad/clk?id=48808831&iu=/4140/ostg.clktrk
>>>>>>>>> _______________________________________________
>>>>>>>>> Languagetool-devel mailing list
>>>>>>>>> Languagetool-devel@lists.sourceforge.net
>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>>>>>>>>>


