Speaking of frequency lists, could we use Google n-grams? The license is the Creative Commons Attribution 3.0 Unported License. I don't know how this would apply to a derivative work: a hunspell dictionary is basically LGPL + MPL, so what do we get when we add this one on top?
Marcin

On 2013-07-16 16:32, Ruud Baars wrote:
> By the way, I could help with word frequencies for some languages,
> e.g. Portuguese, Spanish, Dutch.
>
> Ruud
>
> On 16-07-13 14:20, R.J. Baars wrote:
>> Coding word frequencies as a character is fine. I think it would be
>> classes, logarithmic as far as I am concerned.
>>
>> Ruud
>>
>>> On 2013-07-16 00:03, Jaume Ortolà i Font wrote:
>>>> 2013/7/15 Marcin Miłkowski <list-addr...@wp.pl>:
>>>>> Hi Jaume,
>>>>>
>>>>> On 2013-07-15 21:16, Jaume Ortolà i Font wrote:
>>>>>> Hi, Marcin.
>>>>>>
>>>>>> I have tested the current code (1.8.0-SNAPSHOT) and everything is OK,
>>>>>> all the changes are there. Thank you.
>>>>> Great. We'll release 1.7.1, this is just a minor bug fix.
>>>>>
>>>>> BTW, when you see something you want to fix, just make a fork on github
>>>>> to fix it, then file an issue, and then make a pull request associated
>>>>> with that issue. That way, it will be much easier to develop the
>>>>> library with your changes.
>>>> I'll try to do it.
>>>>
>>>>> Also, if you find time, please use a proper way of removing duplicates
>>>>> (now we lose information from CandidateData that might be significant
>>>>> for something - I know this is me being fussy, this is quite clean).
>>>> There are different ways to do it:
>>>> - We could test for duplicates in addCandidate()...
>>>> - "candidates" could be a Set, but then it needs to be converted to a
>>>> List to be sorted...
>>> Not really. We can use a TreeSet with a custom comparator:
>>>
>>> http://stackoverflow.com/a/4165893
>>>
>>>> If you want to keep the distance information outside Speller.java,
>>>> that's a different matter.
>>>>
>>>>
>>>> The next step for improving the suggestions would be to use a list of
>>>> frequent words. I'm thinking of just a list of manually selected words
>>>> or at most a few thousand words from a frequency dictionary.
>>> Yes. Frequency dictionaries would be very useful.
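To make the "frequency as one character, in logarithmic classes" idea concrete, here is a minimal sketch. It is not morfologik code: the class name FrequencyClass, the choice of ten classes ('A' to 'J') and the powers-of-ten boundaries relative to the most frequent word are all assumptions made for illustration.

```java
/** Hypothetical helper, not part of morfologik: maps corpus counts to class letters. */
final class FrequencyClass {
    /** Assumed number of classes: 'A' (most frequent) through 'J' (rarest). */
    static final int CLASSES = 10;

    /**
     * 'A' means the word is less than ten times rarer than the most frequent
     * word, 'B' less than a hundred times rarer, and so on. Everything beyond
     * the last boundary is clamped into the final class.
     */
    static char encode(long count, long maxCount) {
        if (count <= 0 || count > maxCount) {
            throw new IllegalArgumentException("expected 0 < count <= maxCount");
        }
        int cls = 0;
        long scaled = count;
        // Comparing against maxCount / 10 (instead of computing scaled * 10 first)
        // keeps the multiplication below from overflowing.
        while (cls < CLASSES - 1 && scaled <= maxCount / 10) {
            scaled *= 10;
            cls++;
        }
        return (char) ('A' + cls);
    }

    /** Returns 0 for 'A' up to 9 for 'J'; a lower value means a more frequent word. */
    static int decode(char letter) {
        if (letter < 'A' || letter >= 'A' + CLASSES) {
            throw new IllegalArgumentException("not a frequency class: " + letter);
        }
        return letter - 'A';
    }
}
```

With a top count of 1,000,000, a word seen 100,000 times gets 'B' and a word seen once gets 'G'. A speller could then order otherwise equal suggestions by the decoded value.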
>>>
>>> I think we can represent frequency classes as ten ranges of percentages
>>> with 10 ASCII characters (A-J), as this would be in the tradition of the
>>> fsa encoding. So "A" would be the most common words (like 'the' and 'a'
>>> in English), etc. I think we don't need to have a better resolution here.
>>>
>>> Or we could simply use a numerical percentage in its decimal (rounded)
>>> representation from 000 to 100. This, however, would make the dictionary
>>> slightly bigger.
>>>
>>> Regards,
>>> Marcin
>>>
>>>> Regards,
>>>> Jaume
>>>>
>>>>
>>>>> Regards,
>>>>> Marcin
>>>>>
>>>>>> Now we need a release with the changes, and we'll be able to adapt the
>>>>>> tests.
>>>>>>
>>>>>> Regards,
>>>>>> Jaume
>>>>>> Salutacions,
>>>>>> Jaume Ortolà
>>>>>> www.riuraueditors.cat
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2013/7/15 Marcin Miłkowski <list-addr...@wp.pl>:
>>>>>>> On 2013-07-15 12:41, Jaume Ortolà i Font wrote:
>>>>>>>> Thanks, Marcin.
>>>>>>>>
>>>>>>>> Some remarks. The improvements I sent to the list 15 days ago have
>>>>>>>> not been added, and moreover I have found more bugs.
>>>>>>> I'm really sorry but there are 200 mails from the mailing list over
>>>>>>> the last two weeks and I have been away from my e-mail. Could you
>>>>>>> please add your changes as issues on github for morfologik-stemming?
>>>>>>> That would make it much easier for us to track these things.
>>>>>>>
>>>>>>>> I attach the code I'm using now and explain briefly the reasons for
>>>>>>>> the changes.
>>>>>>>>
>>>>>>>> - In the getAllReplacements method we need to make sure that the
>>>>>>>> replacements are done from left to right. We must complete the
>>>>>>>> for-loop of the replacement pairs, choose the first possible
>>>>>>>> replacement (from left to right) and then start the two new branches
>>>>>>>> (with and without replacement). Otherwise, some replacements are not
>>>>>>>> done.
>>>>>>> OK, this sounds OK. I integrated your changes.
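A minimal sketch of the left-to-right branching described in the item above, not the actual getAllReplacements() code. The class BoundedReplacements, the method names and the cap MAX_DEPTH = 4 are hypothetical; the cap is added here so that the number of generated variants stays bounded even for inputs with many possible matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not the morfologik Speller implementation. */
final class BoundedReplacements {
    /** Assumed cap on branching decisions; at most 2^MAX_DEPTH variants are produced. */
    static final int MAX_DEPTH = 4;

    /** Returns the variants of word obtained by applying or skipping replacements. Keys must be non-empty. */
    static List<String> allReplacements(String word, Map<String, String> pairs) {
        List<String> out = new ArrayList<>();
        expand(word, 0, 0, pairs, out);
        return out;
    }

    private static void expand(String word, int from, int depth,
                               Map<String, String> pairs, List<String> out) {
        if (depth >= MAX_DEPTH) {
            out.add(word); // stop branching, keep what we have so far
            return;
        }
        // Complete the loop over all pairs first, then pick the leftmost match.
        // If two keys match at the same position, the first one iterated wins.
        int bestPos = -1;
        String bestKey = null;
        for (String key : pairs.keySet()) {
            int pos = word.indexOf(key, from);
            if (pos >= 0 && (bestPos < 0 || pos < bestPos)) {
                bestPos = pos;
                bestKey = key;
            }
        }
        if (bestKey == null) {
            out.add(word); // nothing left to replace
            return;
        }
        String value = pairs.get(bestKey);
        String replaced = word.substring(0, bestPos) + value
                + word.substring(bestPos + bestKey.length());
        // Branch 1: apply the replacement and continue to the right of it.
        expand(replaced, bestPos + value.length(), depth + 1, pairs, out);
        // Branch 2: skip this match and continue one character further right.
        expand(word, bestPos + 1, depth + 1, pairs, out);
    }
}
```

Both branches count towards the depth, so a long run of matching characters yields at most 16 variants with the assumed cap instead of growing exponentially with the input length.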
>>>>>>>
>>>>>>>> - If there is "ss" as a key in the replacement pairs, and somebody
>>>>>>>> uses a long string of s ("ssssssssss...") as input text, this can
>>>>>>>> cause the method to consume all the memory, as the algorithm is
>>>>>>>> exponential (2^(number of replacements)). This happened to us in an
>>>>>>>> online server, and the LT server crashed. The depth of the recursive
>>>>>>>> algorithm should be limited to 4 or 5 levels at most.
>>>>>>> Is that in getAllReplacements()?
>>>>>>>
>>>>>>>> - It is possible that different "words to check" give the same
>>>>>>>> suggestion. So at some point we need to remove duplicates. I do this
>>>>>>>> at the end of findReplacements().
>>>>>>> You are right. We could probably write the same code in a slightly
>>>>>>> more elegant way, without converting this to a LinkedHashSet but
>>>>>>> simply by adding to a set when iterating the list.
>>>>>>>
>>>>>>>> - The conditions around line 238 (current github version 1.7) are
>>>>>>>> not correct. The first isInDictionary makes the lower case conversion
>>>>>>>> useless:
>>>>>>>>
>>>>>>>> if (isInDictionary(wordChecked)
>>>>>>>>     && dictionaryMetadata.isConvertingCase()
>>>>>>>>     && isMixedCase(wordChecked)
>>>>>>>>     && isInDictionary(wordChecked.toLowerCase(dictionaryMetadata.getLocale())))
>>>>>>>>
>>>>>>>> I think they should be something like:
>>>>>>>>
>>>>>>>> if (isInDictionary(wordChecked)
>>>>>>>>     || (dictionaryMetadata.convertCase
>>>>>>>>         && isMixedCase(wordChecked)
>>>>>>>>         && isInDictionary(wordChecked
>>>>>>>>             .toLowerCase(dictionaryMetadata.dictionaryLocale))))
>>>>>>> Fixed!
>>>>>>>
>>>>>>> I tried to add your fixes but your code is now quite far away from
>>>>>>> ours, so diff does not give any meaningful output. Please review the
>>>>>>> code on github, and if needed, file an issue over changes that need
>>>>>>> to be done.
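On removing duplicate suggestions: a minimal sketch of the TreeSet-with-comparator approach mentioned in this thread. The Candidate class below is a hypothetical stand-in (it is not the real CandidateData), and ordering by distance and then by word is an assumption about what the sort order should be.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical stand-in for the speller's candidate type. */
final class Candidate {
    final String word;
    final int distance;

    Candidate(String word, int distance) {
        this.word = word;
        this.distance = distance;
    }
}

final class CandidateDedup {
    // A TreeSet treats two elements as equal when the comparator returns 0,
    // so an identical (distance, word) pair is stored only once. The same word
    // with two different distances would still appear twice.
    static final Comparator<Candidate> BY_DISTANCE_THEN_WORD = (a, b) -> {
        int byDistance = Integer.compare(a.distance, b.distance);
        return byDistance != 0 ? byDistance : a.word.compareTo(b.word);
    };

    /** Returns the candidate words ordered by distance, then alphabetically, without exact duplicates. */
    static List<String> sortedUnique(List<Candidate> candidates) {
        TreeSet<Candidate> set = new TreeSet<>(BY_DISTANCE_THEN_WORD);
        set.addAll(candidates);
        List<String> words = new ArrayList<>();
        for (Candidate c : set) {
            words.add(c.word);
        }
        return words;
    }
}
```

The set keeps the candidates sorted while they are added, so no separate conversion to a List is needed for sorting, and the distance stays attached to each candidate.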
>>>>>>>
>>>>>>> Regards,
>>>>>>> Marcin
>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Jaume Ortolà
>>>>>>>> Salutacions,
>>>>>>>> Jaume Ortolà
>>>>>>>> www.riuraueditors.cat
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 2013/7/15 Marcin Miłkowski <list-addr...@wp.pl>:
>>>>>>>>> On 2013-07-15 10:56, Marcin Miłkowski wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Dawid just released morfologik 1.7 on Maven. So we can actually go
>>>>>>>>>> on and include a newer version in LT.
>>>>>>>>>>
>>>>>>>>>> The new version still does not support compounding but it has all
>>>>>>>>>> the features required for getting better diacritic suggestions.
>>>>>>>>> Here's the documentation:
>>>>>>>>>
>>>>>>>>> http://wiki.languagetool.org/hunspell-support#toc5
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Marcin
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Marcin
>>>>>>>>>>
>>>>>>>>>> On 2013-07-02 08:59, Marcin Miłkowski wrote:
>>>>>>>>>>> On 2013-07-02 01:11, Jaume Ortolà i Font wrote:
>>>>>>>>>>>> Hi Marcin,
>>>>>>>>>>>>
>>>>>>>>>>>> I have been using the still unreleased code of
>>>>>>>>>>>> morfologik-stemming and I have made improvements to
>>>>>>>>>>>> Speller.java for some previously unforeseen cases. See the
>>>>>>>>>>>> attachment.
>>>>>>>>>>>>
>>>>>>>>>>>> In order to complete the development, and test & debug with all
>>>>>>>>>>>> languages, perhaps we could temporarily include the morfologik
>>>>>>>>>>>> module inside LanguageTool. This will make things easier. What
>>>>>>>>>>>> do you think?
>>>>>>>>>>> No. I should make a release, forking morfologik makes no sense to
>>>>>>>>>>> me.
>>>>>>>>>>>
>>>>>>>>>>> The only thing that stops me is the lack of time to work on
>>>>>>>>>>> compounds.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Marcin
>>>>>>>>>>>
>>>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>>>> This SF.net email is sponsored by Windows:
>>>>>>>>>>> Build for Windows Store.
>>>>>>>>>>> http://p.sf.net/sfu/windows-dev2dev
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Languagetool-devel mailing list
>>>>>>>>>>> Languagetool-devel@lists.sourceforge.net
>>>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>>>>>>>>>>>
>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>> See everything from the browser to the database with AppDynamics
>>>>>>>>> Get end-to-end visibility with application monitoring from AppDynamics
>>>>>>>>> Isolate bottlenecks and diagnose root cause in seconds.
>>>>>>>>> Start your free trial of AppDynamics Pro today!
>>>>>>>>> http://pubads.g.doubleclick.net/gampad/clk?id=48808831&iu=/4140/ostg.clktrk
>>>>>>>>> _______________________________________________
>>>>>>>>> Languagetool-devel mailing list
>>>>>>>>> Languagetool-devel@lists.sourceforge.net
>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel