Coding word frequencies as a single character is fine. I think the characters should stand for frequency classes, on a logarithmic scale as far as I am concerned.
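
To make the logarithmic idea concrete, here is a rough sketch. It is not code from morfologik: the class name FrequencyClassSketch, the encode method, the letter range A to K and the factor-of-two width of each class are all assumptions chosen for illustration. Each class holds words that are about half as frequent as those in the class before it, relative to the most frequent word in the corpus.

```java
public final class FrequencyClassSketch {
    /** Hypothetical number of classes: the letters 'A' through 'K'. */
    public static final int CLASS_COUNT = 11;

    /**
     * Maps a raw corpus count to a class letter.
     * 'A' is the most frequent class, 'K' the rarest.
     * Class k holds words that are between 2^k and 2^(k+1) times
     * rarer than the most frequent word (assumed class width).
     */
    public static char encode(long count, long maxCount) {
        if (count <= 0 || maxCount <= 0) {
            // Unseen words fall into the rarest class.
            return (char) ('A' + CLASS_COUNT - 1);
        }
        // Integer ratio >= 1; counts above maxCount are clamped.
        long ratio = maxCount / Math.min(count, maxCount);
        // floor(log2(ratio)) computed without floating point.
        int cls = 63 - Long.numberOfLeadingZeros(ratio);
        return (char) ('A' + Math.min(cls, CLASS_COUNT - 1));
    }

    public static void main(String[] args) {
        long max = 1_000_000;
        for (long c : new long[] {1_000_000, 400_000, 30_000, 1_000, 3}) {
            System.out.println(c + " -> " + encode(c, max));
        }
    }
}
```

With a factor of two per class, eleven letters only cover a frequency range of about 1:1000. A larger base per class would stretch the same letters over a wider range at the cost of resolution among the common words.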
Ruud

> On 2013-07-16 00:03, Jaume Ortolà i Font wrote:
>> 2013/7/15 Marcin Miłkowski <list-addr...@wp.pl>:
>>> Hi Jaume,
>>>
>>> On 2013-07-15 21:16, Jaume Ortolà i Font wrote:
>>>> Hi, Marcin.
>>>>
>>>> I have tested the current code (1.8.0-SNAPSHOT) and everything is OK,
>>>> all the changes are there. Thank you.
>>>
>>> Great. We'll release 1.7.1, this is just a minor bug fix.
>>>
>>> BTW, when you see something you want to fix, just make a fork on github
>>> to fix it, then file an issue, and then make a pull request associated
>>> with that issue. That way, it will be much easier to develop the
>>> library with your changes.
>>
>> I'll try to do it.
>>
>>> Also, if you'll find time to use a proper way of removing duplicates
>>> (now we lose information from CandidateData that might be significant
>>> for something - I know this is me being fussy, this is quite clean).
>>
>> There are different ways to do it:
>> - We could test for duplicates in addCandidate()...
>> - "candidates" could be a Set, but then it needs to be converted to a
>>   List to be sorted...
>
> Not really. We can use a TreeSet with a custom comparator:
>
> http://stackoverflow.com/a/4165893
>
>>
>> If you want to keep the distance information outside Speller.java,
>> that's a different matter.
>>
>>
>> The next step for improving the suggestions would be to use a list of
>> frequent words. I'm thinking of just a list of manually selected words
>> or at most a few thousand words from a frequency dictionary.
>
> Yes. Frequency dictionaries would be very useful.
>
> I think we can represent frequency classes as ten ranges of percentages
> with 10 ASCII characters (A-K), as this would be in the tradition of the
> fsa encoding. So "A" would be the most common words (like 'the' and 'a'
> in English), etc. I think we don't need to have a better resolution here.
>
> Or we could simply use a numerical percentage in its decimal (rounded)
> representation from 000 to 100.
> This, however, would make the dictionary slightly bigger.
>
> Regards,
> Marcin
>
>>
>> Regards,
>> Jaume
>>
>>
>>> Regards,
>>> Marcin
>>>
>>>>
>>>> Now we need a release with the changes, and we'll be able to adapt the
>>>> tests.
>>>>
>>>> Regards,
>>>> Jaume
>>>> Salutacions,
>>>> Jaume Ortolà
>>>> www.riuraueditors.cat
>>>>
>>>>
>>>>
>>>> 2013/7/15 Marcin Miłkowski <list-addr...@wp.pl>:
>>>>> On 2013-07-15 12:41, Jaume Ortolà i Font wrote:
>>>>>> Thanks, Marcin.
>>>>>>
>>>>>> Some remarks. The improvements I sent to the list 15 days ago have
>>>>>> not been added, and moreover I have found more bugs.
>>>>> I'm really sorry but there are 200 mails from the mailing list over
>>>>> the last two weeks and I have been away from my e-mail. Could you
>>>>> please add your changes as issues on github for morfologik-stemming?
>>>>> This way it would make it much easier for us to track these things.
>>>>>
>>>>>> I attach the code I'm using now and explain briefly the reasons for
>>>>>> the changes.
>>>>>>
>>>>>> - In the getAllReplacements method we need to make sure that the
>>>>>> replacements are done from left to right. We must complete the
>>>>>> for-loop of the replacement pairs, choose the first possible
>>>>>> replacement (from left to right) and then start the two new branches
>>>>>> (with and without replacement). Otherwise, some replacements are not
>>>>>> done.
>>>>> OK, this sounds OK. I integrated your changes.
>>>>>
>>>>>> - If there is "ss" as a key in the replacement pairs, and somebody
>>>>>> uses a long string of s ("ssssssssss...") as input text, this can
>>>>>> cause the method to consume all the memory, as the algorithm is
>>>>>> exponential (2^(number of replacements)). This happened to us in an
>>>>>> online server, and the LT server crashed. The depth of the recursive
>>>>>> algorithm should be limited to 4 or 5 levels at most.
>>>>> Is that in getAllReplacements()?
>>>>>
>>>>>> - It is possible that different "words to check" give the same
>>>>>> suggestion. So at some point we need to remove duplicates. I do this
>>>>>> at the end of findReplacements().
>>>>> You are right. We could probably write the same code in a slightly
>>>>> more elegant way, without converting this to a LinkedHashSet but
>>>>> simply by adding to a set when iterating the list.
>>>>>
>>>>>> - The conditions around line 238 (current github version 1.7) are
>>>>>> not correct. The first isInDictionary makes the lower case conversion
>>>>>> useless:
>>>>>>
>>>>>> if (isInDictionary(wordChecked)
>>>>>>     && dictionaryMetadata.isConvertingCase()
>>>>>>     && isMixedCase(wordChecked)
>>>>>>     && isInDictionary(wordChecked.toLowerCase(dictionaryMetadata.getLocale())))
>>>>>>
>>>>>> I think they should be something like:
>>>>>>
>>>>>> if (isInDictionary(wordChecked)
>>>>>>     || (dictionaryMetadata.convertCase
>>>>>>         && isMixedCase(wordChecked)
>>>>>>         && isInDictionary(wordChecked
>>>>>>             .toLowerCase(dictionaryMetadata.dictionaryLocale))))
>>>>> Fixed!
>>>>>
>>>>> I tried to add your fixes but your code is now quite far away from
>>>>> ours, so diff does not give any meaningful output. Please review the
>>>>> code on github, and if needed, file an issue over changes that need
>>>>> to be done.
>>>>>
>>>>> Regards,
>>>>> Marcin
>>>>>
>>>>>> Regards,
>>>>>> Jaume Ortolà
>>>>>> Salutacions,
>>>>>> Jaume Ortolà
>>>>>> www.riuraueditors.cat
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2013/7/15 Marcin Miłkowski <list-addr...@wp.pl>:
>>>>>>> On 2013-07-15 10:56, Marcin Miłkowski wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Dawid just released morfologik 1.7 on Maven. So we can actually go
>>>>>>>> on and include a newer version in LT.
>>>>>>>>
>>>>>>>> The new version still does not support compounding but it has all
>>>>>>>> the features required for getting better diacritic suggestions.
>>>>>>> Here's the documentation:
>>>>>>>
>>>>>>> http://wiki.languagetool.org/hunspell-support#toc5
>>>>>>>
>>>>>>> Best,
>>>>>>> Marcin
>>>>>>>
>>>>>>>
>>>>>>>> Best,
>>>>>>>> Marcin
>>>>>>>>
>>>>>>>> On 2013-07-02 08:59, Marcin Miłkowski wrote:
>>>>>>>>> On 2013-07-02 01:11, Jaume Ortolà i Font wrote:
>>>>>>>>>> Hi Marcin,
>>>>>>>>>>
>>>>>>>>>> I have been using the still unreleased code of
>>>>>>>>>> morfologik-stemming and I have made improvements to
>>>>>>>>>> Speller.java for some previously unforeseen cases. See the
>>>>>>>>>> attachment.
>>>>>>>>>>
>>>>>>>>>> In order to complete the development, and test & debug with all
>>>>>>>>>> languages, perhaps we could include temporarily the morfologik
>>>>>>>>>> module inside LanguageTool. This will make things easier. What
>>>>>>>>>> do you think?
>>>>>>>>> No. I should make a release, forking morfologik makes no sense to
>>>>>>>>> me.
>>>>>>>>>
>>>>>>>>> The only thing that stops me is the lack of time to work on
>>>>>>>>> compounds.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Marcin
>>>>>>>>>
>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> This SF.net email is sponsored by Windows:
>>>>>>>>>
>>>>>>>>> Build for Windows Store.
>>>>>>>>>
>>>>>>>>> http://p.sf.net/sfu/windows-dev2dev
>>>>>>>>> _______________________________________________
>>>>>>>>> Languagetool-devel mailing list
>>>>>>>>> Languagetool-devel@lists.sourceforge.net
>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel