Coding word frequencies as a single character is fine. I think the values
should be frequency classes, on a logarithmic scale as far as I am concerned.
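
To make the idea concrete, here is a minimal sketch of logarithmic frequency classes stored as one character. Everything in it is an assumption for illustration: the class name FrequencyClassSketch, the method encode, the choice of ten classes 'A' to 'J', and the base-10 scale (one class per factor of ten below the most frequent word). It is not taken from morfologik or LanguageTool.

```java
final class FrequencyClassSketch {
    /** Number of classes; 'A' is the most frequent, 'J' the rarest (assumed). */
    static final int CLASSES = 10;

    /**
     * Maps a raw corpus count to a class character. A word falls one class
     * lower for every factor of ten it is rarer than the most frequent word.
     */
    static char encode(long count, long maxCount) {
        char rarest = (char) ('A' + CLASSES - 1);
        if (count <= 0 || maxCount <= 0) {
            return rarest;                 // unseen words go to the last class
        }
        if (count >= maxCount) {
            return 'A';
        }
        // Number of whole decades between this word and the top word.
        int decades = (int) Math.floor(Math.log10((double) maxCount / count));
        return (char) ('A' + Math.min(decades, CLASSES - 1));
    }
}
```

With a top count of 1,000,000, a word seen 100,000 times would get 'B' and a word seen once would get 'G'. A different base (for example 2) would give finer classes at the top of the list.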

Ruud

> On 2013-07-16 00:03, Jaume Ortolà i Font wrote:
>> 2013/7/15 Marcin Miłkowski <list-addr...@wp.pl>:
>>> Hi Jaume,
>>>
>>> On 2013-07-15 21:16, Jaume Ortolà i Font wrote:
>>>> Hi, Marcin.
>>>>
>>>> I have tested the current code (1.8.0-SNAPSHOT) and everything is OK,
>>>> all the changes are there. Thank you.
>>>
>>> Great. We'll release 1.7.1, this is just a minor bug fix.
>>>
>>> BTW, when you see something you want to fix, just make a fork on github
>>> to fix it, then file an issue, and then make a pull request associated
>>> with that issue. That way, it will be much easier to develop the library
>>> with your changes.
>>
>> I'll try to do it.
>>
>>> Also, it would be good if you could find time to remove duplicates in a
>>> proper way (right now we lose information from CandidateData that might
>>> be significant for something. I know this is me being fussy; the
>>> current code is quite clean).
>>
>> There are different ways to do it:
>> - We could test for duplicates in addCandidate()...
>> - "candidates" could be a Set, but then it needs to be converted to a
>> List to be sorted...
>
> Not really. We can use a TreeSet with a custom comparator:
>
> http://stackoverflow.com/a/4165893
>
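
A minimal sketch of the TreeSet idea, assuming a hypothetical Candidate type with a word and an edit distance (the real CandidateData in Speller.java may look different). The comparator orders by distance, then by word, so the set is always sorted and never holds two elements that compare as equal.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

final class CandidateSetSketch {
    /** Hypothetical stand-in for the speller's candidate type. */
    static final class Candidate {
        final String word;
        final int distance;

        Candidate(String word, int distance) {
            this.word = word;
            this.distance = distance;
        }
    }

    // Closest candidates first, ties broken alphabetically. The TreeSet
    // treats two candidates as duplicates when this comparator returns 0.
    static final Comparator<Candidate> BY_DISTANCE_THEN_WORD =
        Comparator.<Candidate>comparingInt(c -> c.distance)
                  .thenComparing(c -> c.word);

    static List<String> sortedUniqueWords(List<Candidate> found) {
        TreeSet<Candidate> candidates = new TreeSet<>(BY_DISTANCE_THEN_WORD);
        candidates.addAll(found);          // duplicates are dropped on insert
        List<String> words = new ArrayList<>();
        for (Candidate c : candidates) {
            words.add(c.word);
        }
        return words;
    }
}
```

One caveat with this comparator: the same word reached with two different distances counts as two elements. If that can happen, keeping the smallest distance per word in a Map before sorting would be one way around it.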
>>
>> If you want to keep the distance information outside Speller.java,
>> that's a different matter.
>>
>>
>> The next step for improving the suggestions would be to use a list of
>> frequent words. I'm thinking of just a list of manually selected words
>> or at most a few thousand words from a frequency dictionary.
>
> Yes. Frequency dictionaries would be very useful.
>
> I think we can represent frequency classes as ten ranges of percentages
> with 10 ASCII characters (A-J), as this would be in the tradition of the
> fsa encoding. So "A" would be the most common words (like 'the' and 'a'
> in English), etc. I think we don't need to have a better resolution here.
>
> Or we could simply use a numerical percentage in its decimal (rounded)
> representation from 000 to 100. This, however, would make the dictionary
> slightly bigger.
>
> Regards,
> Marcin
>
>>
>> Regards,
>> Jaume
>>
>>
>>> Regards,
>>> Marcin
>>>
>>>>
>>>> Now we need a release with the changes, and we'll be able to adapt the
>>>> tests.
>>>>
>>>> Regards,
>>>> Jaume
>>>> Salutacions,
>>>> Jaume Ortolà
>>>> www.riuraueditors.cat
>>>>
>>>>
>>>>
>>>> 2013/7/15 Marcin Miłkowski <list-addr...@wp.pl>:
>>>>> On 2013-07-15 12:41, Jaume Ortolà i Font wrote:
>>>>>> Thanks, Marcin.
>>>>>>
>>>>>> Some remarks. The improvements I sent to the list 15 days ago have not
>>>>>> been added, and moreover I have found more bugs.
>>>>> I'm really sorry, but there have been 200 mails from the mailing list
>>>>> over the last two weeks and I have been away from my e-mail. Could you
>>>>> please add your changes as issues on github for morfologik-stemming?
>>>>> This would make it much easier for us to track these things.
>>>>>
>>>>>> I attach the code I'm using now and explain briefly the reasons for
>>>>>> the changes.
>>>>>>
>>>>>> - In the getAllReplacements method we need to make sure that the
>>>>>> replacements are done from left to right. We must complete the
>>>>>> for-loop of the replacement pairs, choose the first possible
>>>>>> replacement (from left to right) and then start the two new branches
>>>>>> (with and without replacement). Otherwise, some replacements are not
>>>>>> done.
>>>>> OK, this sounds OK. I integrated your changes.
>>>>>
>>>>>> - If there is "ss" as a key in the replacement pairs, and somebody
>>>>>> uses a long string of "s" characters ("ssssssssss...") as input text,
>>>>>> this can cause the method to consume all the memory, as the algorithm
>>>>>> is exponential (2^(number of replacements)). This happened to us on an
>>>>>> online server, and the LT server crashed. The depth of the recursive
>>>>>> algorithm should be limited to 4 or 5 levels at most.
>>>>> Is that in getAllReplacements()?
>>>>>
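
The two points above (leftmost replacement first, bounded recursion) could be combined roughly as follows. This is only a sketch under assumptions: ReplacementSketch, allReplacements, expand and MAX_DEPTH are hypothetical names, the replacement pairs are passed in as a plain Map, and this is not the code in Speller.java.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class ReplacementSketch {
    /** Assumed limit on branching decisions; bounds the output to 2^MAX_DEPTH. */
    static final int MAX_DEPTH = 4;

    static List<String> allReplacements(String word, Map<String, String> pairs) {
        List<String> out = new ArrayList<>();
        expand(word, 0, 0, pairs, out);
        return out;
    }

    private static void expand(String word, int from, int depth,
                               Map<String, String> pairs, List<String> out) {
        int best = -1;
        String bestKey = null;
        if (depth < MAX_DEPTH) {
            // Finish the loop over all pairs before branching, so that the
            // leftmost possible replacement is the one that gets chosen.
            for (String key : pairs.keySet()) {
                int idx = word.indexOf(key, from);
                if (idx >= 0 && (best < 0 || idx < best)) {
                    best = idx;
                    bestKey = key;
                }
            }
        }
        if (bestKey == null) {
            out.add(word);     // nothing left to replace, or depth limit hit
            return;
        }
        String value = pairs.get(bestKey);
        String replaced = word.substring(0, best) + value
                + word.substring(best + bestKey.length());
        // Branch 1: apply the replacement and continue after the new text.
        expand(replaced, best + value.length(), depth + 1, pairs, out);
        // Branch 2: skip this occurrence and continue one character later.
        expand(word, best + 1, depth + 1, pairs, out);
    }
}
```

With this limit, even a very long run of "s" yields at most 16 variants instead of growing exponentially with the input length.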
>>>>>> - It is possible that different "words to check" give the same
>>>>>> suggestion. So at some point we need to remove duplicates. I do this
>>>>>> at the end of findReplacements().
>>>>> You are right. We could probably write the same code in a slightly more
>>>>> elegant way, without converting this to a LinkedHashSet but simply by
>>>>> adding to a set when iterating the list.
>>>>>
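
The "add to a set while iterating" variant might look like the sketch below. DedupSketch and withoutDuplicates are hypothetical names, and suggestions are assumed to be plain strings here.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class DedupSketch {
    /** Returns the suggestions in their original order, first occurrence only. */
    static List<String> withoutDuplicates(List<String> suggestions) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String suggestion : suggestions) {
            // Set.add returns false if the element was already present.
            if (seen.add(suggestion)) {
                unique.add(suggestion);
            }
        }
        return unique;
    }
}
```

The order of the first occurrences is kept, so an earlier ranking of the suggestions survives the de-duplication.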
>>>>>> - The conditions around line 238 (current github version 1.7) are not
>>>>>> correct. The first isInDictionary makes the lower case conversion
>>>>>> useless:
>>>>>>
>>>>>>              if (isInDictionary(wordChecked)
>>>>>>                  && dictionaryMetadata.isConvertingCase()
>>>>>>                  && isMixedCase(wordChecked)
>>>>>>                  && isInDictionary(wordChecked
>>>>>>                      .toLowerCase(dictionaryMetadata.getLocale())))
>>>>>>
>>>>>> I think they should be something like:
>>>>>>
>>>>>>              if (isInDictionary(wordChecked)
>>>>>>                  || (dictionaryMetadata.convertCase
>>>>>>                  && isMixedCase(wordChecked)
>>>>>>                  && isInDictionary(wordChecked
>>>>>>                      .toLowerCase(dictionaryMetadata.dictionaryLocale))))
>>>>> Fixed!
>>>>>
>>>>> I tried to add your fixes but your code is now quite far away from ours,
>>>>> so diff does not give any meaningful output. Please review the code on
>>>>> github, and if needed, file an issue over changes that need to be done.
>>>>>
>>>>> Regards,
>>>>> Marcin
>>>>>
>>>>>> Regards,
>>>>>> Jaume Ortolà
>>>>>> Salutacions,
>>>>>> Jaume Ortolà
>>>>>> www.riuraueditors.cat
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2013/7/15 Marcin Miłkowski <list-addr...@wp.pl>:
>>>>>>> On 2013-07-15 10:56, Marcin Miłkowski wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Dawid just released morfologik 1.7 on Maven. So we can actually go on
>>>>>>>> and include a newer version in LT.
>>>>>>>>
>>>>>>>> The new version still does not support compounding but it has all the
>>>>>>>> features required for getting better diacritic suggestions.
>>>>>>> Here's the documentation:
>>>>>>>
>>>>>>> http://wiki.languagetool.org/hunspell-support#toc5
>>>>>>>
>>>>>>> Best,
>>>>>>> Marcin
>>>>>>>
>>>>>>>
>>>>>>>> Best,
>>>>>>>> Marcin
>>>>>>>>
>>>>>>>> On 2013-07-02 08:59, Marcin Miłkowski wrote:
>>>>>>>>> On 2013-07-02 01:11, Jaume Ortolà i Font wrote:
>>>>>>>>>> Hi Marcin,
>>>>>>>>>>
>>>>>>>>>> I have been using the still unreleased code of morfologik-stemming
>>>>>>>>>> and I have made improvements to Speller.java for some previously
>>>>>>>>>> unforeseen cases. See the attachment.
>>>>>>>>>>
>>>>>>>>>> In order to complete the development, and test & debug with all
>>>>>>>>>> languages, perhaps we could temporarily include the morfologik
>>>>>>>>>> module inside LanguageTool. This will make things easier. What do
>>>>>>>>>> you think?
>>>>>>>>> No. I should make a release, forking morfologik makes no sense to me.
>>>>>>>>>
>>>>>>>>> The only thing that stops me is the lack of time to work on compounds.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Marcin
>>>>>>>>>
>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> This SF.net email is sponsored by Windows:
>>>>>>>>>
>>>>>>>>> Build for Windows Store.
>>>>>>>>>
>>>>>>>>> http://p.sf.net/sfu/windows-dev2dev
>>>>>>>>> _______________________________________________
>>>>>>>>> Languagetool-devel mailing list
>>>>>>>>> Languagetool-devel@lists.sourceforge.net
>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>>>>>>>>>
>>>>>>>
>>>>>>> ------------------------------------------------------------------------------
>>>>>>> See everything from the browser to the database with AppDynamics
>>>>>>> Get end-to-end visibility with application monitoring from
>>>>>>> AppDynamics
>>>>>>> Isolate bottlenecks and diagnose root cause in seconds.
>>>>>>> Start your free trial of AppDynamics Pro today!
>>>>>>> http://pubads.g.doubleclick.net/gampad/clk?id=48808831&iu=/4140/ostg.clktrk
>>>>>>> _______________________________________________
>>>>>>> Languagetool-devel mailing list
>>>>>>> Languagetool-devel@lists.sourceforge.net
>>>>>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>>>>>>>
>>>>>>>
>>>>>
>>>
>>>
>>
>>
>
>
>



