Re: suggestions in Morfologik spelling rule

2013-07-16 Thread Marcin Miłkowski
On 2013-07-16 00:03, Jaume Ortolà i Font wrote:
 2013/7/15 Marcin Miłkowski list-addr...@wp.pl:
 Hi Jaume,

 On 2013-07-15 21:16, Jaume Ortolà i Font wrote:
 Hi, Marcin.

 I have tested the current code (1.8.0-SNAPSHOT) and everything is OK,
 all the changes are there. Thank you.

 Great. We'll release 1.7.1, this is just a minor bug fix.

 BTW, when you see something you want to fix, just make a fork on github
 to fix it, then file an issue, and then make a pull request associated
 with that issue. That way, it will be much easier to develop the library
 with your changes.

 I'll try to do it.

 Also, it would be good if you could find time to use a proper way of
 removing duplicates (right now we lose information from CandidateData
 that might be significant for something; I know this is me being fussy,
 as the code is quite clean).

 There are different ways to do it:
 - We could test for duplicates in addCandidate()...
 - candidates could be a Set, but then it needs to be converted to a
 List to be sorted...

Not really. We can use a TreeSet with a custom comparator:

http://stackoverflow.com/a/4165893
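
A minimal sketch of the TreeSet idea, not the actual Speller.java code. The class CandidateSketch is a hypothetical stand-in for CandidateData, and the distances are invented for illustration. The comparator breaks ties on the word itself; a comparator on distance alone would make the set drop different words that happen to share a distance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical stand-in for the CandidateData class mentioned in the thread.
final class CandidateSketch {
    final String word;
    final int distance;

    CandidateSketch(String word, int distance) {
        this.word = word;
        this.distance = distance;
    }
}

class CandidateSetSketch {
    // Order by distance first, then by the word, so that two different
    // words at the same distance are not treated as equal by the set.
    static final Comparator<CandidateSketch> BY_DISTANCE_THEN_WORD =
            Comparator.comparingInt((CandidateSketch c) -> c.distance)
                      .thenComparing((CandidateSketch c) -> c.word);

    // Returns the candidate words sorted and without exact duplicates.
    static List<String> rankedWords(List<CandidateSketch> raw) {
        TreeSet<CandidateSketch> sorted = new TreeSet<>(BY_DISTANCE_THEN_WORD);
        sorted.addAll(raw); // the set stays sorted, no List conversion needed
        List<String> words = new ArrayList<>();
        for (CandidateSketch c : sorted) {
            words.add(c.word);
        }
        return words;
    }

    public static void main(String[] args) {
        List<CandidateSketch> raw = Arrays.asList(
                new CandidateSketch("Remus", 2),
                new CandidateSketch("Rhythmus", 1),
                new CandidateSketch("Rhythmus", 1));
        System.out.println(rankedWords(raw));
    }
}
```

Note that the same word arriving with two different distances would still appear twice under this comparator; handling that case would need an extra step, such as keeping only the smallest distance per word.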


 If you want to keep the distance information outside Speller.java,
 that's a different matter.


 The next step for improving the suggestions would be to use a list of
 frequent words. I'm thinking of just a list of manually selected words
 or at most a few thousand words from a frequency dictionary.

Yes. Frequency dictionaries would be very useful.

I think we can represent frequency classes as ten percentage ranges 
encoded with 10 ASCII characters (A-J), as this would be in the tradition 
of the fsa encoding. So A would mark the most common words (like 'the' 
and 'a' in English), etc. I don't think we need a finer resolution here.

Or we could simply use a numerical percentage in its decimal (rounded) 
representation from 000 to 100. This, however, would make the dictionary 
slightly bigger.
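
A rough sketch of the two encodings discussed above, under stated assumptions: the input is a relative frequency score from 0 to 100 (100 being the most frequent), and the ten classes are written as the letters A through J. Neither the class name nor the method names come from Morfologik; they are illustrative only.

```java
import java.util.Locale;

class FrequencyClassSketch {
    // One character per word: ten equal-width percentage ranges,
    // 'A' for the most frequent words down to 'J' for the rarest.
    static char classFor(int percent) {
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be 0..100: " + percent);
        }
        int index = Math.min(9, (100 - percent) / 10);
        return (char) ('A' + index);
    }

    // The alternative: the rounded percentage as three decimal digits,
    // which costs three characters per word instead of one.
    static String threeDigits(int percent) {
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be 0..100: " + percent);
        }
        return String.format(Locale.ROOT, "%03d", percent);
    }

    public static void main(String[] args) {
        System.out.println(classFor(100) + " " + threeDigits(100));
        System.out.println(classFor(42) + " " + threeDigits(42));
    }
}
```

The single-letter form trades resolution for size, which matches the remark that a finer resolution is probably unnecessary.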

Regards,
Marcin



Re: suggestions in Morfologik spelling rule

2013-07-16 Thread R.J. Baars
Coding word frequencies as a single character is fine. I think the
classes should be logarithmic, as far as I am concerned.
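
One possible reading of "logarithmic classes", sketched under assumptions: each class covers words roughly half as frequent as the previous one, measured against the count of the most frequent word, and there are ten classes written A through J. The counts and names below are hypothetical, and the sketch ignores overflow for extremely large counts.

```java
class LogFrequencyClassSketch {
    // Class index = number of times the count can be doubled without
    // exceeding the count of the most frequent word, capped at 9.
    // This equals floor(log2(maxCount / count)) without floating point.
    static char classFor(long count, long maxCount) {
        if (count <= 0 || count > maxCount) {
            throw new IllegalArgumentException("need 0 < count <= maxCount");
        }
        int index = 0;
        long scaled = count;
        while (index < 9 && scaled * 2 <= maxCount) {
            scaled *= 2;
            index++;
        }
        return (char) ('A' + index);
    }

    public static void main(String[] args) {
        long max = 1000; // hypothetical count of the most frequent word
        System.out.println(classFor(1000, max));
        System.out.println(classFor(250, max));
        System.out.println(classFor(1, max));
    }
}
```

Compared with equal-width percentage ranges, this spreads the many rare words over more classes, since word frequencies fall off steeply.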

Ruud

Re: suggestions in Morfologik spelling rule

2013-07-16 Thread Ruud Baars
By the way, I could help with word frequencies for some languages,
e.g. Portuguese, Spanish, Dutch.

Ruud

On 16-07-13 14:20, R.J. Baars wrote:
 Coding word frequencies as a character is fine. I think it would be
 classes, logarithmic as far as I am concerned.

 Ruud

Re: suggestions in Morfologik spelling rule

2013-07-15 Thread Marcin Miłkowski
Hi,

Dawid just released morfologik 1.7 on Maven. So we can actually go on 
and include a newer version in LT.

The new version still does not support compounding but it has all the 
features required for getting better diacritic suggestions.

Best,
Marcin

On 2013-07-02 08:59, Marcin Miłkowski wrote:
 On 2013-07-02 01:11, Jaume Ortolà i Font wrote:
 Hi Marcin,

 I have been using the still unreleased code of morfologik-stemming and I
 have made improvements to Speller.java for some previously unforeseen
 cases. See the attachment.

 In order to complete the development, and test & debug with all
 languages, perhaps we could temporarily include the morfologik module
 inside LanguageTool. This will make things easier. What do you think?

 No. I should make a release, forking morfologik makes no sense to me.

 The only thing that stops me is the lack of time to work on compounds.

 Best,
 Marcin




--
See everything from the browser to the database with AppDynamics
Get end-to-end visibility with application monitoring from AppDynamics
Isolate bottlenecks and diagnose root cause in seconds.
Start your free trial of AppDynamics Pro today!
http://pubads.g.doubleclick.net/gampad/clk?id=48808831iu=/4140/ostg.clktrk
___
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel


Re: suggestions in Morfologik spelling rule

2013-07-15 Thread Marcin Miłkowski
On 2013-07-15 10:56, Marcin Miłkowski wrote:
 Hi,

 Dawid just released morfologik 1.7 on Maven. So we can actually go on
 and include a newer version in LT.

 The new version still does not support compounding but it has all the
 features required for getting better diacritic suggestions.

Here's the documentation:

http://wiki.languagetool.org/hunspell-support#toc5

Best,
Marcin




Re: suggestions in Morfologik spelling rule

2013-07-15 Thread Marcin Miłkowski
Hi again,

I had to adjust some Catalan and German tests for MorfologikSpeller. For 
German, I also added some values in one of the dictionaries so that 
better suggestions are now found.

Please review my changes.

Best regards,
Marcin



Re: suggestions in Morfologik spelling rule

2013-07-15 Thread Jaume Ortolà i Font
Thanks, Marcin.

Some remarks. The improvements I sent to the list 15 days ago have not
been added, and moreover I have found more bugs.

I attach the code I'm using now and explain briefly the reasons for the changes.

- In the getAllReplacements method we need to make sure that the
replacements are done from left to right. We must complete the
for-loop of the replacement pairs, choose the first possible
replacement (from left to right) and then start the two new branches
(with and without replacement). Otherwise, some replacements are not
done.

- If there is ss as a key in the replacement pairs, and somebody
uses a long string of s (ss...) as input text, this can
cause the method to consume all the memory, as the algorithm is
exponential (2^(number of replacements)). This happened to us in an
online server, and the LT server crashed. The depth of the recursive
algorithm should be limited to 4 or 5 levels at most.

- It is possible that different words to check give the same
suggestion. So at some point we need to remove duplicates. I do this
at the end of findReplacements().

- The conditions around line 238 (current github version 1.7) are not
correct. The first isInDictionary makes the lower case conversion
useless:

if (isInDictionary(wordChecked)
    && dictionaryMetadata.isConvertingCase()
    && isMixedCase(wordChecked)
    && isInDictionary(wordChecked.toLowerCase(dictionaryMetadata.getLocale())))

I think they should be something like:

if (isInDictionary(wordChecked)
    || (dictionaryMetadata.convertCase
        && isMixedCase(wordChecked)
        && isInDictionary(wordChecked
            .toLowerCase(dictionaryMetadata.dictionaryLocale))))
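
The first two points above (left-to-right replacement with two branches, and a bounded recursion depth) can be sketched roughly as follows. This is not the attached Speller.java; the class, the method names and the limit of 5 are assumptions for illustration, and the sketch simplifies by allowing one replacement per key.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class ReplacementBranchSketch {
    // Assumed cap on the number of branching decisions per word.
    static final int MAX_DEPTH = 5;

    static Set<String> allReplacements(String word, Map<String, String> pairs) {
        Set<String> out = new LinkedHashSet<>();
        expand(word, 0, 0, pairs, out);
        return out;
    }

    private static void expand(String word, int from, int depth,
                               Map<String, String> pairs, Set<String> out) {
        out.add(word);
        if (depth >= MAX_DEPTH) {
            return; // stop here so the number of calls stays below 2^(MAX_DEPTH + 1)
        }
        // Complete the loop over all pairs first, then pick the leftmost match.
        int bestIndex = -1;
        String bestKey = null;
        for (String key : pairs.keySet()) {
            int index = word.indexOf(key, from);
            if (index >= 0 && (bestIndex < 0 || index < bestIndex)) {
                bestIndex = index;
                bestKey = key;
            }
        }
        if (bestKey == null) {
            return;
        }
        String replacement = pairs.get(bestKey);
        String replaced = word.substring(0, bestIndex) + replacement
                + word.substring(bestIndex + bestKey.length());
        // Branch 1: apply the replacement and continue after it.
        expand(replaced, bestIndex + replacement.length(), depth + 1, pairs, out);
        // Branch 2: leave this occurrence alone and continue to its right.
        expand(word, bestIndex + 1, depth + 1, pairs, out);
    }

    public static void main(String[] args) {
        Map<String, String> pairs = Collections.singletonMap("ss", "\u00df");
        System.out.println(allReplacements("Strasse", pairs));
    }
}
```

Because every level of recursion counts toward the limit, an input consisting of a long run of "s" characters yields only a bounded number of variants instead of growing exponentially with its length.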


Regards,
Jaume Ortolà
Salutacions,
Jaume Ortolà
www.riuraueditors.cat




Speller.java
Description: Binary data


Re: suggestions in Morfologik spelling rule

2013-07-15 Thread Daniel Naber
On 15.07.2013 12:35, Marcin Miłkowski wrote:

 Please review my changes.

+assertCorrectionsByOrder(rule, "Rytmus", "Remus", "Rhythmus");

This new suggestion is not as good as the old one; Rhythmus should be 
preferred. As this is a classic, typical mistake, could we just list it 
somewhere? Like Rytmus -> Rhythmus?

Regards
  Daniel

-- 
http://www.danielnaber.de



Re: suggestions in Morfologik spelling rule

2013-07-15 Thread Jaume Ortolà i Font
2013/7/15 Daniel Naber list2...@danielnaber.de:
 On 15.07.2013 12:35, Marcin Miłkowski wrote:

 Please review my changes.

 +assertCorrectionsByOrder(rule, "Rytmus", "Remus", "Rhythmus");

 This new suggestion is not as good as the old one; Rhythmus should be
 preferred. As this is a classic, typical mistake, could we just list it
 somewhere? Like Rytmus -> Rhythmus?


With the right replacement pairs, Rhythmus comes first as expected.
They should be something like:

fsa.dict.speller.replacement-pairs=ss ß,ae ä,oe ö,ue ü,R Rh,r rh,t th

Of course, the list can be expanded...
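
To make the format of that property value concrete, here is a small sketch that splits such a string into a map from each key to its possible replacements. It is an illustration only, not Morfologik's own parser; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ReplacementPairsSketch {
    // Entries are separated by commas; within an entry, the key and its
    // replacement are separated by whitespace.
    static Map<String, List<String>> parse(String value) {
        Map<String, List<String>> pairs = new LinkedHashMap<>();
        for (String entry : value.split(",")) {
            String[] parts = entry.trim().split("\\s+");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected 'key replacement': " + entry);
            }
            // A key may map to several replacements, so collect them in a list.
            pairs.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(parts[1]);
        }
        return pairs;
    }

    public static void main(String[] args) {
        // A shortened version of the value proposed above.
        System.out.println(parse("ss \u00df,ae \u00e4,R Rh,r rh,t th"));
    }
}
```

With "R Rh" and "t th" in the list, "Rytmus" can reach "Rhythmus" through two replacements, which is why that candidate can be ranked ahead of "Remus".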

Regards,
Jaume



Re: suggestions in Morfologik spelling rule

2013-07-15 Thread Marcin Miłkowski
On 2013-07-15 12:41, Jaume Ortolà i Font wrote:
 Thanks, Marcin.

 Some remarks. The improvements I sent to the list 15 days ago have not
 been added, and moreover I have found more bugs.

I'm really sorry but there are 200 mails from the mailing list over the 
last two weeks and I have been away from my e-mail. Could you please add 
your changes as issues on github for morfologik-stemming? This way it 
would make it much easier for us to track these things.


 I attach the code I'm using now and explain briefly the reasons for the 
 changes.

 - In the getAllReplacements method we need to make sure that the
 replacements are done from left to right. We must complete the
 for-loop of the replacement pairs, choose the first possible
 replacement (from left to right) and then start the two new branches
 (with and without replacement). Otherwise, some replacements are not
 done.

OK, this sounds OK. I integrated your changes.

 - If there is ss as a key in the replacement pairs, and somebody
 uses a long string of s (ss...) as input text, this can
 cause the method to consume all the memory, as the algorithm is
 exponential (2^(number of replacements)). This happened to us in an
 online server, and the LT server crashed. The depth of the recursive
 algorithm should be limited to 4 or 5 levels at most.

Is that in getAllReplacements()?

 - It is possible that different words to check give the same
 suggestion. So at some point we need to remove duplicates. I do this
 at the end of findReplacements().

You are right. We could probably write the same code in a slightly more 
elegant way, without converting this to a LinkedHashSet but simply by 
adding to a set when iterating the list.
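
A minimal sketch of that alternative, assuming the suggestions are plain strings in ranked order; the class and method names are hypothetical and not taken from Speller.java.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DedupeSketch {
    // Walk the ranked list once and keep only the first occurrence of each
    // suggestion, so the original order is preserved.
    static List<String> withoutDuplicates(List<String> suggestions) {
        Set<String> seen = new HashSet<>();
        List<String> result = new ArrayList<>();
        for (String suggestion : suggestions) {
            if (seen.add(suggestion)) { // add() returns false for repeats
                result.add(suggestion);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(withoutDuplicates(
                Arrays.asList("Rhythmus", "Remus", "Rhythmus")));
    }
}
```

The result is the same as copying the list into a LinkedHashSet and back, but the check happens during the single pass over the list.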


 - The conditions around line 238 (current github version 1.7) are not
 correct. The first isInDictionary makes the lower case conversion
 useless:

 if (isInDictionary(wordChecked)
     && dictionaryMetadata.isConvertingCase()
     && isMixedCase(wordChecked)
     && isInDictionary(wordChecked.toLowerCase(dictionaryMetadata.getLocale())))

 I think they should be something like:

 if (isInDictionary(wordChecked)
     || (dictionaryMetadata.convertCase
         && isMixedCase(wordChecked)
         && isInDictionary(wordChecked
             .toLowerCase(dictionaryMetadata.dictionaryLocale))))

Fixed!

I tried to add your fixes but your code is now quite far away from ours, 
so diff does not give any meaningful output. Please review the code on 
github, and if needed, file an issue over changes that need to be done.

Regards,
Marcin


 Regards,
 Jaume Ortolà
 Salutacions,
 Jaume Ortolà
 www.riuraueditors.cat




Re: suggestions in Morfologik spelling rule

2013-07-15 Thread Jaume Ortolà i Font
Hi, Marcin.

I have tested the current code (1.8.0-SNAPSHOT) and everything is OK,
all the changes are there. Thank you.

Now we need a release with the changes, and we'll be able to adapt the tests.

Regards,
Jaume
Salutacions,
Jaume Ortolà
www.riuraueditors.cat



2013/7/15 Marcin Miłkowski list-addr...@wp.pl:
 W dniu 2013-07-15 12:41, Jaume Ortolà i Font pisze:
 Thanks, Marcin.

 Some remarks. The improvements I sent to the list 15 days ago have not
 been added, and moreover I have found more bugs.

 I'm really sorry but there are 200 mails from the mailing list over the
 last two weeks and I have been away from my e-mail. Could you please add
 your changes as issues on github for morfologik-stemming? This way it
 would make it much easier for us to track these things.


 I attach the code I'm using now and explain briefly the reasons for the 
 changes.

 - In the getAllReplacements method we need to make sure that the
 replacements are done from left to right. We must complete the
 for-loop of the replacement pairs, choose the first possible
 replacement (from left to right) and then start the two new branches
 (with and without replacement). Otherwise, some replacements are not
 done.

 OK, this sounds OK. I integrated your changes.

 - If there is "ss" as a key in the replacement pairs, and somebody
 uses a long string of "s" characters ("ss...") as input text, this can
 cause the method to consume all the memory, as the algorithm is
 exponential (2^(number of replacements)). This happened to us on an
 online server, and the LT server crashed. The depth of the recursive
 algorithm should be limited to 4 or 5 levels at most.

 Is that in getAllReplacements()?
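
The two points above (leftmost replacement first, then a bounded recursion) can be sketched roughly as follows. This is only an illustration, not the code in Speller.java: the class name, method names and the MAX_DEPTH constant are hypothetical, and ties between keys matching at the same position simply go to whichever key is iterated first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class ReplacementSketch {
    // Hypothetical cap; the thread suggests 4 or 5 levels at most.
    static final int MAX_DEPTH = 4;

    // Returns every variant of 'word' reachable by applying or skipping replacements.
    static List<String> allReplacements(String word, Map<String, String> pairs) {
        List<String> out = new ArrayList<>();
        expand(word, 0, 0, pairs, out);
        return out;
    }

    private static void expand(String word, int from, int depth,
                               Map<String, String> pairs, List<String> out) {
        int bestIndex = -1;
        String bestKey = null;
        if (depth < MAX_DEPTH) {
            // Complete the loop over all pairs before branching, so that the
            // leftmost possible replacement at or after 'from' is chosen.
            for (String key : pairs.keySet()) {
                int idx = word.indexOf(key, from);
                if (idx >= 0 && (bestIndex < 0 || idx < bestIndex)) {
                    bestIndex = idx;
                    bestKey = key;
                }
            }
        }
        if (bestKey == null) {
            // No further replacement possible, or depth limit reached.
            out.add(word);
            return;
        }
        String value = pairs.get(bestKey);
        // Branch 1: apply the replacement and continue after the inserted text.
        String replaced = word.substring(0, bestIndex) + value
                + word.substring(bestIndex + bestKey.length());
        expand(replaced, bestIndex + value.length(), depth + 1, pairs, out);
        // Branch 2: leave this occurrence alone and continue one character later.
        expand(word, bestIndex + 1, depth + 1, pairs, out);
    }
}
```

With the pair "b" -> "x", the word "abc" yields "axc" and "abc". With the pair "ss" -> "s" and a long run of "s", the result is capped at 2^MAX_DEPTH = 16 variants instead of growing with the input length.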

 - It is possible that different words to check give the same
 suggestion. So at some point we need to remove duplicates. I do this
 at the end of findReplacements().

 You are right. We could probably write the same code in a slightly more
 elegant way, without converting this to a LinkedHashSet but simply by
 adding to a set when iterating the list.
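
A minimal sketch of the "add to a set while iterating" idea, assuming the suggestions are plain strings at this point (the class and method names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class DedupSketch {
    // Keeps the first occurrence of each suggestion and preserves list order.
    static List<String> withoutDuplicates(List<String> suggestions) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String s : suggestions) {
            // Set.add returns false when the element is already present.
            if (seen.add(s)) {
                unique.add(s);
            }
        }
        return unique;
    }
}
```

For the input "casa", "cosa", "casa" this returns "casa", "cosa".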


 - The conditions around line 238 (current github version 1.7) are not
 correct. The first isInDictionary makes the lower case conversion
 useless:

   if (isInDictionary(wordChecked)
       && dictionaryMetadata.isConvertingCase()
       && isMixedCase(wordChecked)
       && isInDictionary(wordChecked.toLowerCase(dictionaryMetadata.getLocale())))

 I think they should be something like:

   if (isInDictionary(wordChecked)
       || (dictionaryMetadata.convertCase
           && isMixedCase(wordChecked)
           && isInDictionary(wordChecked
               .toLowerCase(dictionaryMetadata.dictionaryLocale))))

 Fixed!
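
To make the effect of the corrected condition concrete, here is a self-contained sketch. Everything in it is a stand-in: the dictionary is a plain Set rather than the FSA, and isMixedCase is simplified to "contains both an upper-case and a lower-case letter", which may well differ from the definition used in morfologik.

```java
import java.util.Locale;
import java.util.Set;

final class CaseCheckSketch {
    private final Set<String> dictionary; // hypothetical stand-in for the real dictionary
    private final boolean convertCase;
    private final Locale locale;

    CaseCheckSketch(Set<String> dictionary, boolean convertCase, Locale locale) {
        this.dictionary = dictionary;
        this.convertCase = convertCase;
        this.locale = locale;
    }

    boolean isInDictionary(String word) {
        return dictionary.contains(word);
    }

    // Simplified assumption: at least one upper-case and one lower-case letter.
    static boolean isMixedCase(String word) {
        boolean upper = false;
        boolean lower = false;
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            upper |= Character.isUpperCase(c);
            lower |= Character.isLowerCase(c);
        }
        return upper && lower;
    }

    // The word is accepted as is, OR (when case conversion is enabled)
    // its lower-cased form is accepted.
    boolean isAccepted(String word) {
        return isInDictionary(word)
                || (convertCase
                    && isMixedCase(word)
                    && isInDictionary(word.toLowerCase(locale)));
    }
}
```

With a dictionary containing only "house", the input "hoUse" is accepted when case conversion is on and rejected when it is off. If the first test were joined with a logical AND instead, the lower-case lookup could never accept a word that the plain lookup had rejected.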

 I tried to add your fixes but your code is now quite far away from ours,
 so diff does not give any meaningful output. Please review the code on
 github, and if needed, file an issue over changes that need to be done.

 Regards,
 Marcin


 Regards,
 Jaume Ortolà
 Salutacions,
 Jaume Ortolà
 www.riuraueditors.cat



 2013/7/15 Marcin Miłkowski list-addr...@wp.pl:
 On 2013-07-15 10:56, Marcin Miłkowski wrote:
 Hi,

 Dawid just released morfologik 1.7 on Maven. So we can actually go on
 and include a newer version in LT.

 The new version still does not support compounding but it has all the
 features required for getting better diacritic suggestions.

 Here's the documentation:

 http://wiki.languagetool.org/hunspell-support#toc5

 Best,
 Marcin


 Best,
 Marcin


Re: suggestions in Morfologik spelling rule

2013-07-15 Thread Marcin Miłkowski
Hi Jaume,

On 2013-07-15 21:16, Jaume Ortolà i Font wrote:
 Hi, Marcin.

 I have tested the current code (1.8.0-SNAPSHOT) and everything is OK,
 all the changes are there. Thank you.

Great. We'll release 1.7.1, this is just a minor bug fix.

BTW, when you see something you want to fix, just make a fork on github 
to fix it, then file an issue, and then make a pull request associated 
with that issue. That way, it will be much easier to develop the library 
with your changes.

Also, it would be good if you could find time to remove duplicates in a
proper way (right now we lose information from CandidateData that might
be significant for something; I know this is me being fussy, as the
current code is quite clean).

Regards,
Marcin



Re: suggestions in Morfologik spelling rule

2013-07-15 Thread Jaume Ortolà i Font
2013/7/15 Marcin Miłkowski list-addr...@wp.pl:
 Hi Jaume,

 On 2013-07-15 21:16, Jaume Ortolà i Font wrote:
 Hi, Marcin.

 I have tested the current code (1.8.0-SNAPSHOT) and everything is OK,
 all the changes are there. Thank you.

 Great. We'll release 1.7.1, this is just a minor bug fix.

 BTW, when you see something you want to fix, just make a fork on github
 to fix it, then file an issue, and then make a pull request associated
 with that issue. That way, it will be much easier to develop the library
 with your changes.

I'll try to do it.

 Also, it would be good if you could find time to remove duplicates in a
 proper way (right now we lose information from CandidateData that might
 be significant for something; I know this is me being fussy, as the
 current code is quite clean).

There are different ways to do it:
- We could test for duplicates in addCandidate()...
- candidates could be a Set, but then it needs to be converted to a
List to be sorted...

If you want to keep the distance information outside Speller.java,
that's a different matter.
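
As a rough illustration of the first option (testing for duplicates in addCandidate()), the sketch below keeps one entry per word together with its smallest distance, and converts to a list only once for sorting. The Candidate class is a hypothetical stand-in for CandidateData, and the tie-break on the word itself is an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CandidateSketch {
    // Hypothetical stand-in for the CandidateData mentioned in the thread.
    static final class Candidate {
        final String word;
        final int distance;

        Candidate(String word, int distance) {
            this.word = word;
            this.distance = distance;
        }
    }

    private final Map<String, Candidate> byWord = new HashMap<>();

    // Duplicate test at insertion time: keep the entry with the smaller distance.
    void addCandidate(String word, int distance) {
        Candidate known = byWord.get(word);
        if (known == null || distance < known.distance) {
            byWord.put(word, new Candidate(word, distance));
        }
    }

    // Sort by distance first, then alphabetically to keep the order deterministic.
    List<Candidate> sorted() {
        List<Candidate> list = new ArrayList<>(byWord.values());
        list.sort(Comparator.comparingInt((Candidate c) -> c.distance)
                .thenComparing(c -> c.word));
        return list;
    }
}
```

Adding ("casa", 2), ("cosa", 1) and ("casa", 1) leaves two candidates, "casa" and "cosa", both at distance 1, so the distance information survives the de-duplication.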


The next step for improving the suggestions would be to use a list of
frequent words. I'm thinking of just a list of manually selected words
or at most a few thousand words from a frequency dictionary.
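
One possible reading of this idea, sketched under the assumption that the frequent words are available as a plain Set and that suggestions arrive already ordered by distance (names are illustrative only):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

final class FrequencySketch {
    // Moves suggestions found in the frequent-word list ahead of the others.
    static List<String> promoteFrequent(List<String> suggestions, Set<String> frequentWords) {
        List<String> ranked = new ArrayList<>(suggestions);
        // The key is false for frequent words and true otherwise; false sorts first.
        // List.sort is stable, so the original order is kept within each group.
        ranked.sort(Comparator.comparing((String s) -> !frequentWords.contains(s)));
        return ranked;
    }
}
```

Given the suggestions "cassa", "casa", "cosa" and a frequent-word list containing only "casa", the result is "casa", "cassa", "cosa".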

Regards,
Jaume


 Regards,
 Marcin


 On 2013-07-02 08:59, Marcin Miłkowski wrote:
 On 2013-07-02 01:11, Jaume Ortolà i Font wrote:
 Hi Marcin,

 I have been using the still unreleased code of morfologik-stemming and I
 have made improvements to Speller.java for some previously unforeseen
 cases. See