Re: suggestions in Morfologik spelling rule
Speaking of frequency lists, could we use Google n-grams? The license is
Creative Commons Attribution 3.0 Unported. I don't know how this would apply
to a derivative work -- a hunspell dictionary, basically LGPL + MPL, plus
this one = ?

Marcin

On 2013-07-16 16:32, Ruud Baars wrote:
> By the way, I could help with word frequencies for some languages,
> e.g. Portuguese, Spanish, Dutch.
>
> Ruud
Re: suggestions in Morfologik spelling rule
By the way, I could help with word frequencies for some languages,
e.g. Portuguese, Spanish, Dutch.

Ruud

On 16-07-13 14:20, R.J. Baars wrote:
> Coding word frequencies as a character is fine. I think it would be
> classes, logarithmic as far as I am concerned.
>
> Ruud
Re: suggestions in Morfologik spelling rule
Coding word frequencies as a character is fine. I think it would be
classes, logarithmic as far as I am concerned.

Ruud

> On 2013-07-16 00:03, Jaume Ortolà i Font wrote:
>> The next step for improving the suggestions would be to use a list of
>> frequent words. I'm thinking of just a list of manually selected words
>> or at most a few thousand words from a frequency dictionary.
>
> Yes. Frequency dictionaries would be very useful.
>
> I think we can represent frequency classes as ten ranges of percentages
> with 10 ASCII characters (A-K), as this would be in the tradition of the
> fsa encoding. So "A" would be the most common words (like 'the' and 'a'
> in English), etc. I think we don't need to have a better resolution here.
>
> Or we could simply use a numerical percentage in its decimal (rounded)
> representation from 000 to 100. This, however, would make the dictionary
> slightly bigger.
>
> Regards,
> Marcin
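A note on the logarithmic classes suggested above: the following is only a sketch of one way to map raw corpus counts to a single-character class, not code from morfologik. The class name FrequencyClasses, the method encode() and the choice of ten letters 'A' to 'J' are assumptions made for illustration (the thread writes "A-K", which would be eleven letters if taken literally).

```java
// Hypothetical helper, not part of morfologik: maps a word's corpus count
// to one of ten logarithmic frequency classes, encoded as a single letter.
final class FrequencyClasses {
    static final int CLASS_COUNT = 10; // 'A' (most frequent) .. 'J' (rarest)

    /**
     * @param count    occurrences of the word in the corpus (>= 1)
     * @param maxCount occurrences of the most frequent word (>= count)
     */
    static char encode(long count, long maxCount) {
        if (count < 1 || maxCount < count) {
            throw new IllegalArgumentException("need 1 <= count <= maxCount");
        }
        if (maxCount == 1) {
            return 'A'; // degenerate corpus: every word is equally frequent
        }
        // Position on a log scale: 0.0 for the most frequent word,
        // 1.0 for a word that occurs exactly once.
        double position = 1.0 - Math.log(count) / Math.log(maxCount);
        int cls = Math.min(CLASS_COUNT - 1, (int) (position * CLASS_COUNT));
        return (char) ('A' + cls);
    }
}
```

With a logarithmic scale, a handful of very common words land in the first classes, while the long tail of rare words is spread over the remaining letters instead of collapsing into the last one.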
Re: suggestions in Morfologik spelling rule
On 2013-07-16 00:03, Jaume Ortolà i Font wrote:
> 2013/7/15 Marcin Miłkowski:
>> Also, if you'll find time to use a proper way of removing duplicates
>> (now we lose information from CandidateData that might be significant
>> for something - I know this is me being fussy, this is quite clean).
>
> There are different ways to do it:
> - We could test for duplicates in addCandidate()...
> - "candidates" could be a Set, but then it needs to be converted to a
>   List to be sorted...

Not really. We can use a TreeSet with a custom comparator:

http://stackoverflow.com/a/4165893

> If you want to keep the distance information outside Speller.java,
> that's a different matter.
>
> The next step for improving the suggestions would be to use a list of
> frequent words. I'm thinking of just a list of manually selected words
> or at most a few thousand words from a frequency dictionary.

Yes. Frequency dictionaries would be very useful.

I think we can represent frequency classes as ten ranges of percentages
with 10 ASCII characters (A-K), as this would be in the tradition of the
fsa encoding. So "A" would be the most common words (like 'the' and 'a'
in English), etc. I think we don't need to have a better resolution here.

Or we could simply use a numerical percentage in its decimal (rounded)
representation from 000 to 100. This, however, would make the dictionary
slightly bigger.

Regards,
Marcin
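To illustrate the TreeSet idea from this message, here is a minimal sketch. It is not the Speller.java code: the Candidate class is a stand-in for CandidateData with assumed fields (word, distance), and the names SortedCandidates and sortedUnique() are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

final class SortedCandidates {
    /** Stand-in for the speller's CandidateData; the fields are assumptions. */
    static final class Candidate {
        final String word;
        final int distance;

        Candidate(String word, int distance) {
            this.word = word;
            this.distance = distance;
        }
    }

    // Order by edit distance first, then alphabetically. A TreeSet treats two
    // elements as duplicates when the comparator returns 0, so the word must
    // be part of the ordering; otherwise distinct words at the same distance
    // would silently be dropped.
    static final Comparator<Candidate> BY_DISTANCE_THEN_WORD =
            Comparator.comparingInt((Candidate c) -> c.distance)
                      .thenComparing((Candidate c) -> c.word);

    /** Returns the candidate words sorted by distance, without exact duplicates. */
    static List<String> sortedUnique(List<Candidate> candidates) {
        TreeSet<Candidate> set = new TreeSet<>(BY_DISTANCE_THEN_WORD);
        set.addAll(candidates);
        List<String> words = new ArrayList<>();
        for (Candidate c : set) {
            words.add(c.word);
        }
        return words;
    }
}
```

One caveat with this approach: the same word arriving with two different distances counts as two different elements, so it would still appear twice. Handling that case needs a lookup keyed by the word alone.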
Re: suggestions in Morfologik spelling rule
2013/7/15 Marcin Miłkowski:
> BTW, when you see something you want to fix, just make a fork on github
> to fix it, then file an issue, and then make a pull request associated
> with that issue. That way, it will be much easier to develop the library
> with your changes.

I'll try to do it.

> Also, if you'll find time to use a proper way of removing duplicates
> (now we lose information from CandidateData that might be significant
> for something - I know this is me being fussy, this is quite clean).

There are different ways to do it:
- We could test for duplicates in addCandidate()...
- "candidates" could be a Set, but then it needs to be converted to a
  List to be sorted...

If you want to keep the distance information outside Speller.java,
that's a different matter.

The next step for improving the suggestions would be to use a list of
frequent words. I'm thinking of just a list of manually selected words
or at most a few thousand words from a frequency dictionary.

Regards,
Jaume
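The first option in this message, testing for duplicates at the moment a candidate is added, could be sketched as follows. This is an illustration only: CandidateCollector and its methods are hypothetical names, and the real addCandidate() in Speller.java may take different arguments. The sketch keeps the smallest distance seen for each word, so the distance information is not lost when duplicates are merged.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CandidateCollector {
    // word -> smallest edit distance seen so far for that word
    private final Map<String, Integer> best = new HashMap<>();

    /** Hypothetical counterpart of addCandidate(): keeps the closest copy of each word. */
    void addCandidate(String word, int distance) {
        best.merge(word, distance, Math::min);
    }

    /** Returns the words ordered by distance, ties broken alphabetically. */
    List<String> sortedWords() {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(best.entrySet());
        entries.sort((a, b) -> {
            int byDistance = Integer.compare(a.getValue(), b.getValue());
            return byDistance != 0 ? byDistance : a.getKey().compareTo(b.getKey());
        });
        List<String> words = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            words.add(e.getKey());
        }
        return words;
    }
}
```

The list is built and sorted once, at the end, so the conversion from a set-like structure to a sorted List happens in a single place.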
Re: suggestions in Morfologik spelling rule
Hi Jaume,

On 2013-07-15 21:16, Jaume Ortolà i Font wrote:
> Hi, Marcin.
>
> I have tested the current code (1.8.0-SNAPSHOT) and everything is OK,
> all the changes are there. Thank you.

Great. We'll release 1.7.1, this is just a minor bug fix.

BTW, when you see something you want to fix, just make a fork on github
to fix it, then file an issue, and then make a pull request associated
with that issue. That way, it will be much easier to develop the library
with your changes.

Also, if you'll find time to use a proper way of removing duplicates
(now we lose information from CandidateData that might be significant
for something - I know this is me being fussy, this is quite clean).

Regards,
Marcin

> Now we need a release with the changes, and we'll be able to adapt the
> tests.
>
> Regards,
> Jaume
Re: suggestions in Morfologik spelling rule
Hi, Marcin.

I have tested the current code (1.8.0-SNAPSHOT) and everything is OK,
all the changes are there. Thank you.

Now we need a release with the changes, and we'll be able to adapt the
tests.

Regards,
Jaume

Salutacions,
Jaume Ortolà
www.riuraueditors.cat
Re: suggestions in Morfologik spelling rule
On 2013-07-15 12:41, Jaume Ortolà i Font wrote:
> Thanks, Marcin.
>
> Some remarks. The improvements I sent to the list 15 days ago have not
> been added, and moreover I have found more bugs.

I'm really sorry, but there have been 200 mails on the mailing list over
the last two weeks and I have been away from my e-mail. Could you please
add your changes as issues on github for morfologik-stemming? That would
make it much easier for us to track these things.

> I attach the code I'm using now and explain briefly the reasons for the
> changes.
>
> - In the getAllReplacements method we need to make sure that the
>   replacements are done from left to right. We must complete the
>   for-loop of the replacement pairs, choose the first possible
>   replacement (from left to right) and then start the two new branches
>   (with and without replacement). Otherwise, some replacements are not
>   done.

OK, this sounds OK. I integrated your changes.

> - If there is "ss" as a key in the replacement pairs, and somebody
>   uses a long string of s ("ss...") as input text, this can
>   cause the method to consume all the memory, as the algorithm is
>   exponential (2^(number of replacements)). This happened to us in an
>   online server, and the LT server crashed. The depth of the recursive
>   algorithm should be limited to 4 or 5 levels at most.

Is that in getAllReplacements()?

> - It is possible that different "words to check" give the same
>   suggestion. So at some point we need to remove duplicates. I do this
>   at the end of findReplacements().

You are right. We could probably write the same code in a slightly more
elegant way, without converting this to a LinkedHashSet but simply by
adding to a set when iterating the list.

> - The conditions around line 238 (current github version 1.7) are not
>   correct. The first isInDictionary makes the lower case conversion
>   useless:
>
>   if (isInDictionary(wordChecked)
>       && dictionaryMetadata.isConvertingCase()
>       && isMixedCase(wordChecked)
>       && isInDictionary(wordChecked.toLowerCase(dictionaryMetadata.getLocale())))
>
> I think they should be something like:
>
>   if (isInDictionary(wordChecked)
>       || (dictionaryMetadata.convertCase
>           && isMixedCase(wordChecked)
>           && isInDictionary(wordChecked
>               .toLowerCase(dictionaryMetadata.dictionaryLocale))))

Fixed!

I tried to add your fixes but your code is now quite far away from ours,
so diff does not give any meaningful output. Please review the code on
github, and if needed, file an issue over changes that need to be done.

Regards,
Marcin
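The "add to a set while iterating the list" idea from the reply above could look roughly like the sketch below. It is only an illustration: the class and method names are hypothetical and are not taken from Speller.java, and it works on plain strings rather than on CandidateData.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DedupSketch {
    /** Returns the suggestions in their original order, dropping later repeats. */
    static List<String> removeDuplicates(List<String> suggestions) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String s : suggestions) {
            // Set.add returns false when the element is already present,
            // so only the first occurrence of each suggestion is kept.
            if (seen.add(s)) {
                unique.add(s);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("Rhythmus", "Remus", "Rhythmus", "Ritus", "Remus");
        System.out.println(removeDuplicates(raw));
    }
}
```

Because the list is filtered in a single pass, the ranking order of the remaining suggestions is untouched and no intermediate LinkedHashSet has to be converted back into a list.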
Re: suggestions in Morfologik spelling rule
On 15.07.2013 15:41, Jaume Ortolà i Font wrote:
> With the right replacement pairs, "Rhythmus" comes first as expected.
> They should be something like:

I see, thanks. The problem is in the weird magic I do in
CompoundAwareHunspellRule. I will try to fix this.

Regards
 Daniel

-- 
http://www.danielnaber.de

___
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
Re: suggestions in Morfologik spelling rule
2013/7/15 Daniel Naber:
> On 15.07.2013 12:35, Marcin Miłkowski wrote:
>
>> Please review my changes.
>
> +assertCorrectionsByOrder(rule, "Rytmus", "Remus", "Rhythmus");
>
> This new suggestion is not as good as the old one, "Rhythmus" should be
> preferred. As this is a classical/typical mistake, could we just list it
> somewhere? Like "Rytmus -> Rhythmus"?

With the right replacement pairs, "Rhythmus" comes first as expected.
They should be something like:

fsa.dict.speller.replacement-pairs=ss ß,ae ä,oe ö,ue ü,R Rh,r rh,t th

Of course, the list can be expanded...

Regards,
Jaume
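To make the effect of the replacement pairs concrete, here is a small sketch that splits a value in the format shown above (comma-separated items, key and replacement separated by a space) and applies two of the pairs to "Rytmus". The helper names are hypothetical and this is not how morfologik itself reads the property; it only shows why "R Rh" and "t th" make "Rhythmus" reachable.

```java
import java.util.ArrayList;
import java.util.List;

class ReplacementPairsSketch {
    /** Splits a value such as "R Rh,t th" into {key, replacement} pairs. */
    static List<String[]> parsePairs(String value) {
        List<String[]> pairs = new ArrayList<>();
        for (String item : value.split(",")) {
            String[] kv = item.trim().split(" ", 2);
            if (kv.length == 2) {
                pairs.add(kv);
            }
        }
        return pairs;
    }

    /** Applies one pair at the first (leftmost) position where its key occurs. */
    static String applyFirst(String word, String key, String replacement) {
        int i = word.indexOf(key);
        if (i < 0) {
            return word; // key not present, nothing to do
        }
        return word.substring(0, i) + replacement + word.substring(i + key.length());
    }

    public static void main(String[] args) {
        List<String[]> pairs = parsePairs("ss ß,ae ä,oe ö,ue ü,R Rh,r rh,t th");
        String step1 = applyFirst("Rytmus", "R", "Rh"); // Rhytmus
        String step2 = applyFirst(step1, "t", "th");    // Rhythmus
        System.out.println(pairs.size() + " pairs; " + step1 + " -> " + step2);
    }
}
```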
Re: suggestions in Morfologik spelling rule
On 15.07.2013 12:35, Marcin Miłkowski wrote:
> I had to adjust some Catalan and German tests for MorfologikSpeller.
> For German, I also added some values in one of the dictionaries so that
> better suggestions are now found.
>
> Please review my changes.

Setting fsa.dict.speller.runon-words=false makes a test fail, namely
GermanSpellerRuleTest.testGetSuggestions(). So is there a reason to set
that to false?

Regards
 Daniel

-- 
http://www.danielnaber.de
Re: suggestions in Morfologik spelling rule
On 15.07.2013 12:35, Marcin Miłkowski wrote:
> Please review my changes.

+assertCorrectionsByOrder(rule, "Rytmus", "Remus", "Rhythmus");

This new suggestion is not as good as the old one, "Rhythmus" should be
preferred. As this is a classical/typical mistake, could we just list it
somewhere? Like "Rytmus -> Rhythmus"?

Regards
 Daniel

-- 
http://www.danielnaber.de
Re: suggestions in Morfologik spelling rule
Thanks, Marcin.

Some remarks. The improvements I sent to the list 15 days ago have not
been added, and moreover I have found more bugs.

I attach the code I'm using now and explain briefly the reasons for the
changes.

- In the getAllReplacements method we need to make sure that the
  replacements are done from left to right. We must complete the
  for-loop of the replacement pairs, choose the first possible
  replacement (from left to right) and then start the two new branches
  (with and without replacement). Otherwise, some replacements are not
  done.

- If there is "ss" as a key in the replacement pairs, and somebody
  uses a long string of s ("ss...") as input text, this can
  cause the method to consume all the memory, as the algorithm is
  exponential (2^(number of replacements)). This happened to us in an
  online server, and the LT server crashed. The depth of the recursive
  algorithm should be limited to 4 or 5 levels at most.

- It is possible that different "words to check" give the same
  suggestion. So at some point we need to remove duplicates. I do this
  at the end of findReplacements().

- The conditions around line 238 (current github version 1.7) are not
  correct. The first isInDictionary makes the lower case conversion
  useless:

  if (isInDictionary(wordChecked)
      && dictionaryMetadata.isConvertingCase()
      && isMixedCase(wordChecked)
      && isInDictionary(wordChecked.toLowerCase(dictionaryMetadata.getLocale())))

  I think they should be something like:

  if (isInDictionary(wordChecked)
      || (dictionaryMetadata.convertCase
          && isMixedCase(wordChecked)
          && isInDictionary(wordChecked
              .toLowerCase(dictionaryMetadata.dictionaryLocale))))

Regards,
Jaume Ortolà
www.riuraueditors.cat

Speller.java
Description: Binary data
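The first two points above (leftmost replacement first, then two branches, with a cap on recursion depth) can be sketched as follows. This is a minimal illustration under assumptions, not the attached Speller.java: the class name, the MAX_DEPTH constant and the choice to count both branches toward the depth limit are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class BoundedReplacementsSketch {
    // Hypothetical cap, following the "4 or 5 levels" suggestion in the message.
    static final int MAX_DEPTH = 4;

    static Set<String> allReplacements(String word, Map<String, String> pairs) {
        Set<String> out = new LinkedHashSet<>();
        expand(word, 0, 0, pairs, out);
        return out;
    }

    private static void expand(String word, int from, int depth,
                               Map<String, String> pairs, Set<String> out) {
        out.add(word);
        if (depth >= MAX_DEPTH) {
            return; // bounds the recursion tree to at most 2^(MAX_DEPTH+1) - 1 calls
        }
        // Finish the loop over all pairs before branching, keeping the leftmost
        // match at or after 'from'; on a tie the pair seen first wins.
        int best = -1;
        String bestKey = null;
        for (String key : pairs.keySet()) {
            int i = word.indexOf(key, from);
            if (i >= 0 && (best < 0 || i < best)) {
                best = i;
                bestKey = key;
            }
        }
        if (best < 0) {
            return; // nothing left to replace
        }
        String value = pairs.get(bestKey);
        String replaced = word.substring(0, best) + value
                + word.substring(best + bestKey.length());
        // Branch 1: apply the replacement and continue after the inserted text.
        expand(replaced, best + value.length(), depth + 1, pairs, out);
        // Branch 2: skip this match and continue one character further right.
        expand(word, best + 1, depth + 1, pairs, out);
    }

    public static void main(String[] args) {
        Map<String, String> pairs = new LinkedHashMap<>();
        pairs.put("R", "Rh");
        pairs.put("t", "th");
        System.out.println(allReplacements("Rytmus", pairs));

        // A long run of "s" no longer blows up: the result stays small.
        StringBuilder longS = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            longS.append('s');
        }
        Map<String, String> ss = new LinkedHashMap<>();
        ss.put("ss", "ß");
        System.out.println(allReplacements(longS.toString(), ss).size());
    }
}
```

Counting the "skip" branch toward the depth is a deliberately blunt choice: it guarantees a fixed upper bound on work for any input, at the cost of missing replacements that sit far to the right in a long word.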
Re: suggestions in Morfologik spelling rule
Hi again,

I had to adjust some Catalan and German tests for MorfologikSpeller. For
German, I also added some values in one of the dictionaries so that
better suggestions are now found.

Please review my changes.

Best regards,
Marcin
Re: suggestions in Morfologik spelling rule
On 2013-07-15 10:56, Marcin Miłkowski wrote:
> Hi,
>
> Dawid just released morfologik 1.7 on Maven. So we can actually go on
> and include a newer version in LT.
>
> The new version still does not support compounding but it has all the
> features required for getting better diacritic suggestions.

Here's the documentation:

http://wiki.languagetool.org/hunspell-support#toc5

Best,
Marcin
Re: suggestions in Morfologik spelling rule
Hi,

Dawid just released morfologik 1.7 on Maven. So we can actually go on
and include a newer version in LT.

The new version still does not support compounding but it has all the
features required for getting better diacritic suggestions.

Best,
Marcin
Re: suggestions in Morfologik spelling rule
On 2013-07-02 01:11, Jaume Ortolà i Font wrote:
> Hi Marcin,
>
> I have been using the still unreleased code of morfologik-stemming and I
> have made improvements to Speller.java for some previously unforeseen
> cases. See the attachment.
>
> In order to complete the development, and test & debug with all
> languages, perhaps we could temporarily include the morfologik module
> inside LanguageTool. This will make things easier. What do you think?

No. I should make a release, forking morfologik makes no sense to me.

The only thing that stops me is the lack of time to work on compounds.

Best,
Marcin
suggestions in Morfologik spelling rule
Hi Marcin,

I have been using the still unreleased code of morfologik-stemming and I
have made improvements to Speller.java for some previously unforeseen
cases. See the attachment.

In order to complete the development, and test & debug with all
languages, perhaps we could temporarily include the morfologik module
inside LanguageTool. This will make things easier. What do you think?

Regards,
Jaume Ortolà

Speller.java
Description: Binary data