Re: new language
An artificial language. Okay. That explains why it is not in Ethnologue.

Ruud

On 28-01-15 at 15:02, Буковинець wrote:
Hi R.Baars,
Wednesday, January 28, 2015, 2:07:14 PM, you wrote:
What is the name of that language in English, and what is its ISO language code?
Sensar has no ISO code, because it is a new language and is still in the process of being created.

___
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
Re: new language
What is the name of that language in English, and what is its ISO language code?

On 28-01-15 at 12:45, Буковинець wrote:
Hi Daniel,
Wednesday, January 28, 2015, 12:59:29 AM, you wrote:
Great, welcome to LanguageTool! Which new language would you like to write rules for? You can actually just start writing rules by adding them to the grammar.xml file of a different language, as described at http://wiki.languagetool.org/development-overview. You won't be able to use advanced features like part-of-speech tagging, but it should be good enough to get started.
Regards, Daniel
I would like to write rules for Сенсар: http://ar25.org/article/sensar-magichna-mova-novoyi-rasy.html
OK, I will try adding rules to the grammar.xml file of Ukrainian.
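For anyone following along, the basic shape of a grammar.xml rule looks roughly like the sketch below. The rule id, tokens, and message are invented for illustration; only the element names (rule, pattern, token, message, suggestion) are the ones grammar.xml actually uses, and a real rule would also carry example sentences and a surrounding category element.

```xml
<!-- Hypothetical rule: flags "foo bar" and suggests "foobar" -->
<rule id="EXAMPLE_FOO_BAR" name="Example: foo bar">
  <pattern>
    <token>foo</token>
    <token>bar</token>
  </pattern>
  <message>Did you mean <suggestion>foobar</suggestion>?</message>
</rule>
```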
Remark on spell check quality
I have been looking into a lot of spell-check files (Hunspell) lately. Most of them do not take case into account, so most will accept a word like 'th' because the chemical element 'Th' is listed (and the same goes for all the other element symbols). When converting a spell-check file, it may be worth checking whether those errors carry over to the LT checker as well.

Ruud
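The problem Ruud describes can be made concrete with a small sketch (the function name and word list are invented for illustration): given a dictionary, find capitalized-only entries whose lowercase form a case-insensitive speller would wrongly accept.

```python
# Sketch: find dictionary entries that are only valid capitalized
# (e.g. chemical symbols like "Th"), whose lowercase forms a
# case-insensitive speller would wrongly accept ("th").
def risky_lowercase_forms(wordlist):
    words = set(wordlist)
    risky = []
    for w in words:
        # capitalized entry with no independent lowercase entry
        if w[:1].isupper() and w.lower() not in words:
            risky.append(w.lower())
    return sorted(risky)
```

Running this over a converted Hunspell word list would give a first estimate of how many such case holes the LT checker inherits.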
Re: spell checker enhancement
I know it would be simple to generate an ignore rule like this, and I will probably do that as soon as they pop up in the frequency table.

Ruud

On 16-09-14 at 12:01, Marcin Miłkowski wrote:
On 2014-09-16 at 11:21, R.J. Baars wrote:
Marcin, we don't agree. There is a spell checker, but also a single-word ignore list for it.
Yes, but for multi-words we'd have to use the disambiguator code internally anyway. You are asking for yet another notation of the same thing. Notice also that no spell checker will propose 'Tel Aviv' for 'Aviv'. You need an XML rule for that. A simple one, to be sure, but still an XML rule. I think it's pretty trivial to go through a list of such words and create parallel lists of ignore-spelling disambiguation rules and missing-part grammar rules.
Regards, Marcin
There are XML rules, but also a SimpleReplace rule and a compounding rule. So apart from the hammer and the screwdriver, there are more tools. But anyway, adding the most frequent ones to the disambiguator works. Getting rid of wrong POS tags and of the 10% reported possible spelling errors on the entire corpus is a higher priority. And fixing false positives. Having almost doubled the number of rules is enough for this month.
Ruud
On 2014-09-16 at 09:03, R.J. Baars wrote:
A word like 'Aviv' is not correct unless 'Tel' comes before it, so it is best to leave 'Tel' and 'Aviv' out of the spell checker. That results in the spell checker reporting errors for 'Aviv'. In the disambiguator, there is the option to block that by making an immunizing rule:

<!-- Tel Aviv -->
<rule id="TEL_AVIV" name="Tel Aviv">
  <pattern>
    <token>Tel</token>
    <token>Aviv</token>
  </pattern>
  <disambig action="ignore_spelling"/>
</rule>

That works perfectly. But there are a lot of these word combinations. Wouldn't it be better to have a multi-word ignore list for the spell checker?
(Or even a multi-word spell checker that does not just know 'correct' and 'not in list', but 'correct', 'incorrect' and 'not in list'?)
It would not be an enhancement, as this would not give new functionality but cripple the existing one. Also, the ability to use the full XML syntax is extremely important to me (I use POS tags and regular expressions), so I wouldn't make use of the multi-word spell checker anyway. So we'd have to introduce a crippled syntax that would look a little bit different to a human being but with no meaningful functional change. I don't think it's worth our time. The spell checker is best for checking individual words. Just like a hammer, it's good for nails, not for screws. For screws, we have a screwdriver. For multi-word entities, we have more refined tools, like tagging, disambiguation and special attributes.
Best, Marcin
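The multi-word ignore list Ruud asks for boils down to marking token spans that form a known phrase as immune to spell checking. A minimal sketch of that idea (function and data names invented; the real LT mechanism is the ignore_spelling disambiguation action discussed above):

```python
# Sketch of the proposed multi-word ignore list: token positions that
# form a known phrase (e.g. "Tel Aviv") are marked immune, so the
# speller skips them even though neither word is in the dictionary.
def immune_positions(tokens, phrases):
    # phrases: list of token tuples, e.g. [("Tel", "Aviv")]
    immune = set()
    for i in range(len(tokens)):
        for p in phrases:
            if tuple(tokens[i:i + len(p)]) == p:
                immune.update(range(i, i + len(p)))
    return immune
```

Such a list would be easy to maintain as a plain text file, which is essentially what Jaume's multiwords.txt (discussed below in the thread) provides.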
Re: spell checker enhancement
I see. This is probably of no use for spell checking, but it is for POS tagging. Does 'Abu Dhabi NPCNG00' cause both words to be tagged with that tag, or are they considered one token with that POS tag? (Might come in handy for just this tagging.)

Ruud

On 16-09-14 at 12:56, Jaume Ortolà i Font wrote:
Hi, Ruud. I can't find any documentation. It is used in Polish, French, Catalan, Russian, Ukrainian and Spanish. Implementation: enable it (Java) and create a multiwords.txt in your resources folder like these [1]. The tokens are separated by white space, and the tag is separated by a tab. Result: the first token of the multiword is tagged with <POSTAG> and the last token is tagged with </POSTAG>. The MultiwordChunker is case-insensitive. I would like to make it configurable, especially for first-letter uppercase.
Regards, Jaume
[1] https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/pl/src/main/resources/org/languagetool/resource/pl/multiwords.txt
https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/ca/src/main/resources/org/languagetool/resource/ca/multiwords.txt
2014-09-16 12:33 GMT+02:00 R.Baars baar...@xs4all.nl:
Jaume, thanks, but I am not sure; it depends on its implementation, I think. Where can I find more info?
Ruud
On 16-09-14 at 12:26, Jaume Ortolà i Font wrote:
2014-09-16 11:21 GMT+02:00 R.J. Baars r.j.ba...@xs4all.nl:
We don't agree. There is a spell checker, but also a single-word ignore list for it. There are XML rules, but also a SimpleReplace rule and a compounding rule. So apart from the hammer and the screwdriver, there are more tools.
There is indeed another tool for multi-words. It seems that Ruud doesn't know it. We can enable a HybridDisambiguator and add a MultiwordChunker to the disambiguation. With this you can write a list of multi-words with their corresponding tag in a plain text file (multiwords.txt).
I use the MultiwordChunker with two objectives: improving disambiguation and avoiding spelling matches in multi-words. Would it be useful for you, Ruud?
Regards, Jaume
But anyway, adding the most frequent ones to the disambiguator works. Getting rid of wrong POS tags and of the 10% reported possible spelling errors on the entire corpus is a higher priority. And fixing false positives. Having almost doubled the number of rules is enough for this month.
Ruud
On 2014-09-16 at 09:03, R.J. Baars wrote:
A word like 'Aviv' is not correct unless 'Tel' comes before it, so it is best to leave 'Tel' and 'Aviv' out of the spell checker. That results in the spell checker reporting errors for 'Aviv'. In the disambiguator, there is the option to block that by making an immunizing rule:

<!-- Tel Aviv -->
<rule id="TEL_AVIV" name="Tel Aviv">
  <pattern>
    <token>Tel</token>
    <token>Aviv</token>
  </pattern>
  <disambig action="ignore_spelling"/>
</rule>

That works perfectly. But there are a lot of these word combinations. Wouldn't it be better to have a multi-word ignore list for the spell checker? (Or even a multi-word spell checker that does not just know 'correct' and 'not in list', but 'correct', 'incorrect' and 'not in list'?)
It would not be an enhancement, as this would not give new functionality but cripple the existing one. Also, the ability to use the full XML syntax is extremely important to me (I use POS tags and regular expressions), so I wouldn't make use of the multi-word spell checker anyway. So we'd have to introduce a crippled syntax that would look a little bit different to a human being but with no meaningful functional change. I don't think it's worth our time. The spell checker is best for checking individual words. Just like a hammer, it's good for nails, not for screws. For screws, we have a screwdriver. For multi-word entities, we have more refined tools, like tagging, disambiguation and special attributes.
Best, Marcin
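Jaume's description of the MultiwordChunker can be modeled in a few lines (a rough sketch, not the actual Java implementation): the first token of a matched multi-word gets the opening tag, the last token gets the closing tag, inner tokens stay untagged, and matching is case-insensitive.

```python
# Rough model of the MultiwordChunker behaviour described above:
# first token of a matched phrase -> <TAG>, last token -> </TAG>,
# inner tokens untagged; matching is case-insensitive.
def chunk_tags(tokens, multiwords):
    # multiwords: dict mapping lowercase phrase tuple -> POS tag,
    # e.g. {("abu", "dhabi"): "NPCNG00"}
    tags = [None] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for i in range(len(tokens)):
        for phrase, tag in multiwords.items():
            n = len(phrase)
            if n >= 2 and tuple(lowered[i:i + n]) == phrase:
                tags[i] = "<%s>" % tag
                tags[i + n - 1] = "</%s>" % tag
    return tags
```

This matches Jaume's example: 'Abu Dhabi' with tag NPCNG00 yields one token tagged <NPCNG00> and one tagged </NPCNG00>, while tokenization itself is unchanged.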
Re: spell checker enhancement
How is that done?

Ruud

On 16-09-14 at 13:23, Jaume Ortolà i Font wrote:
2014-09-16 13:03 GMT+02:00 R.Baars baar...@xs4all.nl:
I see. This is probably of no use for spell checking, but it is for POS tagging.
It gives no suggestions, but it can be used for avoiding false positives in spell checking, if you configure it so that tagged words are ignored.
Does 'Abu Dhabi NPCNG00' cause both words to be tagged with that tag, or are they considered one token with that POS tag?
Tokenization is not changed. In this case:

<token postag="NPCNG00">Abu</token>
<token postag="/NPCNG00">Dhabi</token>

If there are more than two tokens, the inside tokens are not tagged. Perhaps this should be optionally changed (i.e., tag the inside tokens too).
Regards, Jaume
(Might come in handy for just this tagging.)
Ruud
On 16-09-14 at 12:56, Jaume Ortolà i Font wrote:
Hi, Ruud. I can't find any documentation. It is used in Polish, French, Catalan, Russian, Ukrainian and Spanish. Implementation: enable it (Java) and create a multiwords.txt in your resources folder like these [1]. The tokens are separated by white space, and the tag is separated by a tab. Result: the first token of the multiword is tagged with <POSTAG> and the last token is tagged with </POSTAG>. The MultiwordChunker is case-insensitive. I would like to make it configurable, especially for first-letter uppercase.
Regards, Jaume
[1] https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/pl/src/main/resources/org/languagetool/resource/pl/multiwords.txt
https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/ca/src/main/resources/org/languagetool/resource/ca/multiwords.txt
2014-09-16 12:33 GMT+02:00 R.Baars baar...@xs4all.nl:
Jaume, thanks, but I am not sure; it depends on its implementation, I think. Where can I find more info?
Ruud
On 16-09-14 at 12:26, Jaume Ortolà i Font wrote:
2014-09-16 11:21 GMT+02:00 R.J. Baars r.j.ba...@xs4all.nl:
We don't agree. There is a spell checker, but also a single-word ignore list for it. There are XML rules, but also a SimpleReplace rule and a compounding rule. So apart from the hammer and the screwdriver, there are more tools.
There is indeed another tool for multi-words. It seems that Ruud doesn't know it. We can enable a HybridDisambiguator and add a MultiwordChunker to the disambiguation. With this you can write a list of multi-words with their corresponding tag in a plain text file (multiwords.txt). I use the MultiwordChunker with two objectives: improving disambiguation and avoiding spelling matches in multi-words. Would it be useful for you, Ruud?
Regards, Jaume
But anyway, adding the most frequent ones to the disambiguator works. Getting rid of wrong POS tags and of the 10% reported possible spelling errors on the entire corpus is a higher priority. And fixing false positives. Having almost doubled the number of rules is enough for this month.
Ruud
On 2014-09-16 at 09:03, R.J. Baars wrote:
A word like 'Aviv' is not correct unless 'Tel' comes before it, so it is best to leave 'Tel' and 'Aviv' out of the spell checker. That results in the spell checker reporting errors for 'Aviv'. In the disambiguator, there is the option to block that by making an immunizing rule:

<!-- Tel Aviv -->
<rule id="TEL_AVIV" name="Tel Aviv">
  <pattern>
    <token>Tel</token>
    <token>Aviv</token>
  </pattern>
  <disambig action="ignore_spelling"/>
</rule>

That works perfectly. But there are a lot of these word combinations. Wouldn't it be better to have a multi-word ignore list for the spell checker? (Or even a multi-word spell checker that does not just know 'correct' and 'not in list', but 'correct', 'incorrect' and 'not in list'?)
It would not be an enhancement, as this would not give new functionality but cripple the existing one.
Also, the ability to use all XML syntax is extremely important to me (I use POS tags and regular expressions), so I wouldn't make use of the multi-word spell checker anyway. So we'd have
Re: spell checker enhancement
Okay, thanks. Good to know. This is, however, not the time to do that; currently there is a lot more to do to bring what is already in the Dutch LT to better quality. I will keep it in mind for later, when I have a clearer view of the remaining spelling issues. (Currently, 10% of the sentences collected from internet sources have at least one spelling error.)

Ruud

On 16-09-14 at 15:31, Jaume Ortolà i Font wrote:
2014-09-16 14:43 GMT+02:00 R.Baars baar...@xs4all.nl:
How is that done?
Ruud
Do you mean ignoring tagged words in spell checking (even if they are not in the dictionary)? It's a configurable option of the speller (at least in the Morfologik speller rule). A line of Java code.
Jaume
On 16-09-14 at 13:23, Jaume Ortolà i Font wrote:
2014-09-16 13:03 GMT+02:00 R.Baars baar...@xs4all.nl:
I see. This is probably of no use for spell checking, but it is for POS tagging.
It gives no suggestions, but it can be used for avoiding false positives in spell checking, if you configure it so that tagged words are ignored.
Does 'Abu Dhabi NPCNG00' cause both words to be tagged with that tag, or are they considered one token with that POS tag?
Tokenization is not changed. In this case:

<token postag="NPCNG00">Abu</token>
<token postag="/NPCNG00">Dhabi</token>

If there are more than two tokens, the inside tokens are not tagged. Perhaps this should be optionally changed (i.e., tag the inside tokens too).
Regards, Jaume
(Might come in handy for just this tagging.)
Ruud
On 16-09-14 at 12:56, Jaume Ortolà i Font wrote:
Hi, Ruud. I can't find any documentation. It is used in Polish, French, Catalan, Russian, Ukrainian and Spanish. Implementation: enable it (Java) and create a multiwords.txt in your resources folder like these [1]. The tokens are separated by white space, and the tag is separated by a tab. Result: the first token of the multiword is tagged with <POSTAG> and the last token is tagged with </POSTAG>. The MultiwordChunker is case-insensitive. I would like to make it configurable, especially for first-letter uppercase.
Regards, Jaume
[1] https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/pl/src/main/resources/org/languagetool/resource/pl/multiwords.txt
https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/ca/src/main/resources/org/languagetool/resource/ca/multiwords.txt
2014-09-16 12:33 GMT+02:00 R.Baars baar...@xs4all.nl:
Jaume, thanks, but I am not sure; it depends on its implementation, I think. Where can I find more info?
Ruud
On 16-09-14 at 12:26, Jaume Ortolà i Font wrote:
2014-09-16 11:21 GMT+02:00 R.J. Baars r.j.ba...@xs4all.nl:
We don't agree. There is a spell checker, but also a single-word ignore list for it. There are XML rules, but also a SimpleReplace rule and a compounding rule. So apart from the hammer and the screwdriver, there are more tools.
There is indeed another tool for multi-words. It seems that Ruud doesn't know it. We can enable a HybridDisambiguator and add a MultiwordChunker to the disambiguation. With this you can write a list of multi-words with their corresponding tag in a plain text file (multiwords.txt). I use the MultiwordChunker with two objectives: improving disambiguation and avoiding spelling matches in multi-words. Would it be useful for you, Ruud?
Regards, Jaume
But anyway, adding the most frequent ones to the disambiguator works. Getting rid of wrong POS tags and of the 10% reported possible spelling errors on the entire corpus is a higher priority. And fixing false positives. Having almost doubled the number of rules is enough for this month.
Ruud
On 2014-09-16 at 09:03, R.J. Baars wrote:
A word like 'Aviv' is not correct unless 'Tel' comes before it, so it is best to leave 'Tel' and 'Aviv' out of the spell checker. That results in the spell checker reporting errors for 'Aviv'. In the disambiguator, there is the option to block that by making
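The "ignore tagged words" speller option Jaume mentions amounts to this: a token that already carries a POS tag (for example one assigned by the MultiwordChunker) never reaches the dictionary lookup. A toy sketch of that behaviour (the function and data shapes are invented; in LT this is a speller configuration flag, not user code):

```python
# Sketch of "ignore tagged words" in the speller: any token carrying
# a POS tag is skipped, so "Aviv" (tagged via the multi-word chunker)
# is never reported, while genuinely unknown untagged words still are.
def misspelled(tokens, dictionary):
    # tokens: list of (word, postag_or_None) pairs
    return [w for w, tag in tokens
            if tag is None and w.lower() not in dictionary]
```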
Re: simplify GenericUnpairedBracketsRule implementation
I think it would be great to be able to make multi-sentence rules in XML... I know that is not what is suggested here, but nevertheless...

Ruud

On 15-09-14 at 13:00, Daniel Naber wrote:
Hi, GenericUnpairedBracketsRule detects quotes that do not get closed, etc. So what it does isn't overly complicated, but its implementation is a bit convoluted. I think that's mostly because a quote can be opened in one sentence and closed in another one, and that's not an error; but our rules only get single sentences. GenericUnpairedBracketsRule calls Rule.setAsDeleted(), a method that is only used by GenericUnpairedBracketsRule. Rule also has a field List<RuleMatch> removedMatches, which adds state to the Rule object, which should just be a static rule that doesn't change once created. There are more methods, like Rule.isInRemoved(), which only seem to be used for GenericUnpairedBracketsRule. What about adding a new type of rule that doesn't get a single sentence, but the complete text (as a List of AnalyzedSentence objects)? This way the rule can iterate over the complete text and doesn't need to keep its state between sentences. It won't need to implement reset() either. I haven't tried yet whether this works as expected; I wanted to ask for your opinion first.
Regards, Daniel
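Daniel's proposal is that a rule seeing the whole text as a list of sentences needs no hidden state: pairing can be done with an ordinary stack in a single pass. A sketch of the core idea (function name invented; the real rule would work on AnalyzedSentence objects in Java):

```python
# Sketch of a text-level unpaired-bracket check: because the whole
# text is visible at once, a bracket opened in one sentence and
# closed in another is correctly paired, with no state kept between
# calls and no reset() needed.
def unpaired_positions(sentences, open_ch="(", close_ch=")"):
    stack, unmatched_closes = [], []
    for s_idx, sentence in enumerate(sentences):
        for c_idx, ch in enumerate(sentence):
            if ch == open_ch:
                stack.append((s_idx, c_idx))
            elif ch == close_ch:
                if stack:
                    stack.pop()          # paired, possibly cross-sentence
                else:
                    unmatched_closes.append((s_idx, c_idx))
    # report closes with no open, plus opens never closed
    return unmatched_closes + stack
```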
Re: Dutch WikiCheck
My problem is that enormous amounts of errors are generated by the checks wherever wiki markup is encountered, especially name= attributes and the like. It is not for me, but for any Wikipedia user checking pages. Maybe a built-in Parsoid-like routine? What is it we actually check? Is it enough if all wiki markup is hidden inside some kind of tag?

Ruud

On 15-09-14 at 13:16, Daniel Naber wrote:
On 2014-09-15 10:50, R.J. Baars wrote:
How can I improve LT specifically for Wikipedia? I would like to remove all false positives caused by the wiki markup.
Here's what I think is the proper solution (for the use case of checking the recent-changes feed):
- send the old version of the page to Parsoid (Parsoid is a system that would do the parsing for us, turning Wikipedia markup into something we can actually parse; see https://www.mediawiki.org/wiki/Parsoid)
- send the new version of the page to Parsoid
- make an XML diff to see where the changes are
- run LT on the text of the paragraphs that have been changed
The problem with this is that a) it needs considerable development effort and b) it makes the checking slower and less robust due to the additional HTTP requests it requires. Working on the rules (grammar.xml/disambiguation.xml) to prevent false alarms caused by Wikipedia markup is a hack; I wouldn't recommend that.
I can adjust rules, but how do I get them there to see the results?
You could use this (available from https://languagetool.org/download/snapshots/):
java -jar languagetool-wikipedia.jar wiki-check http://de.wikipedia.org/wiki/Bielefeld
Regards, Daniel
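The diff-then-check part of Daniel's pipeline can be sketched independently of Parsoid. Below, the Parsoid conversion is assumed to have already produced plain-text paragraph lists, and `check_paragraph` stands in for a real LanguageTool call; both function names are invented for illustration.

```python
import difflib

# Sketch of the pipeline above: diff old vs. new paragraphs and run
# the checker only on paragraphs that were changed or added, so
# long-standing markup noise elsewhere on the page is never touched.
def changed_paragraphs(old_paragraphs, new_paragraphs):
    sm = difflib.SequenceMatcher(a=old_paragraphs, b=new_paragraphs)
    changed = []
    for op, _i1, _i2, j1, j2 in sm.get_opcodes():
        if op in ("replace", "insert"):
            changed.extend(new_paragraphs[j1:j2])
    return changed

def check_recent_change(old_paragraphs, new_paragraphs, check_paragraph):
    # check_paragraph: callable returning a list of matches per paragraph
    return [m for p in changed_paragraphs(old_paragraphs, new_paragraphs)
            for m in check_paragraph(p)]
```

The cost Daniel mentions (extra HTTP requests to Parsoid) sits entirely before this step; the diffing itself is cheap.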
include xml
I am quite sure I read somewhere that grammar.xml can include another XML file, but I cannot find the instructions again. Can someone help me with this?

Ruud
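For reference, the standard XML mechanism for pulling one file into another is an external entity declared in the internal DTD subset, as sketched below. Whether LanguageTool's grammar.xml loader actually resolves entities this way is exactly the open question in this mail, so treat the snippet as generic XML rather than confirmed LT behaviour; the file name and lang attribute are invented.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rules [
  <!-- hypothetical include of a second rule file -->
  <!ENTITY ExtraRules SYSTEM "extra-rules.xml">
]>
<rules lang="nl">
  &ExtraRules;
</rules>
```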
Re: Dump
I did so. I will have to wait some time until the process skips to another input file, but I will keep you informed.

Ruud

On 27-05-14 11:06, Marcin Miłkowski wrote:
Hi, maybe it was because of a simple mistake in the isNumberOrDot() method. I fixed it, so today's build should run fine. Could you download the nightly and see whether you get crashes on your data?
Best, Marcin

On 2014-05-27 09:20, R.J. Baars wrote:
Hi. I am currently using languagetool-commandline to check billions of paragraphs. It works fine, except for some dumps, like the ones below. I need the tool to continue, for I need the data. When it has been processed, I might try to find the items it crashes on. It looks like it is all string handling. Could it crash on UTF-8 encoding errors?
Ruud

java.util.concurrent.ExecutionException: java.lang.StringIndexOutOfBoundsException: String index out of range: 0
  at org.languagetool.MultiThreadedJLanguageTool.performCheck(MultiThreadedJLanguageTool.java:101)
  at org.languagetool.JLanguageTool.check(JLanguageTool.java:576)
  at org.languagetool.JLanguageTool.check(JLanguageTool.java:534)
  at org.languagetool.JLanguageTool.check(JLanguageTool.java:530)
  at org.languagetool.commandline.CommandLineTools.checkText(CommandLineTools.java:96)
  at org.languagetool.commandline.Main.handleLine(Main.java:386)
  at org.languagetool.commandline.Main.runOnFileLineByLine(Main.java:286)
  at org.languagetool.commandline.Main.runOnFile(Main.java:166)
  at org.languagetool.commandline.Main.main(Main.java:519)
Caused by: java.util.concurrent.ExecutionException: java.lang.StringIndexOutOfBoundsException: String index out of range: 0
  at java.util.concurrent.FutureTask.report(FutureTask.java:122)
  at java.util.concurrent.FutureTask.get(FutureTask.java:188)
  at org.languagetool.MultiThreadedJLanguageTool.performCheck(MultiThreadedJLanguageTool.java:98)
  ... 8 more
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 0
  at java.lang.String.charAt(String.java:658)
  at org.languagetool.rules.CommaWhitespaceRule.isNumberOrDot(CommaWhitespaceRule.java:130)
  at org.languagetool.rules.CommaWhitespaceRule.match(CommaWhitespaceRule.java:92)
  at org.languagetool.JLanguageTool.checkAnalyzedSentence(JLanguageTool.java:686)
  at org.languagetool.JLanguageTool$TextCheckCallable.call(JLanguageTool.java:995)
  at org.languagetool.JLanguageTool$TextCheckCallable.call(JLanguageTool.java:962)
  at java.util.concurrent.FutureTask.run(FutureTask.java:262)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:744)
Exception in thread "main" java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.StringIndexOutOfBoundsException: String index out of range: 0
  at org.languagetool.MultiThreadedJLanguageTool.performCheck(MultiThreadedJLanguageTool.java:101)
  at org.languagetool.JLanguageTool.check(JLanguageTool.java:576)
  at org.languagetool.JLanguageTool.check(JLanguageTool.java:534)
  at org.languagetool.JLanguageTool.check(JLanguageTool.java:530)
  at org.languagetool.commandline.CommandLineTools.checkText(CommandLineTools.java:96)
  at org.languagetool.commandline.Main.handleLine(Main.java:386)
  at org.languagetool.commandline.Main.runOnFileLineByLine(Main.java:286)
  at org.languagetool.commandline.Main.runOnFile(Main.java:166)
  at org.languagetool.commandline.Main.main(Main.java:519)
Caused by: java.util.concurrent.ExecutionException: java.lang.StringIndexOutOfBoundsException: String index out of range: 0
  at java.util.concurrent.FutureTask.report(FutureTask.java:122)
  at java.util.concurrent.FutureTask.get(FutureTask.java:188)
  at org.languagetool.MultiThreadedJLanguageTool.performCheck(MultiThreadedJLanguageTool.java:98)
  ... 8 more
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 0
  at java.lang.String.charAt(String.java:658)
  at org.languagetool.rules.CommaWhitespaceRule.isNumberOrDot(CommaWhitespaceRule.java:130)
  at org.languagetool.rules.CommaWhitespaceRule.match(CommaWhitespaceRule.java:92)
  at org.languagetool.JLanguageTool.checkAnalyzedSentence(JLanguageTool.java:686)
  at org.languagetool.JLanguageTool$TextCheckCallable.call(JLanguageTool.java:995)
  at org.languagetool.JLanguageTool$TextCheckCallable.call(JLanguageTool.java:962)
  at java.util.concurrent.FutureTask.run(FutureTask.java:262)
  at
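The trace points at String.charAt(0) inside isNumberOrDot, i.e. the method was handed an empty string. Marcin's actual Java fix isn't shown in the thread, but the likely shape of the guard is trivial; sketched here in Python (the predicate itself is an assumption based on the method name):

```python
# Likely shape of the fix for the crash above: guard against the
# empty string before looking at the first character. The exact
# predicate isNumberOrDot uses in Java is assumed from its name.
def is_number_or_dot(s):
    return len(s) > 0 and (s[0] == "." or s[0].isdigit())
```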
Re: homophone detection
Good work has been done using this and more sophisticated tools by the universities of Nijmegen and Tilburg, by A. van den Bosch et al. Their tools are also fully open source. These tools became public as 'valkuil.net' and 'fowl.net'. It requires quite a heavy server. In case you are interested, Prof. van den Bosch is one of my LinkedIn contacts: https://www.linkedin.com/profile/view?id=5639559authType=OUT_OF_NETWORKauthToken=4n9Zlocale=en_USsrchid=525963551399472523579srchindex=1srchtotal=3524trk=vsrp_people_res_nametrkInfo=VSRPsearchId%3A525963551399472523579%2CVSRPtargetId%3A5639559%2CVSRPcmpt%3Aprimary

Ruud

On 07-05-14 16:16, Daniel Naber wrote:
Hi, as you may know, After the Deadline is an open-source text checker, quite similar to LT. It's not maintained anymore, so why not use some of its ideas in LT? A paper describing AtD is available at [1]; it's well written and provides a good overview of AtD. One interesting idea is to detect wrong words based on statistics. AtD has a (manually created) set of words that can be easily confused. If such a word is found in a text, the probability of that word in its context is calculated and compared to the probability of the similar words in the same context. If the word from the text is less probable, an error is assumed, and a more probable word is suggested. If this approach works, it's easier than writing rules: just add a set of easily confused words like 'adapt'/'adopt' to a file, and the rest will happen automatically. What you need, though, is a huge corpus to calculate the probabilities. The Google n-gram corpus [2] might be used for that. AtD has been evaluated against a dyslexia corpus [3] with a recall of 27%. Running LT on the same corpus (see RealWordCorpusEvaluator), we get only 19% recall, and that only considers whether an error was detected, not whether the correction was correct. So there's clearly something to gain for LT here.
I have checked in some prototypical work for a statistical homophone rule in LT into the new 'confusion-rule' branch.
Regards, Daniel
[1] http://aclweb.org/anthology-new/W/W10/W10-0404.pdf
[2] http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
[3] http://www.dcs.bbk.ac.uk/~jenny/resources.html
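The AtD-style check Daniel describes reduces to comparing context counts for each member of a confusion set. A toy sketch (function name, threshold, and the trigram-count dictionary are all invented; a real implementation would query an n-gram corpus such as the Google n-grams mentioned above):

```python
# Sketch of the confusion-set idea: for a word in a known confusion
# set, compare corpus counts of each candidate in the same left/right
# context and suggest a swap only if an alternative is far more likely.
def best_candidate(left, word, right, confusion_set, trigram_count):
    candidates = {c: trigram_count.get((left, c, right), 0)
                  for c in confusion_set}
    best = max(candidates, key=candidates.get)
    # require a wide margin (factor of 10 here, arbitrary) to keep
    # precision high, since false alarms are costly
    if best != word and candidates[best] > 10 * candidates.get(word, 0):
        return best
    return word
```

The margin factor is the main tuning knob: AtD's paper discusses similar thresholds for trading recall against precision.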
Re: Wiki
Yes, that is what I meant, thanks. I am able to log in, but not to edit. After clicking 'edit', I get a permission error: "Sorry, you can not edit this page. Only members..."

Ruud

On 10-01-14 17:19, Daniel Naber wrote:
On 2014-01-09 16:26, R.J. Baars wrote:
Could someone please remove this text from the wiki? "Interactive testing of rules using a corpus"
I have removed that paragraph (I guess that's what you meant).
My account does not allow editing.
It should. Can you log in? What's the error message you get?
Regards, Daniel
Re: showing example sentences in GUI
Aha, that makes sense ...

Ruud

On 02-12-13 13:19, Daniel Naber wrote:
On 2013-12-02 13:10, R.J. Baars wrote:
Why show sentences with errors to people who need help getting it right? It is not an objection, more a question: is there a reason from the user's perspective?
Usually the sentences come in pairs: an incorrect sentence, and the same sentence with the error corrected. The reason I'm asking is that I think some of the error descriptions are not trivial to understand, so an example might help.
Regards
Daniel
Re: planning LT homepage relaunch
Idea: map the languages on a globe (might be hard for languages that share countries). Or just group them by continent?

On 28-10-13 19:40, Marcin Miłkowski wrote:
On 2013-10-28 16:57, Daniel Naber wrote:
On 2013-10-23 20:08, Daniel Naber wrote:
I've planned for a long time to relaunch the LT homepage with a better design. Now is the chance to finally do that. These are my *personal* ideas about the future of the LanguageTool homepage. I'm looking for feedback.
So here's a more detailed concept. This can be used for further discussion, and once we think that nothing important is missing, it is also what I will send to the designer.

I'll leave out the idea of using a CMS for now, for two reasons: I don't know Joomla or any other CMS, so it would be one more thing to take care of in addition to the design and structure; and it's not trivial to use technically: we cannot just switch the DNS for languagetool.org to point to some external CMS host, because languagetool.org also runs our API.

Elements we need on every page:
- Logo and page title, or a claim (on the homepage)
- Links to Twitter, Facebook, and the imprint. We could include the Twitter feed to have announcements and news on the homepage, and make it easy for others to retweet.
- Field for subscribing to the announcement mailing list
- Site navigation

For site navigation, I suggest these items with sub-items:
- Languages
  * Overview (what is today http://languagetool.org/languages/)
  * All 29 or so languages. This is a lot... some structure would be nice.
That may be problematic, because people who speak a language might be discouraged if they find, for example, linguistic groups...
- Download
  * LO/OO
  * LanguageToolFx
  * Plugins - just links to external pages (now on http://languagetool.org/links/)
- Support
  * FAQ - what's now the list of common problems, http://languagetool.org/issues/
  * Forum - external, not sure if it can be integrated nicely
  * Contact
- Development
  * Several links to the Wiki
  * Mailing List
  * Bug Reports
  * WikiCheck

What's not covered is the 'screenshots' page and the 'usage' page. I think screenshots can be put on the download pages, and usage information is already mostly covered by the popup that opens on download.

Homepage: The homepage contains an intro sentence and a text form for trying LT. It is pre-filled with an example text in the user's native language (assuming their browser language is just that). If we don't support that language, we fall back to English. The text form no longer has a drop-down for selecting the language, as it does today, since the language can be switched in the navigation. If the currently selected language has a detail page (like http://languagetool.org/fr/), there's a big link below the text box, "Learn more about LanguageTool's support for language", that leads to that page (any better idea for integrating those language pages?).

Visual design: Modern, but not too stylish. I think http://git-scm.com is a nice example of a modern design.

What do you think? Anything that is missing?

I like these ideas.
Regards,
Marcin

Regards
Daniel
Re: en bloc
That is French. Why not remove the noun tags for this word group in the disambiguator?

On 27-10-13 07:08, Kumara Bhikkhu wrote:
"They walked out of the _hall en bloc_." is flagged by rule id=THREE_NN, name="Readability: Three nouns in a row". I am not sure what the best way to fix this is. Using an exception would solve this one case, but problems would remain for other rules if LT still regards "en bloc" as two nouns.
kb
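One way that suggestion could be implemented is an entry in the English disambiguation.xml. The fragment below is an untested sketch: the rule id is a placeholder, and the exact action and attribute names should be verified against the disambiguator documentation before use. An "immunize"-style action, which shields the matched tokens from all rules, would sidestep the problem without retagging the words.

```xml
<!-- Untested sketch for disambiguation.xml; the rule id is a placeholder.
     The intent: shield the fixed phrase "en bloc" from noun-sequence
     rules such as THREE_NN. Verify the action name against the
     disambiguator documentation. -->
<rule id="EN_BLOC" name="en bloc">
  <pattern>
    <token>en</token>
    <token>bloc</token>
  </pattern>
  <disambig action="immunize"/>
</rule>
```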
Re: planning LT homepage relaunch
Good free templates, as well as glossaries, are available when using a CMS like Joomla. It also supports multiple languages and switching between them, and it supports contributions by multiple people. If you would appreciate it, I could ask my Joomla provider whether I could offer you this service from the server I am already paying for.

Ruud

On 24-10-13 09:32, Mike Unwalla wrote:
A clearer website would be good. Sometimes, I struggle to remember where to find information. Information is fragmented across many different locations.
You do not necessarily have to pay a web designer. Many designs are free. Refer to http://www.csszengarden.com. If you use a web designer, make sure that the web designer conforms to best practice (W3C standards for CSS, accessibility, internationalization). Many web designers know only about making a website look pretty.
I like the idea of giving all languages equal status. English is just another option. I did not know that there are LT web pages in languages other than English.
A search page would be useful. (For the TechScribe website, I use the free search tools from www.master.com.)
When I first started to use LT, I struggled with much of the technical terminology. Both the LT terminology and the third-party terminology were not clear to me. A glossary would be useful. (I volunteer to develop a glossary.)

Regards,
Mike Unwalla
Contact: www.techscribe.co.uk/techw/contact.htm

-----Original Message-----
From: Daniel Naber [mailto:list2...@danielnaber.de]
Sent: 23 October 2013 19:09
To: LanguageTool Developer List
Subject: planning LT homepage relaunch

Hi,
I've planned for a long time to relaunch the LT homepage with a better design. Now is the chance to finally do that. These are my *personal* ideas about the future of the LanguageTool homepage. I'm looking for feedback.
Our current homepage (http://languagetool.org) suffers from the following problems:
- It looks overcrowded.
- No nice, modern design.
- All non-English languages are just second-class citizens; for example, they cannot easily be reached through the navigation bar.
- English doesn't have a language-specific page with room for examples.

Also, but less important in my opinion:
- It's not always clear what content belongs to the homepage, community.languagetool.org, and the Wiki.
- The homepage, community.languagetool.org, and the Wiki have different layout styles.
- It makes use of a do-it-yourself PHP template system.

Proposed solution:
- Looks overcrowded:
  * Plan a navigation bar with only a few main items: Download, Languages, Support, Developer.
  * Move everything from the homepage to sub-pages, except the "try it online" form and the navigation.
- No nice design:
  * Pay a professional designer to make a new design.
  * Make a modern design, but not one so modern and stylish that it looks outdated in a year again.
- All non-English languages are just second-class citizens:
  * Translate the cleaned-up homepage into all languages we support and make sure the user is automatically redirected to the homepage that matches their browser's default language (which is probably their native language).
  * Use Transifex for translations of short texts.
  * Make all languages reachable through the navigation bar.
  * Keep the current language sub-pages like http://languagetool.org/de/ - these are not direct translations from English but often contain more information.
- English doesn't have a language-specific page, with room for examples:
  * Make an English sub-page just like for the other languages.

As you can see, this only addresses the main issues, not what I list above as 'less important' issues. Please let me know what you think.

Regards
Daniel