Re: The SENT_END challenge
Thank you for your answer, Jaume.

On 9 August 2014 11:30, Jaume Ortolà i Font <jaumeort...@gmail.com> wrote:

> Hi,
>
> A possible and simple solution is to write two rules. One for sentences
> with ending punctuation:
>
>     <pattern>
>       <marker>
>         <token regexp="yes">(you|thei|ou)r</token>
>       </marker>
>       <token regexp="yes">[.?!]</token>
>     </pattern>
>
> And another one for sentences without ending punctuation:
>
>     <pattern>
>       <marker>
>         <token postag="SENT_END" regexp="yes">(you|thei|ou)r</token>
>       </marker>
>     </pattern>
>
> They are in fact two different patterns, so it is logical to use two
> different rules.

Actually they are the same issue, separated only by the lack of an EOS symbol in the second case. The pattern varies because of the tokenization, but the fact is that every rule involving SENT_END must be duplicated. Since there are many potential rules based on the end of a sentence, I think it is worth thinking about a way to avoid this duplication.

BTW, duplicated code (http://en.wikipedia.org/wiki/Duplicate_code) is generally considered a code smell (http://en.wikipedia.org/wiki/Code_smell) and thus should be avoided.

Regards,

Juan Martorell

--
_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
Re: The SENT_END challenge
Hi,

A possible and simple solution is to write two rules. One for sentences with ending punctuation:

    <pattern>
      <marker>
        <token regexp="yes">(you|thei|ou)r</token>
      </marker>
      <token regexp="yes">[.?!]</token>
    </pattern>

And another one for sentences without ending punctuation:

    <pattern>
      <marker>
        <token postag="SENT_END" regexp="yes">(you|thei|ou)r</token>
      </marker>
    </pattern>

They are in fact two different patterns, so it is logical to use two different rules.

Regards,
Jaume Ortolà

2014-08-08 15:21 GMT+02:00 Juan Martorell <juan.martor...@gmail.com>:

> [...]
The SENT_END challenge
Hi all,

I've been struggling with some rules in Spanish involving words that cannot close a sentence. They work fine with one exception: when the sentence does not end with punctuation.

To illustrate the issue, I'll contribute a brand-new rule in English. The token "to" cannot end a sentence after a comma; it is very likely a typo and should be "too":

    I want to be an astronaut, to.
    I am contributing, to.

The rule I developed, to be included in the TO_TOO group (line 2040 as of today), is the following:

    <rule>
      <pattern>
        <marker>
          <token>,</token>
          <token>to</token>
        </marker>
        <token postag="SENT_END"/>
      </pattern>
      <message>Did you mean <suggestion>too</suggestion>?</message>
      <short>Possible typo</short>
      <example correction="too" type="incorrect">I want water, <marker>to</marker>.</example>
      <example type="correct">That is the target we should aim to.</example>
      <example type="correct">Time to go to bed.</example>
    </rule>

The rule works correctly:

    juan@ubuntu:~/languagetool/lab$ java -jar ../dist/languagetool-commandline.jar -c UTF8 --language en
    Expected text language: English
    Working on STDIN...
    I am tired, to.
    1.) Line 1, column 11, Rule ID: TO_TOO[6]
    Message: Did you mean 'too'?
    Suggestion: too
    I am tired, to.

But if you omit the full stop, then comes the trouble:

    juan@ubuntu:~/languagetool/lab$ java -jar ../dist/languagetool-commandline.jar -c UTF8 --language en -v
    Expected text language: English
    Working on STDIN...
    I am tired, to
    1085 rules activated for language English
    <S> I[I/PRP,B-NP-singular|E-NP-singular] am[be/VBP,B-VP] tired[tire/VBN,I-VP],[,/,,O] to[to/IN,to/TO,</S>,O] <P/>
    Disambiguator log:
    VBN_VBD:1 tired[tired/JJ,tire/VBD,tire/VBN,I-VP] -> tired[tire/VBN,I-VP]

And false positives arise like flowers in spring:

    I want to see, to hear, to feel
    1085 rules activated for language English
    <S> I[I/PRP,B-NP-singular|E-NP-singular] want[want/VBP,B-VP] to[to/IN,to/TO,I-VP] see[see/NN,see/VB,see/VBP,I-VP],[,/,,O] to[to/IN,to/TO,B-VP] hear[hear/VB,hear/VBP,I-VP],[,/,,O] to[to/IN,to/TO,B-VP] feel[feel/NN,feel/VB,feel/VBP,</S>,I-VP] <P/>
    Disambiguator log:
    SENT_START_PRP_VB_NN:1 want[want/NN:UN,want/VB,want/VBP,B-VP] -> want[want/VBP,B-VP]
    1.) Line 7, column 23, Rule ID: TO_TOO[6]
    Message: Did you mean 'too'?
    Suggestion: too
    I want to see, to hear, to feel

Thus we have a case. Does anyone have an idea how to address this challenge?

Some ideas:

There is a workaround in rule THE_SENT_END. It changes the last token node:

    <token postag="SENT_END" regexp="yes">\.|\?|!</token>

This limits the false positives, but it disables detection in titles, enumerations and limericks.

There is another rule I brewed, mixing SENT_END with punctuation:

    <rule id="PPR_END" name="Possessive at end of sentence" type="typographical">
      <pattern>
        <marker>
          <token regexp="yes">(you|thei|ou)r</token>
        </marker>
        <token postag="SENT_END" regexp="yes">\.|\?|!</token>
      </pattern>
      <message>Did you mean <suggestion>\1s</suggestion>?</message>
      <short>Possible typo</short>
      <example correction="yours" type="incorrect">The choice is <marker>your</marker>.</example>
      <example correction="ours" type="incorrect">We brought <marker>our</marker>.</example>
      <example type="correct">They brought their food.</example>
      <example type="correct">Take your chances</example>
    </rule>

This rule would also be affected.

In my opinion, the key is to add an EOS token when it does not exist. At tokenizing time, if the delimiter is a double CR/LF, that should generate a separate token with lemma EOS and POS </S>.
For instance:

    This is your
    <S> This[this/DT,B-NP-singular|E-NP-singular] is[be/VBZ,B-VP] your[your/PRP$,</S>,I-VP] <P/>

would become

    <S> This[this/DT,B-NP-singular|E-NP-singular] is[be/VBZ,B-VP] your[your/PRP$,B-NP-singular|E-NP-singular]EOS[null,</S>,O] <P/>

If you play around with this, you'll notice that it also affects disambiguation.

Please note that I won't commit the rules, so that I don't mess up the grammar.xml file. I kindly ask the maintainer to make that commit.

Best regards,

Juan Martorell
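
The effect of the proposed EOS token can be illustrated with a small toy model. This is only a sketch under stated assumptions, not LanguageTool's tokenizer or rule engine: the names `tokenize`, `comma_to_at_end` and `add_eos` are hypothetical, the tokenizer is a naive regular expression, and the input is assumed to be a single sentence already. In the model, the last token always carries SENT_END, as in the tagger output above, and the rule looks for ",", "to" and then a SENT_END token.

```python
import re

# Toy model only: NOT LanguageTool's tokenizer or rule engine.
END_PUNCT = {".", "?", "!"}


def tokenize(sentence, add_eos=False):
    """Return (text, is_sent_end) pairs; the last token carries SENT_END.

    With add_eos=True, a synthetic "EOS" token (hypothetical) is appended
    when the sentence has no closing punctuation, so that SENT_END always
    sits on a dedicated token instead of on the last word.
    """
    words = re.findall(r"\w+|[^\w\s]", sentence)
    if add_eos and (not words or words[-1] not in END_PUNCT):
        words.append("EOS")
    return [(w, i == len(words) - 1) for i, w in enumerate(words)]


def comma_to_at_end(tokens):
    """Match the pattern: ',' then 'to' then any token tagged SENT_END."""
    for i in range(len(tokens) - 2):
        (a, _), (b, _), (_, is_end) = tokens[i], tokens[i + 1], tokens[i + 2]
        if a == "," and b.lower() == "to" and is_end:
            return True
    return False


if __name__ == "__main__":
    for text in ("I am tired, to.",
                 "I am tired, to",
                 "I want to see, to hear, to feel"):
        plain = comma_to_at_end(tokenize(text))
        with_eos = comma_to_at_end(tokenize(text, add_eos=True))
        print(f"{text!r}: without EOS={plain}, with EOS={with_eos}")
```

Without the EOS token, the model misses "I am tired, to" (the word "to" itself carries SENT_END, so no third token follows) and wrongly flags "I want to see, to hear, to feel" (the word "feel" carries SENT_END). With the EOS token, the same single pattern flags both "I am tired, to." and "I am tired, to" and leaves the third sentence alone, which is the duplication-free behaviour the proposal aims for.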