Re: The SENT_END challenge

2014-08-10 Thread Juan Martorell
Thank you for your answer, Jaume.


On 9 August 2014 11:30, Jaume Ortolà i Font jaumeort...@gmail.com wrote:

 Hi,

 A possible and simple solution is to write two rules. One for sentences
 with ending punctuation:

 <pattern>
   <marker>
     <token regexp="yes">(you|thei|ou)r</token>
   </marker>
   <token regexp="yes">[.?!]</token>
 </pattern>

 And another one for sentences without ending punctuation:

 <pattern>
   <marker>
     <token postag="SENT_END" regexp="yes">(you|thei|ou)r</token>
   </marker>
 </pattern>


 They are in fact two different patterns, so it is logical to use two
 different rules.


Actually, they are the same issue, separated only by the lack of an EOS
symbol in the second case. The pattern varies because of the tokenizing, but
the fact is that every rule involving SENT_END must be duplicated. Since
there are many potential rules based on the end of a sentence, I think it is
worth thinking of a way to avoid this duplication. BTW, duplicated code
(http://en.wikipedia.org/wiki/Duplicate_code) is generally considered a code
smell (http://en.wikipedia.org/wiki/Code_smell) and thus should be avoided.
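
To make the duplication concrete, here is a rough, untested sketch of how
the TO_TOO rule from the original message would have to be split under the
two-rule approach. Only the patterns are shown; the message, short text and
examples would have to be repeated in each rule as well.

```xml
<!-- Sketch only, not tested against LanguageTool -->

<!-- Variant 1: the sentence ends with punctuation -->
<pattern>
  <marker>
    <token>,</token>
    <token>to</token>
  </marker>
  <token regexp="yes">[.?!]</token>
</pattern>

<!-- Variant 2: no ending punctuation, so "to" itself carries SENT_END -->
<pattern>
  <marker>
    <token>,</token>
    <token postag="SENT_END">to</token>
  </marker>
</pattern>
```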

Regards,

Juan Martorell
--
___
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel


Re: The SENT_END challenge

2014-08-09 Thread Jaume Ortolà i Font
Hi,

A possible and simple solution is to write two rules. One for sentences
with ending punctuation:

<pattern>
  <marker>
    <token regexp="yes">(you|thei|ou)r</token>
  </marker>
  <token regexp="yes">[.?!]</token>
</pattern>

And another one for sentences without ending punctuation:

<pattern>
  <marker>
    <token postag="SENT_END" regexp="yes">(you|thei|ou)r</token>
  </marker>
</pattern>


They are in fact two different patterns, so it is logical to use two
different rules.
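
As an illustration, the two patterns could be filed together in one
rulegroup so that they share an ID. This is only a sketch and has not been
tested: the id, name, message and examples are borrowed from the PPR_END
rule in the original message, and placing them in a rulegroup is an
assumption about how they would be organized.

```xml
<!-- Sketch only, not tested against LanguageTool -->
<rulegroup id="PPR_END" name="Possessive at end of sentence">
  <!-- Sentence with ending punctuation -->
  <rule>
    <pattern>
      <marker>
        <token regexp="yes">(you|thei|ou)r</token>
      </marker>
      <token regexp="yes">[.?!]</token>
    </pattern>
    <message>Did you mean <suggestion>\1s</suggestion>?</message>
    <short>Possible typo</short>
    <example correction="yours" type="incorrect">The choice is <marker>your</marker>.</example>
    <example type="correct">They brought their food.</example>
  </rule>
  <!-- Sentence without ending punctuation -->
  <rule>
    <pattern>
      <marker>
        <token postag="SENT_END" regexp="yes">(you|thei|ou)r</token>
      </marker>
    </pattern>
    <message>Did you mean <suggestion>\1s</suggestion>?</message>
    <short>Possible typo</short>
    <example correction="yours" type="incorrect">The choice is <marker>your</marker></example>
    <example type="correct">Take your chances</example>
  </rule>
</rulegroup>
```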

Regards,
Jaume Ortolà




2014-08-08 15:21 GMT+02:00 Juan Martorell juan.martor...@gmail.com:

 [quoted original message snipped; the full text follows below]

The SENT_END challenge

2014-08-08 Thread Juan Martorell
Hi all,

I've been struggling with some rules in Spanish involving words that cannot
close a sentence. They work fine with one exception -- when the sentence
does not end with punctuation.

To illustrate the issue, I'll contribute a brand new rule in English.

The token "to" cannot end a sentence after a comma; in that position it is
very likely a typo and should be "too":

I want to be an astronaut, to.
I am contributing, to.

The rule I developed, to be included in the TO_TOO group (line 2040 as of
today), is the following:

 <rule>
   <pattern>
     <marker>
       <token>,</token>
       <token>to</token>
     </marker>
     <token postag="SENT_END"/>
   </pattern>
   <message>Did you mean <suggestion>too</suggestion>?</message>
   <short>Possible typo</short>
   <example correction="too" type="incorrect">I want water, <marker>to</marker>.</example>
   <example type="correct">That is the target we should aim to.</example>
   <example type="correct">Time to go to bed.</example>
 </rule>

The rule works correctly:

juan@ubuntu:~/languagetool/lab$ java -jar
../dist/languagetool-commandline.jar -c UTF8 --language en
Expected text language: English
Working on STDIN...
I am tired, to.

1.) Line 1, column 11, Rule ID: TO_TOO[6]
Message: Did you mean 'too'?
Suggestion: too
I am tired, to.
  

But if you omit the full stop, the trouble starts. The rule no longer fires,
because "to" itself now carries the SENT_END tag and no token follows it:

juan@ubuntu:~/languagetool/lab$ java -jar
../dist/languagetool-commandline.jar -c UTF8 --language en -v
Expected text language: English
Working on STDIN...
I am tired, to

1085 rules activated for language English
<S> I[I/PRP,B-NP-singular|E-NP-singular] am[be/VBP,B-VP]
tired[tire/VBN,I-VP],[,/,,O] to[to/IN,to/TO,</S>,O] <P/>
Disambiguator log:

VBN_VBD:1 tired[tired/JJ,tire/VBD,tire/VBN,I-VP] -> tired[tire/VBN,I-VP]


And false positives arise like flowers in spring, because the last word of
the line now carries the SENT_END tag:

I want to see, to hear, to feel

1085 rules activated for language English
<S> I[I/PRP,B-NP-singular|E-NP-singular] want[want/VBP,B-VP]
to[to/IN,to/TO,I-VP] see[see/NN,see/VB,see/VBP,I-VP],[,/,,O]
to[to/IN,to/TO,B-VP] hear[hear/VB,hear/VBP,I-VP],[,/,,O]
to[to/IN,to/TO,B-VP] feel[feel/NN,feel/VB,feel/VBP,</S>,I-VP] <P/>
Disambiguator log:

SENT_START_PRP_VB_NN:1 want[want/NN:UN,want/VB,want/VBP,B-VP] ->
want[want/VBP,B-VP]

1.) Line 7, column 23, Rule ID: TO_TOO[6]
Message: Did you mean 'too'?
Suggestion: too
I want to see, to hear, to feel
  


Thus we have a case. Does anyone have an idea on how to address this
challenge?

Some ideas:

There is a workaround in rule THE_SENT_END. It changes the last token node:

<token postag="SENT_END" regexp="yes">\.|\?|!</token>

This limits the false positives, but it also disables detection in titles,
enumerations and limericks, which often lack ending punctuation.
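
Applied to the TO_TOO rule above, that workaround would look roughly like
the following. This is an untested sketch, and only the pattern is shown:

```xml
<!-- Sketch only: the TO_TOO pattern with the THE_SENT_END workaround -->
<pattern>
  <marker>
    <token>,</token>
    <token>to</token>
  </marker>
  <!-- the closing token must be ., ? or ! and carry the SENT_END tag -->
  <token postag="SENT_END" regexp="yes">\.|\?|!</token>
</pattern>
```

With this, "I want to see, to hear, to feel" should no longer match, since
"feel" is not punctuation, but "I am tired, to" without a full stop would go
undetected as well.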

There is another rule I brewed, mixing SENT_END with punctuation:

<rule id="PPR_END" name="Possessive at end of sentence" type="typographical">
  <pattern>
    <marker>
      <token regexp="yes">(you|thei|ou)r</token>
    </marker>
    <token postag="SENT_END" regexp="yes">\.|\?|!</token>
  </pattern>
  <message>Did you mean <suggestion>\1s</suggestion>?</message>
  <short>Possible typo</short>
  <example correction="yours" type="incorrect">The choice is <marker>your</marker>.</example>
  <example correction="ours" type="incorrect">We brought <marker>our</marker>.</example>
  <example type="correct">They brought their food.</example>
  <example type="correct">Take your chances</example>
</rule>

This rule would also be affected.

In my opinion, the key is to add an EOS token when none exists. At
tokenizing time, if the delimiter is a double CR/LF, that should generate a
separate token with lemma EOS and POS </S>. For instance:

This is your

<S> This[this/DT,B-NP-singular|E-NP-singular] is[be/VBZ,B-VP]
your[your/PRP$,</S>,I-VP] <P/>

would become

<S> This[this/DT,B-NP-singular|E-NP-singular]
is[be/VBZ,B-VP] your[your/PRP$,B-NP-singular|E-NP-singular]EOS[null,</S>,O]
<P/>

If you play around with this, you'll notice that it also affects
disambiguation.

Please note that I won't commit the rules myself, so that I don't mess up
the grammar.xml file. I kindly ask the maintainer to make the commit.

Best regards,

Juan Martorell