Daniel Naber <list2...@danielnaber.de> wrote:

> On 2013-08-13 21:26, Daniel Naber wrote:
>
>> Matching English noun phrases with LT currently seems impossible or
>> awkward (http://wiki.languagetool.org/tips-and-tricks#toc5).
>
> Looking at the example on that page:
>
> [pos="jj"]+
> is equivalent to
> <token postag="jj" skip="-1"><exception negate_pos="yes" scope="next" postag="jj"/></token>
>
> The problem with this seems to be that the LT equivalent does not match
> greedily, so from a phrase with three 'jj' tagged tokens you will always
> get only the first one. Or am I missing something?

No, it is not equivalent. And in fact, the example you give will not work.

In the exception, you have to put not only the POS tag(s) that you want to
skip (fine so far), but also all the POS tags of what follows (and this is
a problem!). It's a problem because:

* it will skip more than expected
* and it will fail to match if the token after what we skip contains
  a POS tag that is not listed. As a result, adding new POS tags in the
  disambiguator, for example, can break rules that use negate_pos="...".

I've tried to use this mechanism to skip items, but I had so many issues
that I gave up on it. Instead, I write several rules (one rule to skip
1 item, another rule to skip 2 items, etc.). At least it works better, but
it's painful to maintain because it multiplies rules. Also, having many
rules probably slows down LT.

In the example at http://wiki.languagetool.org/tips-and-tricks you can see:

  [word="the"] [pos="jj"]{1,2} [pos="nn"]

which gives...

  <token>the</token>
  <token postag="jj" skip="1"><exception scope="next" negate_pos="yes" postag="jj|nn|SENT_END" postag_regexp="yes"/></token>
  <token postag="nn"/>

Notice that the exception must contain "jj|nn|SENT_END" and not just "jj",
even though we want to skip only "jj". As a result, it will unfortunately
also skip all the "nn" tokens.

Furthermore, if the token after what we want to skip carries some other
tag as well (say nn|xx), the rule will fail to match, which is quite
unexpected. If the next token contains "nn|xx", then to match, the
exception must contain postag="jj|nn|xx|SENT_END". In practice, it is not
always possible to know all the possible POS tags of the token after what
we skip, so it's fragile.

So it is not equivalent, despite what the page claims.

Besides not working well, it is a tad complicated to understand (hence
this long discussion). So I would very much welcome something else (using
OpenRegex for example) to be able to skip tokens.
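
To make the "several rules" workaround above a bit more concrete, here is a
minimal Python sketch that expands a bounded repetition such as
[pos="jj"]{1,2} into one explicit token sequence per repetition count. The
helper names (token, expand_repetition) are made up for this illustration
and are not part of LT; the output only uses the <token> forms already
shown in this thread, and the surrounding rule and pattern elements are
left out.

```python
from xml.sax.saxutils import escape, quoteattr


def token(postag=None, word=None):
    """Render one <token> element, using only the forms seen in this thread."""
    if word is not None:
        return "<token>%s</token>" % escape(word)
    return "<token postag=%s/>" % quoteattr(postag)


def expand_repetition(before, repeated, low, high, after):
    """Return one explicit token sequence per repetition count in [low, high].

    This is the manual workaround (one rule per count), not a greedy matcher.
    """
    if not 0 <= low <= high:
        raise ValueError("need 0 <= low <= high")
    return [before + [repeated] * n + after for n in range(low, high + 1)]


# [word="the"] [pos="jj"]{1,2} [pos="nn"] written out as explicit variants.
variants = expand_repetition(
    before=[token(word="the")],
    repeated=token(postag="jj"),
    low=1,
    high=2,
    after=[token(postag="nn")],
)

for number, sequence in enumerate(variants, 1):
    print("variant %d:" % number)
    for element in sequence:
        print("  " + element)
```

Each variant would still have to be pasted into its own rule by hand, so
the number of rules grows with the upper bound of the repetition, which is
exactly the maintenance cost described above.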

I've needed it so many times :-)

Regards
Dominique

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel