Daniel Naber <list2...@danielnaber.de> wrote:

> On 2013-08-13 21:26, Daniel Naber wrote:
>
>> Matching English noun phrases with LT currently seems impossible or
>> awkward (http://wiki.languagetool.org/tips-and-tricks#toc5).
>
> Looking at the example on that page:
>
> [pos="jj"]+
> is equivalent to
> <token postag="jj" skip="-1"><exception negate_pos="yes" scope="next"
> postag="jj"/></token>
>
> The problem with this seems to be that the LT equivalent does not match
> greedily, so from a phrase with three 'jj' tagged tokens you will always
> get only the first one. Or am I missing something?

No, it is not equivalent.  In fact, the example you give
will not work.

In the exception, you have to list not only the POS tag(s) that
you want to skip (fine so far), but also all the POS tags of whatever
follows (and this is a problem!).  It's a problem because:

* it will skip more than expected
* it will fail to match if the token after what we skip carries a POS
  tag that is not listed.  As a result, adding new POS tags in the
  disambiguator, for example, can break rules that use negate_pos="...".

I've tried to use this mechanism to skip items, but I had
so many issues that I gave up on it.  Instead, I write several
rules (one rule to skip 1 item, another rule to skip 2 items, etc.).
At least that works better, but it's painful to maintain because
it multiplies rules.  Also, having many rules probably slows down LT.
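
A minimal sketch of that workaround, assuming we want "the", then one
or two jj tokens, then an nn token.  Only the <pattern> parts are
shown; the enclosing <rule> elements (id, message, examples) are
omitted, and the lowercase tags are the illustrative ones used in this
thread rather than tags checked against the real English tagset:

```xml
<!-- Rule 1 (sketch): exactly one jj between "the" and the noun -->
<pattern>
  <token>the</token>
  <token postag="jj"/>
  <token postag="nn"/>
</pattern>

<!-- Rule 2 (sketch): exactly two jj between "the" and the noun -->
<pattern>
  <token>the</token>
  <token postag="jj"/>
  <token postag="jj"/>
  <token postag="nn"/>
</pattern>
```

Each additional adjective that should be allowed needs one more copy
of the rule, which is where the maintenance cost comes from.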

In the example at http://wiki.languagetool.org/tips-and-tricks
you can see:

[word="the"] [pos="jj"]{1,2} [pos="nn"]

which gives...

<token>the</token>
<token postag="jj" skip="1"><exception scope="next" negate_pos="yes"
postag="jj|nn|SENT_END" postag_regexp="yes"/></token>
<token postag="nn"/>

Notice that the exception must contain "jj|nn|SENT_END"
and not just "jj", even though we want to skip only "jj".
As a result, it will unfortunately also skip all the "nn" tokens.
Furthermore, if the token after what we want to skip carries
some other tag as well (say nn|xx), the match will fail, which is
quite unexpected.  For a next token tagged "nn|xx" to match, the
exception must contain postag="jj|nn|xx|SENT_END".
In practice, it is not always possible to know all the possible
POS tags of the token after what we skip, so it's fragile.
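
To make that concrete, here is roughly how the pattern above would
have to be changed once the following token may also carry the tag xx
(xx is only the placeholder tag from this discussion, not a real tag):

```xml
<token>the</token>
<!-- "xx" is listed only because the noun after the skipped tokens may carry it -->
<token postag="jj" skip="1"><exception scope="next" negate_pos="yes"
postag="jj|nn|xx|SENT_END" postag_regexp="yes"/></token>
<token postag="nn"/>
```

Every further tag that the following token can carry would have to be
added to the exception in the same way.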

So it's not equivalent, contrary to what the wiki page claims.

Besides not working well, it is a tad complicated
to understand (hence this long discussion).

So I would very much welcome something else (using
OpenRegex for example) to be able to skip tokens.  I've
needed it so many times :-)

Regards
Dominique
