Re: Suggestion: find POS tag of portion of a word in XML rules

Dominique Pellé Wed, 10 Sep 2014 02:35:24 -0700

Marcin Miłkowski <list-addr...@wp.pl> wrote:

On 2014-09-09 23:10, Dominique Pellé wrote:
> > Daniel Naber <daniel.na...@languagetool.org
> > <mailto:daniel.na...@languagetool.org>> wrote:
> >
> >     On 2014-09-09 22:38, Dominique Pellé wrote:
> >
> >     > * why does your example give a message in
> >     >   the java rule.  Why can't we use <message…></message>
> >     >   instead?
> >
> >     You're right, my example was misleading. <message> can be used.
> >
> >     > * you wrote that args="no:1" refers to the token.
> >     >   What about if we need to use this for one of the
> >     >    <exception>...</exception> inside a token?
> >
> >     We could introduce more attributes like maybe 'regexp_negate'.
> >
> >     > In other words, the rule matches token "(.*)-tu"  where
> >     > the POS of portion in parentheses has to be a verb (V.*).
> >     > But there is an exception if the POS of the portion in parentheses
> >     > matches "V.* 2 .*". So that rule would correctly:
> >
> >     Couldn't that also be expressed with "V.* [13] .*"?
> >
> >
> >
> > No, that would miss at least infinitive verbs "V inf"
> > (e.g. chanter) participles  "V ppa m s"  (chanté)
> > and "V ppr" (chantant).
> >
> > We could of course come up with a regexp that
> > matches all the possible verbs POS  except those
> > "V.* 2 .*" to avoid an exception, but:
> >
> > * that regexp might be rather long as there are
> >    many kinds of verb POS tags. Using an exception is
> >    thus more natural.
> > * and more generally speaking, being able to
> >    match POS of portion of token in exception
> >    can be useful in some other cases anyway too.
>
> Let me understand your problem:
>
> * you want to match all verbs (V.*) that have "-tu" at the end (this is
> <token postag='V.*' postag_regexp="yes" regexp="yes">.*-tu</token>)
> * but not the ones that have verb in a second person: V.* 2 .* So why
> not simply use the old <exception postag="V.* 2 .*"
> postag_regexp="yes"/>? It will be a little bit slow due to regular
> expressions but does everything you need, right? Or am I missing something?




Hi Marcin

Not exactly.

I want to find errors in things like "Peut-tu" and "Peux-il", which
are both incorrect in French. The correct forms are "Peux-tu" (= Can you...)
and "Peut-il" (= Can he...).

The token "Peut-tu" does not have a POS tag (so what you wrote above
does not work).  It's an invalid word. Interestingly, it's not even marked
as invalid by the spelling checker, because Hunspell splits it at
the dash, and "Peut" and "tu" are both valid words.

To detect "Peut-tu" as a mistake, a grammar rule could check that
the POS tag of the portion "Peut" is "V ind pres 3 s" and since -tu
expects a verb before the dash with POS like "V .* 2 .*",
there is a mistake.

For the correct "Peux-tu", the portion "Peux" has POS tag "V ind pres 2 s"
and since -tu expects a verb before the dash with POS like "V .* 2 .*",
no error would be given.
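
As a rough illustration of the check described above, here is a minimal
Python sketch. It is not LanguageTool code and not an existing XML rule
feature: the toy lexicon, the pronoun-to-pattern table (including the
assumption that -il expects "V .* 3 .*") and the function name are all
hypothetical stand-ins for what the tagger would provide.

```python
import re

# Hypothetical toy lexicon standing in for the real POS tagger.
# Tag format follows the examples in this thread ("V ind pres 2 s").
TOY_LEXICON = {
    "peux": ["V ind pres 1 s", "V ind pres 2 s"],
    "peut": ["V ind pres 3 s"],
}

# Assumed mapping: POS pattern the verb before the dash must match.
EXPECTED_POS = {
    "tu": r"V .* 2 .*",
    "il": r"V .* 3 .*",
}


def is_inversion_error(token: str) -> bool:
    """Return True if the verb portion disagrees with the pronoun suffix."""
    m = re.fullmatch(r"(.+)-(tu|il)", token, flags=re.IGNORECASE)
    if m is None:
        return False  # not a "<something>-tu" / "<something>-il" token
    portion = m.group(1).lower()
    pronoun = m.group(2).lower()

    # Look up the POS tags of the portion before the dash only.
    verb_tags = [t for t in TOY_LEXICON.get(portion, []) if t.startswith("V")]
    if not verb_tags:
        return False  # portion is not a known verb: rule does not apply

    # No error as soon as one reading has the expected person.
    expected = EXPECTED_POS[pronoun]
    return not any(re.fullmatch(expected, t) for t in verb_tags)


for example in ("Peut-tu", "Peux-tu", "Peux-il", "Peut-il"):
    print(example, is_inversion_error(example))
```

The point of the sketch is only that the lookup is done on the captured
portion "(.*)" rather than on the whole token, which is the piece that is
missing in XML rules today.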

However, I currently don't have a mechanism to find the POS tag of a
portion of a token such as "Peut-tu", so I can't write such a rule.

I hope that's clearer now.

Regards
Dominique


