Hi!

I'm building a parser for a Natural Language Processing task, i.e.
parsing the output of a real NLP parser so that it can be loaded into an RDBMS.

I have come across a very annoying type of ambiguity in the input.

Let me quickly explain. I've got the following rules:

  Word:        '|' Lemma Morpho(?) Position(?) POSpeech '|'
  Morpho:      /\\+/ /[a-z\\\\]*/i              { $item[2] }
  Position:    /\\:/ /[0-9]+/                   { $item[2] }
  POSpeech:    /\\_/ /[^\\|\\s\\(\\)\\_]+/      { $item[2] }
  Lemma:       /[^\\|\\s\\+\\:\\_\\"]+/i        { $item[1] }

These rules are supposed to match input tokens such as

  |dog+s:43_NN1| or dog+s:43_NN1 or |dog+s_NN1| or |dog:43_NN1|

and return

  { 'Lemma'    => 'dog',
    'Morpho'   => [ 's' ],
    'Position' => [ 43 ],
    'POSpeech' => 'NN1' }

or a similar but smaller structure, depending on which optional parts are
present in the input.

My problem is that these rules assume that the lemma can't contain any of
the characters [+:_].
Unfortunately, the input sometimes contains lemmas that do:

  |50:50:42_MC| or |+50dollar:23_FO|

where the lemmas are "50:50" and "+50dollar" respectively.

However, if I change the Lemma rule to

  Lemma:       /[^\\|\\s\\"]+/i        { $item[1] }

I then never match any Morpho, Position or POSpeech anymore, because
everything they would match also satisfies the relaxed Lemma pattern, so
Lemma consumes it first.
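
To make the problem concrete, here is a small illustration in Python (not
the Perl grammar itself). It applies the relaxed Lemma pattern, with the
grammar-string escaping removed, to the text that follows the opening '|'.
The helper name lemma_consumes is made up for this example.

```python
import re

# The relaxed Lemma pattern from above, with the doubled backslashes removed.
RELAXED_LEMMA = re.compile(r'[^|\s"]+', re.IGNORECASE)


def lemma_consumes(token_body: str) -> str:
    """Return the prefix of token_body that the relaxed Lemma pattern matches."""
    match = RELAXED_LEMMA.match(token_body)
    return match.group(0) if match else ""


# The text that follows the opening '|' of the token |dog+s:43_NN1|
body = "dog+s:43_NN1|"
consumed = lemma_consumes(body)
print(repr(consumed))              # the part taken by Lemma
print(repr(body[len(consumed):]))  # what is left for the other rules
```

Only the closing '|' is left over, so POSpeech, which needs a leading '_',
has nothing to match. As far as I understand, the parser does not go back
and retry a shorter Lemma match, so the whole Word rule fails.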

What I actually need is to match the input from right to left rather than
left to right, so that the Position, POSpeech and Morpho rules are tried
before the Lemma rule, and Lemma simply consumes whatever remains of the
input.
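
One way to sketch this right-to-left behaviour is a single regular
expression over the whole token, with a non-greedy lemma followed by the
optional suffixes and anchored on the closing '|'. The example below is in
Python purely for illustration; the names TOKEN and parse_word are made up,
the character classes are taken from the rules above, and returning plain
values (or None) instead of array references is my own simplification.

```python
import re

# Lemma is non-greedy, so the regex engine keeps extending it only until the
# remaining text can be read as [+Morpho][:Position]_POSpeech followed by '|'.
TOKEN = re.compile(
    r"""
    ^\|
    (?P<lemma>[^|\s"]+?)            # as short as possible
    (?:\+(?P<morpho>[a-z\\]*))?     # optional +Morpho
    (?::(?P<position>[0-9]+))?      # optional :Position
    _(?P<pos>[^|\s()_]+)            # mandatory _POSpeech
    \|$
    """,
    re.VERBOSE | re.IGNORECASE,
)


def parse_word(token: str):
    """Split one |...| token into its parts, or return None if it does not match."""
    m = TOKEN.match(token)
    if m is None:
        return None
    position = m.group("position")
    return {
        "Lemma": m.group("lemma"),
        "Morpho": m.group("morpho"),
        "Position": int(position) if position is not None else None,
        "POSpeech": m.group("pos"),
    }


for tok in ("|dog+s:43_NN1|", "|50:50:42_MC|", "|+50dollar:23_FO|"):
    print(tok, parse_word(tok))
```

With this pattern "50:50" and "+50dollar" come out as lemmas, because the
suffixes are only accepted when they line up with the end of the token. It
does not remove every ambiguity: a token such as |50:50_MC| with no
position would still be read as lemma "50" with position 50.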

I would appreciate any opinions and advice on how to get this to work.

Cheers,

   Fabre Lambeau
   Natural Language & Information Processing
   Computer Laboratory
   University of Cambridge

   15 JJ Thomson Avenue
   Cambridge
   CB3 0FD
   +44 1223 763 561

   +44 7773 745 741