Hi!
I'm building a parser for a Natural Language Processing task, i.e.
parsing the output of a real NLP parser in order to load it into an RDBMS.
At some point, I came across a very annoying type of ambiguity in the input.
Let me quickly explain. I've got the following rules:
Word: '|' Lemma Morpho(?) Position(?) POSpeech '|'
Morpho: /\\+/ /[a-z\\\\]*/i { $item[2] }
Position: /\\:/ /[0-9]+/ { $item[2] }
POSpeech: /\\_/ /[^\\|\\s\\(\\)\\_]+/ { $item[2] }
Lemma: /[^\\|\\s\\+\\:\\_\\"]+/i { $item[1] }
These are supposed to match input tokens such as
|dog+s:43_NN1| or dog+s:43_NN1 or |dog+s_NN1| or |dog:43_NN1|
and return
{ 'Lemma' => 'dog',
'Morpho' => [ 's' ],
'Position' => [ 43 ],
'POSpeech' => 'NN1' }
or a similar but lighter version, according to what is in the input.
My problem is that these rules assume that the lemma can't contain any of
the characters [+:_].
Unfortunately, the input sometimes contains cases where it does:
|50:50:42_MC| or |+50dollar:23_FO|
where the lemmas are "50:50" and "+50dollar" respectively.
However, if I change the Lemma rule to
Lemma: /[^\\|\\s\\"]+/i { $item[1] }
then Morpho, Position and POSpeech never match anymore, because everything
they would match also satisfies the Lemma pattern, which greedily consumes it
first.
What I'd actually need is to match the input from right to left rather than
left to right, so that the Position, POSpeech and Morpho rules are tried
before the Lemma one, and Lemma simply consumes whatever is left of the
input.
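To make the behaviour I'm after concrete, here is a minimal sketch in plain
Perl, outside Parse::RecDescent. It is not the grammar above: parse_word is
a hypothetical helper name, and it assumes a whole token can be handled by one
regex anchored at both ends, with a non-greedy lemma so that the optional
suffixes get their chance first. The bars are treated as optional only
because one of my sample tokens has none. A lemma that itself ends in
something shaped like "+letters" or ":digits" would still be ambiguous; here
the shortest possible lemma wins.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the grammar above): splits one token
# into its fields with a single regex anchored at both ends.
sub parse_word {
    my ($token) = @_;
    return unless $token =~ m{
        \A \|?                      # opening bar, optional in this sketch
        ( [^|\s"]+? )               # Lemma: non-greedy, may contain + : _
        (?: \+ ( [a-z\\]* ) )?      # Morpho, optional
        (?: :  ( [0-9]+   ) )?      # Position, optional
        _ ( [^|\s()_]+ )            # POSpeech, required
        \|? \z                      # closing bar, optional in this sketch
    }xi;

    # Same shape as the hash shown earlier: optional fields become
    # one-element array refs and are left out when absent.
    my %word = ( Lemma => $1, POSpeech => $4 );
    $word{Morpho}   = [ $2 ] if defined $2;
    $word{Position} = [ $3 ] if defined $3;
    return \%word;
}

my $w = parse_word('|50:50:42_MC|');
print "$w->{Lemma} / $w->{Position}[0] / $w->{POSpeech}\n";
```

Because the regex is anchored at the end of the token, the lemma can only
grow until the remaining text fits the suffix patterns, which is the
"suffixes first, lemma takes the rest" effect described above.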
I would appreciate it if anyone could give me their opinion and advice on how
to get this to work.
Cheers,
Fabre Lambeau
Natural Language & Information Processing
Computer Laboratory
University of Cambridge
15 JJ Thomson Avenue
Cambridge
CB3 0FD
+44 1223 763 561
+44 7773 745 741