Hi! I'm building a parser for a Natural Language Processing task, i.e. parsing the output of a real NLP parser in order to load it into an RDBMS.
At some point, I came across a very annoying type of ambiguity in the input. Let me quickly explain. I've got the following rules:

    Word: '|' Lemma Morpho(?) Position(?) POSpeech '|'
    Morpho: /\\+/ /[a-z\\\\]*/i { $item[2] }
    Position: /\\:/ /[0-9]+/ { $item[2] }
    POSpeech: /\\_/ /[^\\|\\s\\(\\)\\_]+/ { $item[2] }
    Lemma: /[^\\|\\s\\+\\:\\_\\"]+/i { $item[1] }

They are supposed to match tokens in the input of the sort

    |dog+s:43_NN1| or dog+s:43_NN1 or |dog+s_NN1| or |dog:43_NN1|

and return

    { 'Lemma' => 'dog', 'Morpho' => [ 's' ], 'Position' => [ 43 ], 'POSpeech' => 'NN1' }

or a similar but lighter version, according to what is in the input.

My problem is that these rules assume that the lemma can't contain any of the characters [+:_]. Unfortunately, the input sometimes contains cases where it does:

    |50:50:42_MC| or |+50dollar:23_FO|

where the lemmas are (resp.) "50:50" and "+50dollar". However, if I change the Lemma rule to

    Lemma: /[^\\|\\s\\"]+/i { $item[1] }

I then never match any Morpho, Position or POSpeech anymore, because what they contain fits the requirements for the Lemma perfectly, so Lemma consumes everything up to the closing '|'.

What I actually need is for it to match the input from right to left rather than left to right, so that the Position, POSpeech and Morpho rules are tried before the Lemma one, with the effect that Lemma would just consume the rest of the input.

I would appreciate it if anyone could give me their opinion and advice on how to get this to work.

Cheers,

Fabre Lambeau
Natural Language & Information Processing
Computer Laboratory
University of Cambridge
15 JJ Thomson Avenue
Cambridge CB3 0FD
+44 1223 763 561
+44 7773 745 741
[EMAIL PROTECTED]
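P.S. To make the behaviour I'm after concrete, here is a rough sketch in plain Perl, outside Parse::RecDescent. It is only an illustration under my own assumptions: the name parse_word is made up for this example, the regexes are written with single backslashes because they are not embedded in a grammar string, and I have only tried it mentally on the tokens quoted above. The idea is a non-greedy Lemma followed by the optional suffixes anchored at the closing '|', so the lemma grows only as far as it must.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of my grammar): split one |...| token.
# The lemma is matched non-greedily, so the regex engine backtracks
# until the optional Morpho/Position and the mandatory POSpeech fit
# exactly at the end of the token.
sub parse_word {
    my ($token) = @_;
    return unless $token =~ m{
        ^ \|
        ( [^|\s"]+? )             # Lemma: as short as possible
        (?: \+ ( [a-z\\]* ) )?    # optional Morpho, after '+'
        (?: :  ( [0-9]+ ) )?      # optional Position, after ':'
        _ ( [^|\s()_]+ )          # POSpeech, after '_'
        \| $
    }xi;

    my %word = ( Lemma => $1, POSpeech => $4 );
    $word{Morpho}   = [ $2 ] if defined $2;
    $word{Position} = [ $3 ] if defined $3;
    return \%word;
}

for my $token ( qw(|dog+s:43_NN1| |50:50:42_MC| |+50dollar:23_FO|) ) {
    my $w = parse_word($token) or next;
    printf "%s: Lemma=%s Morpho=%s Position=%s POSpeech=%s\n",
        $token,
        $w->{Lemma},
        join( ',', @{ $w->{Morpho}   || [] } ),
        join( ',', @{ $w->{Position} || [] } ),
        $w->{POSpeech};
}
```

For |50:50:42_MC| this gives the lemma "50:50" with Position [ 42 ], which is the split I want. What I don't know is how to get the equivalent effect inside the grammar itself.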