Hi,

I'm trying to implement a simple parser for a treebank in which a phrase
looks like this:

A01    3
[S&[N \0Mr_NPT Michael_NP Foot_NP N][V has_HVZ put_VBN V][R down_RP R][N
a_AT resolution_NN [P on_IN [N the_ATI subject_NN N]P]N][S+ and_CC [Na
he_PP3A Na][V is_BEZ V][Ti[Vi to_TO be_BE backed_VBN Vi][P by_IN [N
\0Mr_NPT Will_NP Griffiths_NP ,_, [N \0MP_NPT [P for_IN [N Manchester_NP
Exchange_NP N]P]N]N]P]Ti]S+] ._. S&]
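
To make the nesting explicit: each constituent opens with "[LABEL" and
closes with "LABEL]", and the leaves are word_TAG pairs. Below is a minimal
Python sketch of a reader for this notation. It is only an illustration of
the structure, not the SableCC parser, and the names TOKEN and read_phrase
are hypothetical. It assumes that every leaf contains an underscore and
that labels never do.

```python
import re

# Assumed token shapes, inferred from the sample: a bracket, or a run of
# characters that are neither whitespace nor brackets.
TOKEN = re.compile(r"\[|\]|[^\s\[\]]+")

def read_phrase(tokens, pos=0):
    """Read one bracketed constituent starting at tokens[pos] == '['.

    Returns ((label, children), next_pos). A child is either a nested
    constituent or a (word, tag) pair split at the last underscore.
    """
    if tokens[pos] != "[":
        raise ValueError("expected '[' at token %d" % pos)
    label = tokens[pos + 1]
    pos += 2
    children = []
    while pos < len(tokens):
        tok = tokens[pos]
        if tok == "[":
            # Nested constituent: recurse and continue after its closing bracket.
            child, pos = read_phrase(tokens, pos)
            children.append(child)
        elif tok == label and tokens[pos + 1:pos + 2] == ["]"]:
            # Closing label followed by ']' ends this constituent.
            return (label, children), pos + 2
        elif "_" in tok:
            word, tag = tok.rsplit("_", 1)
            children.append((word, tag))
            pos += 1
        else:
            raise ValueError("unexpected token %r" % tok)
    raise ValueError("missing closing label for %r" % label)

sample = "[S&[N he_PP3A N][V is_BEZ V] ._. S&]"
tree, end = read_phrase(TOKEN.findall(sample))
print(tree)
```

The shortened sample above is made up from pieces of the real phrase; the
header line ("A01    3") is not handled here.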

Here's a simplified version of the grammar I developed for this treebank:

Helpers

 digit = ['0' .. '9'];
 single_quote = ''';
 category_code = ['A' .. 'H'] | ['J' .. 'N'] | 'P' | 'R';
 coordination_suffix = '&' | '+' + '-';
 word_tags = 'NPT' | 'NP' | 'HVZ' | 'VBN' | 'RP' | 'AT' | 'NN' | 'IN' |
'ATI' | 'CC' | 'PP3A' | 'BEZ' | 'TO' | 'BE';
 idiom_tags = word_tags (digit digit)?;
 sentence_tags = 'S' | 'N' | 'V' | 'R' | 'P' | 'Na' | 'Ti' | 'Vi';
 escape_sequence = '\' digit?;
 cr = 13;
 lf = 10;
 tab = 9;
 ascii_character = [0 .. 0xFF];
 /* Is there a nicer way to do this? */
 not_newline = [ascii_character - [cr + lf]];
 not_blank = [not_newline - [tab + ' ']];
 not_bracket = [not_blank - ['[' + ']']];
 not_parenth = [not_bracket - ['(' + ')']];
 not_colons = [not_parenth - [':' + ';']];
 not_periods = [not_colons - ['.' + ',']];
 word_character = [not_periods - ['_' + '?']];

Tokens

 category_label = category_code digit digit;
 phrase_identifier = digit+;
 l_bracket = '[';
 r_bracket = ']';
 punctuation_tag = '(' | ')' | '*' single_quote | '**' single_quote | '*-'
| ',' | '.' | '...' | ':' | ';' | '?';
 constituent_tag = (idiom_tags | sentence_tags) coordination_suffix?;
 word = escape_sequence? word_character+;
 word_separator = '_';
 blank = (cr | lf | tab | ' ')+;

Ignored Tokens

    blank;

Productions

 parsed_corpus = parsed_phrase+;
 parsed_phrase = category_label phrase_identifier phrase_contents;
 phrase_contents = [left]:punctuation? phrase [right]:punctuation?;
 phrase = l_bracket [left]:phrase_tag compound_phrase+ [right]:phrase_tag
r_bracket;
 compound_phrase =
  {punct_phrase} punctuation |
  {tword_phrase} single_word |
  {recur_phrase} phrase;
 phrase_tag =
  {single_tag} constituent_tag;
 single_word = word word_separator [word_tag]:constituent_tag;
 punctuation = [left]:punctuation_tag word_separator [right]:punctuation_tag;


The problem is that, at some point, a tag token is identified as a word
token by the lexer, generating a parser exception. More specifically, "S+"
is identified as a word in the string "[S+ ... S+]" while it should be
identified as a tag.

As far as I understand, token priority is given by declaration order. Since
"constituent_tag" is declared before "word", I don't understand why the
latter is used for "S+".

Now, if I explicitly add 'S+' to the "sentence_tags" helper definition,
the problem goes away. Why? It is also worth noting that the 'S&' at the
beginning of the phrase is parsed correctly. Why?
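
One possible explanation, offered as a sketch rather than a confirmed
diagnosis. It rests on two assumptions: that the SableCC lexer prefers the
longest match and uses declaration order only to break ties between matches
of equal length, and that a '+' outside square brackets is read as the
one-or-more operator, so that "coordination_suffix" as written means
'&' | ('+')+ '-' instead of three alternatives. The Python simulation below
is not SableCC: the regexes and the names make_lexer and first_token are
hypothetical stand-ins, and the grammar is reduced to "sentence_tags" and
"word".

```python
import re

# Regex approximations of two token definitions (assumptions, not SableCC output).
TAGS = r"(?:Na|Ti|Vi|S|N|V|R|P)"        # sentence_tags only, longer alternatives first
WORD = r"[^\r\n\t \[\]():;.,_?]+"        # word_character+ ('\' and digits are included)

def make_lexer(suffix):
    """Return token definitions in declaration order for a given suffix regex."""
    return [
        ("constituent_tag", re.compile(TAGS + "(?:" + suffix + ")?")),
        ("word", re.compile(WORD)),
    ]

def first_token(text, tokens):
    """Pick the longest match at the start of text; earlier declaration wins ties."""
    best_name, best_len = None, 0
    for name, pattern in tokens:
        m = pattern.match(text)
        # Strict '>' keeps the earlier-declared token when lengths are equal.
        if m and len(m.group()) > best_len:
            best_name, best_len = name, len(m.group())
    return best_name, text[:best_len]

as_written = make_lexer(r"&|\++-")    # assumed reading: '&' | ('+')+ '-'
alternation = make_lexer(r"[&+-]")    # intended reading: '&' | '+' | '-'

for text in ("S+ and_CC", "S&[N"):
    print(text, first_token(text, as_written), first_token(text, alternation))
```

Under these assumptions, "S+" comes out as a word with the suffix as
written, because "word" matches two characters while "constituent_tag"
matches only "S". "S&" ties at two characters, so the earlier declaration
wins. Adding 'S+' to "sentence_tags" would create the same kind of tie,
which would be consistent with what I observe.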

I'm trying to understand what's going on here in order to find a workaround.

Cheers,

Sebastian


_______________________________________________
SableCC-Discussion mailing list
http://lists.sablecc.org/listinfo/sablecc-discussion
