Hi,
I'm trying to implement a simple parser for a treebank where a phrase
looks like this:
A01 3
[S&[N \0Mr_NPT Michael_NP Foot_NP N][V has_HVZ put_VBN V][R down_RP R][N
a_AT resolution_NN [P on_IN [N the_ATI subject_NN N]P]N][S+ and_CC [Na
he_PP3A Na][V is_BEZ V][Ti[Vi to_TO be_BE backed_VBN Vi][P by_IN [N
\0Mr_NPT Will_NP Griffiths_NP ,_, [N \0MP_NPT [P for_IN [N Manchester_NP
Exchange_NP N]P]N]N]P]Ti]S+] ._. S&]
Here's a simplified version of the grammar I developed for this treebank:
Helpers
digit = ['0' .. '9'];
single_quote = ''';
category_code = ['A' .. 'H'] | ['J' .. 'N'] | 'P' | 'R';
coordination_suffix = '&' | '+' + '-';
word_tags = 'NPT' | 'NP' | 'HVZ' | 'VBN' | 'RP' | 'AT' | 'NN' | 'IN' |
'ATI' | 'CC' | 'PP3A' | 'BEZ' | 'TO' | 'BE';
idiom_tags = word_tags (digit digit)?;
sentence_tags = 'S' | 'N' | 'V' | 'R' | 'P' | 'Na' | 'Ti' | 'Vi';
escape_sequence = '\' digit?;
cr = 13;
lf = 10;
tab = 9;
ascii_character = [0 .. 0xFF];
/* Is there a nicer way to do this? */
not_newline = [ascii_character - [cr + lf]];
not_blank = [not_newline - [tab + ' ']];
not_bracket = [not_blank - ['[' + ']']];
not_parenth = [not_bracket - ['(' + ')']];
not_colons = [not_parenth - [':' + ';']];
not_periods = [not_colons - ['.' + ',']];
word_character = [not_periods - ['_' + '?']];
Tokens
category_label = category_code digit digit;
phrase_identifier = digit+;
l_bracket = '[';
r_bracket = ']';
punctuation_tag = '(' | ')' | '*' single_quote | '**' single_quote | '*-'
| ',' | '.' | '...' | ':' | ';' | '?';
constituent_tag = (idiom_tags | sentence_tags) coordination_suffix?;
word = escape_sequence? word_character+;
word_separator = '_';
blank = (cr | lf | tab | ' ')+;
Ignored Tokens
blank;
Productions
parsed_corpus = parsed_phrase+;
parsed_phrase = category_label phrase_identifier phrase_contents;
phrase_contents = [left]:punctuation? phrase [right]:punctuation?;
phrase = l_bracket [left]:phrase_tag compound_phrase+ [right]:phrase_tag
r_bracket;
compound_phrase =
{punct_phrase} punctuation |
{tword_phrase} single_word |
{recur_phrase} phrase;
phrase_tag =
{single_tag} constituent_tag;
single_word = word word_separator [word_tag]:constituent_tag;
punctuation = [left]:punctuation_tag word_separator [right]:punctuation_tag;
The problem is that, at some point, a tag token is identified as a word
token by the lexer, generating a parser exception. More specifically, "S+"
is identified as a word in the string "[S+ ... S+]" while it should be
identified as a tag.
As far as I understand, token priority is given by declaration order. Since
"constituent_tag" is declared before "word", I don't understand why the
latter is used for "S+".
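To make the rule I'm assuming explicit, here is a tiny Java simulation. It is only a sketch of my reading of the lexer (the longest match wins, and declaration order breaks ties between equally long matches), not something taken from the SableCC sources. The class name, the classify helper and the two reduced token patterns are hypothetical and much simpler than the real grammar above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model of the lexer rule I am assuming: the longest match wins,
// and declaration order only breaks ties between equally long matches.
class LexerPrioritySketch {

    // Hypothetical, heavily reduced token set (not the grammar from this post).
    // Insertion order stands in for declaration order.
    static final Map<String, Pattern> TOKENS = new LinkedHashMap<>();
    static {
        TOKENS.put("constituent_tag", Pattern.compile("(S|N|V)&?"));
        TOKENS.put("word", Pattern.compile("[^\\s\\[\\]_]+"));
    }

    // Returns the name of the token chosen at the start of the input,
    // or null if no token matches there.
    static String classify(String input) {
        String best = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> entry : TOKENS.entrySet()) {
            Matcher m = entry.getValue().matcher(input);
            // lookingAt() anchors the match at the start of the input.
            // The strict '>' keeps the earlier-declared token on a tie.
            if (m.lookingAt() && m.end() > bestLength) {
                best = entry.getKey();
                bestLength = m.end();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(classify("S& ..."));   // both match 2 chars: constituent_tag
        System.out.println(classify("Foot_NP"));  // only word matches: word
        System.out.println(classify("Nothing"));  // word matches more chars: word
    }
}
```

Under this model, a tag only beats "word" when it matches at least as many characters as "word" does at the same position.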
Now, if I explicitly add 'S+' to the "sentence_tags" helper definition,
the problem goes away and everything parses fine. Why? It is also worth noting
that the 'S&' at the beginning of the phrase is parsed correctly.
Why?
I'm trying to understand what's going on here in order to find a workaround.
Cheers,
Sebastian