J.Pietschmann wrote:
Peter B. West wrote:
 > With my naive understanding of parsing as a two-stage process (lexemes
 > -> higher level constructs) I have been curious about earlier comments
 > of yours about multi-stage parsing.  Can ANTLR do this sort of thing?

I'm not quite sure whether you mean by "parsing as a two-stage
process" the same as I do. In language specs, the formal description
is usually divided into a grammar level representing a Chomsky
level 2 context free grammar and a lexical level, described by simple
regular expressions (Chomsky level 3 IIRC). This is done both for
keeping the spec readable and for efficient implementation.

...


This is basically what I meant - I see (and have experienced in FOP) the
difficulty of trying to parse "multiple" grammars out of a single stream
of lexical objects.

 > Given the amount of hacking I had to do to parse everything that could
 > legally be thrown at me, I am very surprised that these are the only
 > issues in HEAD parsing.

Well, one of the problems with the FO spec is that section 5.9
defines a grammar for property expressions, but this doesn't
give the whole picture for all XML attribute values in FO files.
There are also (mostly) whitespace separated lists for shorthands,
and the comma separated font family name list, where
a) whitespace is allowed around the commas and
b) quotes around the names may be omitted basically as long
 as there are no commas or whitespace in the name.
The latter means there may be unquoted sequences of characters
which have to be interpreted as a single token but are not NCNames.
It also means that in the "font" shorthand there may be whitespace
which is not a list element delimiter. I think this is valid:
 font="bold 12pt 'Times Roman' , serif"
and it should be parsed as
 font-weight="bold"
 font-size="12pt"
 font-family="'Times Roman' , serif"
then the font family can be split. This is easy for humans but can
be quite tricky to get right for computers, given that the shorthand
list has a bunch of optional elements. Specifically
 font="bold small-caps italic 12pt/14pt 'Times Roman' , A+B,serif"
should be valid too. At least, the font family is the last entry.
Note that suddenly a slash appears as delimiter between font size
and line height...
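
To make the splitting order concrete, here is a minimal sketch in Python.
It is not FOP code. The function names and the keyword sets are
assumptions made up for illustration, and the keyword sets are
deliberately incomplete. It peels optional keywords off the front, takes
the next token as size[/line-height], treats everything after that as the
font family, and only then splits the family list on commas.

```python
# Illustrative keyword sets; incomplete, and not copied from the spec.
STYLES = {"italic", "oblique"}
VARIANTS = {"small-caps"}
WEIGHTS = {"bold", "bolder", "lighter"} | {str(n) for n in range(100, 1000, 100)}


def split_font_shorthand(value):
    """Split a 'font' shorthand value into component properties (sketch)."""
    props = {}
    rest = value.strip()
    # Consume the optional leading keywords, in any order.
    while rest:
        parts = rest.split(None, 1)
        head = parts[0]
        if head in STYLES:
            props["font-style"] = head
        elif head in VARIANTS:
            props["font-variant"] = head
        elif head in WEIGHTS:
            props["font-weight"] = head
        elif head != "normal":
            break  # first non-keyword token: must be the font size
        rest = parts[1] if len(parts) > 1 else ""
    # The next whitespace-delimited token is size, optionally "/line-height".
    parts = rest.split(None, 1)
    if len(parts) < 2:
        raise ValueError("font shorthand needs a size and a family: %r" % value)
    size, _, line_height = parts[0].partition("/")
    props["font-size"] = size
    if line_height:
        props["line-height"] = line_height
    # Everything left is the family list, embedded whitespace included.
    props["font-family"] = parts[1].strip()
    return props


def split_font_family(value):
    """Split a comma separated family list, dropping surrounding quotes."""
    names = []
    for item in value.split(","):
        name = item.strip()
        if len(name) >= 2 and name[0] == name[-1] and name[0] in "'\"":
            name = name[1:-1]
        names.append(name)
    return names


props = split_font_shorthand(
    "bold small-caps italic 12pt/14pt 'Times Roman' , A+B,serif")
families = split_font_family(props["font-family"])
```

The sketch cheats in several places: it does not handle whitespace around
the slash, a comma inside a quoted family name, or the system font
keywords, and it silently skips "normal" because that keyword does not
say which property it belongs to.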

This usage, AFAICT, is the reason that division is specified by the token 'div'. All a matter of CSS compatibility.

Another set of problems is token typing, the implicit type conversion
and the very implicit type specification for the properties. While
often harmless, it shows itself for the "format" property: the
spec says the expected type is a string, which means it should be
written as format="'01'". Of course, people tend to write
format="01". While the parsed number could be cast back into a
string, unfortunately the leading zero is lost. The errata
amended 5.9 specifically for this use case: in case of an
error, the original string representation of the property value
expression should be used to recover. Which tempts me to use
initial-page-number="auto+1".
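
A rough sketch of that recovery rule, in Python. Both functions are
hypothetical: toy_parse stands in for a real property expression parser
and only knows quoted strings and plain numbers, and the fallback reflects
my reading of the errata as described above, not the errata text itself.

```python
def toy_parse(raw):
    """Stand-in expression parser: quoted string literals and numbers only."""
    text = raw.strip()
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "'\"":
        return text[1:-1]      # string literal, quotes removed
    try:
        return int(text)       # "01" parses as the number 1
    except ValueError:
        return float(text)     # raises ValueError for anything else


def evaluate_string_property(raw, parse_expression):
    """Evaluate a property whose expected type is string (sketch)."""
    try:
        value = parse_expression(raw)
    except ValueError:
        return raw             # not a valid expression: recover with raw text
    if isinstance(value, str):
        return value
    return raw                 # wrong type: raw text keeps the leading zero


quoted = evaluate_string_property("'01'", toy_parse)
unquoted = evaluate_string_property("01", toy_parse)
```

Both calls end up with the string 01. Casting the parsed number back to a
string would have produced 1 for the unquoted form.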

This is one of the disgraces of the spec - this time for compatibility with XSLT usage. XSL-FO just cops it sweet whenever someone else's problem (SEP) extrudes into the XSL namespace.

Another famous case is hyphenation-char="-", which is by no
means a valid property expression. Additionally the restriction
to a string of length 1 (a "char") isn't spelled out explicitly
anywhere.

Peter
-- 
Peter B. West <http://www.powerup.com.au/~pbwest/resume.html>



