Questionable whether font-shorthand grammar LL(1)

Jonathan Levinson Sun, 27 Sep 2009 13:49:05 -0700

Hi Vincent,


I dusted off my books on parsing and compiling (also using some
Web-sites to do research) and looked at writing a formal grammar for
font-shorthand.  

 

Because font-variant, font-style, and font-weight can occur in any
order, I could not (so far) come up with a grammar in which the director
sets were disjoint for each non-terminal.  So I was unable to come up
with an LL(1) grammar.

 

For instance, here are two productions of my attempt at a grammar: 

 

<style-variant-weight> -> <variant-weight>

<style-variant-weight> -> <variant-style>

 

The FIRST sets of these two productions of <style-variant-weight> share
common elements, namely the literal values for variant.  One needs to
look ahead one more token to see whether one has a <variant-weight> or a
<variant-style>.

 

According to Gough's "Syntax Analysis and Software Tools" (1988) "For
every production of the augmented grammar we derive a set of possible
1-lookahead symbols, which we call the director set for that production.
If and only if the director sets for different productions of the same
non-terminal are disjoint, i.e. have no common elements, is the grammar
LL(1)."
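To make the director-set test concrete, here is a minimal Python sketch
(not FOP code).  It assumes, as in my attempted grammar, that both
<variant-weight> and <variant-style> begin with a variant token, so the
director set of each production is the set of variant keywords.  The
names DIRECTOR_SETS and is_ll1 are made up for illustration.

```python
# Keyword values of font-variant in CSS2 / XSL-FO.
VARIANT = {"normal", "small-caps"}

# Assumption: both productions start with a variant token, so each one
# is directed by the variant keywords.
DIRECTOR_SETS = {
    "<style-variant-weight> -> <variant-weight>": VARIANT,
    "<style-variant-weight> -> <variant-style>": VARIANT,
}


def is_ll1(director_sets):
    """Return True when the director sets are pairwise disjoint."""
    seen = set()
    for tokens in director_sets.values():
        if seen & tokens:
            # A lookahead token selects more than one production.
            return False
        seen |= tokens
    return True
```

Under that assumption is_ll1(DIRECTOR_SETS) is False, because 'normal'
and 'small-caps' direct both productions.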

 

Also the grammar is ambiguous as we've discussed. 

 

<font> -> <style-variant-weight> <size> [ / <line-height>] <family>

 

If the string starts with 'normal' and then goes on to define <size> and
<family>, one cannot tell whether style, variant, or weight is being
specified.

 

Somehow one needs to special-case 'normal', so that when the string
begins with 'normal' one value is set (say font-weight) and the other
two are not set, which according to the spec means they are reset to
normal as well.
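One possible way to special-case 'normal', sketched in Python under my
own assumptions (this is not what FOP does today): since every property
that is not explicitly set ends up as normal anyway, a leading 'normal'
can simply be consumed without deciding which property it belongs to.
The names DISTINCT and parse_prefix are hypothetical.

```python
# Keyword values other than 'normal'; these sets are pairwise disjoint.
DISTINCT = {
    "font-style": {"italic", "oblique", "backslant"},
    "font-variant": {"small-caps"},
    "font-weight": {"bold", "bolder", "lighter"}
    | {str(n) for n in range(100, 1000, 100)},
}


def parse_prefix(tokens):
    """Consume up to three leading style/variant/weight tokens.

    Returns the three property values and the remaining tokens
    (size, optional line-height, family).
    """
    result = {prop: "normal" for prop in DISTINCT}
    explicit = set()
    consumed = 0
    for tok in tokens[:3]:
        if tok == "normal":
            # Special case: no need to decide which property this is.
            consumed += 1
            continue
        for prop, values in DISTINCT.items():
            if tok in values and prop not in explicit:
                result[prop] = tok
                explicit.add(prop)
                consumed += 1
                break
        else:
            # Not a style/variant/weight keyword: the prefix has ended.
            break
    return result, tokens[consumed:]
```

For example, parse_prefix(["normal", "12pt", "serif"]) leaves all three
properties at normal and hands ["12pt", "serif"] to the rest of the
parse.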

 

The books and web articles I read only discussed using recursive descent
when the grammar is LL(1).  I have the feeling that, despite the
ambiguities in the grammar, it is almost LL(k), because font-variant,
font-style, and font-weight have almost disjoint values.  It is at least
LL(3) and I suspect it is LL(6).

 

Given your greater knowledge of parsing, do you know if an LL(k) parser
can always be implemented as recursive descent if one looks k tokens
ahead in one's parsing routine?

 

I also noticed that the fact that spaces separate the tokens must be an
important part of any solution to the problem, and that the
font-shorthand is more easily parsed (by any software) from right to
left than from left to right.  This is because font-family is not
nullable and, in a right-to-left parse, is the first element
encountered.  A non-terminal symbol is nullable if the empty string can
be validly derived from it under the grammar.

 

I'm not as convinced as you are that recursive descent parsing or a
formal bottom-up parser will make the code simpler rather than more
complex, because of the complexities of a formal grammar.  Of course,
however complex the grammar, a parser generator like ANTLR will generate
code, however complex, that faithfully reflects the input grammar.
However, none of the other properties in FOP use a parser generator like
ANTLR, and I'm not sure what the consequences would be to FOP of
introducing such a tool.  Given the complexities of the grammar, I'm
sure that a hand-written recursive descent parser will be quite complex,
and if we are going to use a grammar-driven approach we would be better
off with a tool that generates parsers from grammars.  Another advantage
of parser generators is that one doesn't have to rewrite so much code to
correct a mistake in one's grammar, or when the grammar changes.
Recursive descent parsing can pose its own maintenance nightmares.

 

The current approach in FOP for font-shorthand is obscurely written but
strikes me as basically sound.

 

1)      One parses from right to left, using the fact that spaces divide
tokens.

2)      One lets property makers determine whether they apply to a
token.  Each property maker is a little parser of the token one feeds
it.  Because the property makers determine whether they apply to a
token, one can handle the fact that variant, weight, and style can occur
in any order by feeding the current token to each of the property makers
for font-variant, font-weight, and font-style in turn.  Whatever they
accept is ipso facto a font-variant, a font-weight, or a font-style.
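The two steps above can be sketched roughly as follows.  This is an
illustrative Python sketch, not FOP's actual Java code: the "makers"
here are plain functions, the family is assumed to be a single token
without spaces, and all names (keyword_maker, MAKERS,
parse_font_shorthand) are invented for the example.

```python
def keyword_maker(values):
    """Build a tiny 'property maker' that accepts or rejects one token."""
    def accept(token):
        return token if token in values else None
    return accept


MAKERS = {
    "font-variant": keyword_maker({"normal", "small-caps"}),
    "font-weight": keyword_maker(
        {"normal", "bold", "bolder", "lighter"}
        | {str(n) for n in range(100, 1000, 100)}
    ),
    "font-style": keyword_maker({"normal", "italic", "oblique", "backslant"}),
}


def parse_font_shorthand(text):
    tokens = text.split()  # step 1: spaces divide tokens
    if len(tokens) < 2:
        raise ValueError("font shorthand needs at least a size and a family")
    props = {name: "normal" for name in MAKERS}
    # Right to left: the family comes first, then size[/line-height].
    props["font-family"] = tokens.pop()
    size, _, line_height = tokens.pop().partition("/")
    props["font-size"] = size
    props["line-height"] = line_height or "normal"
    assigned = set()
    for token in reversed(tokens):
        # Step 2: offer the token to each maker that has not yet fired.
        for name, maker in MAKERS.items():
            if name in assigned:
                continue
            value = maker(token)
            if value is not None:
                props[name] = value
                assigned.add(name)
                break
        else:
            raise ValueError("unrecognised token: " + token)
    return props
```

With this sketch, "bold italic 10pt Arial" and "italic bold 10pt Arial"
produce the same result, since the order of the leading tokens does not
matter to the makers.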

 

Just want to let you know I take the problem seriously, and I'm not
trying to duck the responsibility of coming up with an adequate
solution.  I'm not sure that what I did fits into a "job priority",
which is why I spent many hours this weekend on this research.

 

You are free to disagree with my observations and I notice that on
fop-dev forums you challenge us to make the code simpler, more reusable,
and better structured.

 

Best Regards,

Jonathan S. Levinson

Senior Software Developer

Object Group

InterSystems

617-621-0600
