Another update of my progress.

I can now create a validating parser, i.e. one that checks that a document conforms to the specs while it parses its content. At the moment the validation is restricted to complex types (as opposed to simple types and attributes) but I think that is the hardest part.

One NFA (non-deterministic finite automaton) is created for each complex type and one for the top level elements. The NFAs are then converted into equivalent DFAs (deterministic FAs) and finally minimized (via the Hopcroft algorithm). The minimization step became necessary when I added support for the 'all' schema element which states that its children each occur once in arbitrary order. Recognizing this with an FA leads to enumerate all permutations of the children. With n children there are n! permutations. Luckily the 'all' element is used only once and then only for 7 children (7! = 5040).

Here are some numbers:
The 1st and 4th edition of the ECMA-376 specification (1st edition is what is used by MS Office, 4th edition is equivalent to the ISO standard) have 40 schema files.
These contain 1917 complex types and 781 simple types.
Used are 1851 complex types and 727 simple types (have to check if there are really unused complex types or if my optimization is broken).

The non-validating parser has 1853 states and 6987 transitions.

The validating parser has 129530 states and 43512 transitions after creating the NFAs.
After conversion to DFAs there remain 20999 states and 73772 transitions.
After minimization there are 6097 states and 34286 transitions.

Please note that the time for parsing OOXML documents does not depend on the number of states or transitions. It only depends on the length of the input. The number of states and transitions only make the parser bigger.

Progress and commits are tracked in issue 125035.

Best regards,
Andre


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@openoffice.apache.org
For additional commands, e-mail: dev-h...@openoffice.apache.org

Reply via email to