On Tue, Jun 10, 2014 at 2:58 AM, Andre Fischer <awf....@gmail.com> wrote:
> Another update of my progress. > > I can now create a validating parser, i.e. one that checks that a document > conforms to the specs while it parses its content. > At the moment the validation is restricted to complex types (as opposed to > simple types and attributes) but I think that is the hardest part. > > One NFA (non-deterministic finite automaton) is created for each complex > type and one for the top level elements. The NFAs are then converted into > equivalent DFAs (deterministic FAs) and finally minimized (via the Hopcroft > algorithm). The minimization step became necessary when I added support > for the 'all' schema element which states that its children each occur once > in arbitrary order. Recognizing this with an FA leads to enumerate all > permutations of the children. With n children there are n! permutations. > Luckily the 'all' element is used only once and then only for 7 children > (7! = 5040). > > Here are some numbers: > The 1st and 4th edition of the ECMA-376 specification (1st edition is what > is used by MS Office, 4th edition is equivalent to the ISO standard) have > 40 schema files. > These contain 1917 complex types and 781 simple types. > Used are 1851 complex types and 727 simple types (have to check if there > are really unused complex types or if my optimization is broken). > > The non-validating parser has 1853 states and 6987 transitions. > > The validating parser has 129530 states and 43512 transitions after > creating the NFAs. > After conversion to DFAs there remain 20999 states and 73772 transitions. > After minimization there are 6097 states and 34286 transitions. > > Please note that the time for parsing OOXML documents does not depend on > the number of states or transitions. It only depends on the length of the > input. The number of states and transitions only make the parser bigger. > > Progress and commits are tracked in issue 125035. Thanks for the update -- very impressive! > > > Best regards, > Andre > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@openoffice.apache.org > For additional commands, e-mail: dev-h...@openoffice.apache.org > > -- ------------------------------------------------------------------------------------------------- MzK "In the midst of winter, I found there was, within me, an invincible summer." -- Albert Camus