As one of the first tasks in the OOXML area, I would like to propose redesigning and re-implementing the OOXML parser.

At the moment each application has its own OOXML import design. Those of Impress and Calc are basically classic hand-written push parser designs, while that of Writer is semi-automatically derived from the WordprocessingML specification. For all three designs there is hardly any documentation, and their implementations are hard to understand and hard to maintain. This means that you have to work hard to obtain a working knowledge of the OOXML parser for one application and then, once you have it, cannot transfer it to the other applications.

I propose a new and unified approach that will essentially replace the current design and implementation. Using the same framework in all applications has several advantages:

- You only have to learn how to use one well-documented framework instead of three different and badly documented XML import techniques.

- It exploits the information given by the OOXML schema to automatically generate some of the code that has to be written by hand today.

- It allows automatic analysis of the coverage of the OOXML specification so that we can easily see which parts have already been implemented and which are still missing.

- It will be much more easily understandable than the current OOXML import (especially that of Writer).

The one big downside is that the new design basically requires a reimplementation of the OOXML import. But everyone who has seen the current implementation might not see that as a downside at all :-)



Development and migration

I propose to do the implementation in a new module (possibly called main/ooxml/) with the goal of eventually (i.e. in a couple of releases) replacing main/oox/ and other places that contain OOXML import code. It will not be active by default until everyone agrees that it is release-ready. Of course, there will be switches to easily (but not accidentally) activate it for development builds.

I also propose to focus first on Impress. Its complexity regarding OOXML is less than that of Writer and Calc and the still existing expertise in this area of OpenOffice is probably larger than in Writer and definitely larger than in Calc.

Development will start with implementation of the new framework that is hinted at above and explained in more detail below. Then the existing Impress import is migrated to the new design by copying and adapting the code. The existing import in main/oox/ remains unchanged.



The new framework

The design of the new framework is based on exploiting the OOXML specifications (plural because there are different versions, migration addendums and MS Office specific extensions). A parser generator reads the specs and creates the actual OOXML parser from them. The generated parser will basically be a (nested) stack automaton where each state corresponds roughly to a complex type as defined by the spec. Transitions from one state to another correspond to start and end tags that move from one complex type to another.
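To make the automaton idea a bit more concrete, here is a minimal C++ sketch. It is not generated code and not a proposed API: the class StackAutomaton, the State enum and the two hand-written transitions are assumptions made only for illustration. The element names 'p:sld' and 'p:cSld' and their complex types CT_Slide and CT_CommonSlideData come from PresentationML; a real generator would derive the full table from the schema.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical state ids; a generator would emit one per complex type.
enum class State { Document, CT_Slide, CT_CommonSlideData, Unknown };

class StackAutomaton
{
public:
    StackAutomaton()
    {
        // Hand-written stand-in for the generated transition table:
        // (current state, element name) -> state of the child's complex type.
        addTransition(State::Document, "p:sld", State::CT_Slide);
        addTransition(State::CT_Slide, "p:cSld", State::CT_CommonSlideData);
        maStack.push_back(State::Document);
    }

    // Start tag: push the state that belongs to the element's complex type.
    void startElement(const std::string& rName)
    {
        const auto it = maTransitions.find(std::make_pair(maStack.back(), rName));
        // Elements missing from the table are tracked as Unknown so that
        // their end tags still pop the matching stack entry.
        maStack.push_back(it != maTransitions.end() ? it->second : State::Unknown);
    }

    // End tag: return to the state of the enclosing element.
    void endElement()
    {
        assert(maStack.size() > 1 && "end tag without matching start tag");
        maStack.pop_back();
    }

    State current() const { return maStack.back(); }
    std::size_t depth() const { return maStack.size() - 1; }

private:
    void addTransition(State eFrom, const std::string& rName, State eTo)
    {
        maTransitions[std::make_pair(eFrom, rName)] = eTo;
    }

    std::map<std::pair<State, std::string>, State> maTransitions;
    std::vector<State> maStack;
};
```

In this sketch the import actions would hang off startElement() and endElement(); pushing Unknown for unexpected elements is one possible way to tolerate files that do not fully conform to the specs.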

The actions that are executed on transitions, and which do the actual import work, still have to be provided manually. With an intermediate DSL (domain specific language) that represents the interface between OOXML parser and developer, even this step will become easier and more robust.

The use of an intermediate DSL also allows tweaking of the rules derived from the OOXML specification should the need arise (e.g. to cope with OOXML files that are not 100% conformant to the specs).

The compile-time part of the framework is to be implemented in Java to allow an efficient and fast development process. The runtime part of the framework, including the generated parser, will be implemented in C++ and be an integral part of OpenOffice.



Details

At the moment we are using a bare-bones XML push parser for reading OOXML files. That means that as the XML parser reads the stream of XML elements, it asks the OOXML import code to handle start tags, end tags, and the text in between. It is the task of these callbacks to provide so-called contexts for each element. These contexts can then be used to make information like attribute values (which the parser only provides to start tags) accessible to the callbacks of text and end tags.

The creation of contexts and the persistence of intermediate data are done manually in the existing import code. The new import framework, however, will create them automatically based on the OOXML specifications and semi-automatically based on DSL requests. The automatic part is extracted from the specs and is responsible for preprocessing attribute values (e.g. conversion from string to boolean, integer, float/double or enumerations). The semi-automatic part is driven by developer-supplied information in DSL files and defines the subset of attributes that are really evaluated by the import code.
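As a rough illustration of the attribute preprocessing mentioned above, the following C++ sketch converts raw attribute strings into typed values. All names in it (AttributeMap, parseBoolean, parseInteger, getBoolAttribute, getIntAttribute) are hypothetical helpers, not part of any existing OpenOffice API. The only external fact it relies on is that xsd:boolean allows the lexical forms "true", "false", "1" and "0". Falling back to a default on malformed input is an assumption of this sketch, not a decided policy.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Raw attributes as delivered with a start tag: name -> unparsed string.
using AttributeMap = std::map<std::string, std::string>;

// xsd:boolean accepts "true", "false", "1" and "0".
// rValue is left untouched when the text is not a valid boolean.
bool parseBoolean(const std::string& rText, bool& rValue)
{
    if (rText == "true" || rText == "1") { rValue = true; return true; }
    if (rText == "false" || rText == "0") { rValue = false; return true; }
    return false;
}

// Accept only strings that the conversion consumes completely.
bool parseInteger(const std::string& rText, int& rValue)
{
    try
    {
        std::size_t nEnd = 0;
        const int nParsed = std::stoi(rText, &nEnd);
        if (nEnd != rText.size())
            return false;
        rValue = nParsed;
        return true;
    }
    catch (const std::logic_error&)
    {
        // std::invalid_argument and std::out_of_range both derive from this.
        return false;
    }
}

// Missing or malformed attributes fall back to the default, which a
// generator would take from the schema.
bool getBoolAttribute(const AttributeMap& rAttributes, const std::string& rName, bool bDefault)
{
    assert(!rName.empty());
    bool bValue = bDefault;
    const auto it = rAttributes.find(rName);
    if (it != rAttributes.end())
        parseBoolean(it->second, bValue);
    return bValue;
}

int getIntAttribute(const AttributeMap& rAttributes, const std::string& rName, int nDefault)
{
    assert(!rName.empty());
    int nValue = nDefault;
    const auto it = rAttributes.find(rName);
    if (it != rAttributes.end())
        parseInteger(it->second, nValue);
    return nValue;
}
```

With helpers of this kind in the generated part, the hand-written import code would only ever see typed values and would not need to repeat string conversions in every callback.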

An example of a DSL file snippet could look like this:

DefineContext(p:CT_Slide, p_CT_Slide_context, attribute bool show, attribute bool showMasterSp, int nSlideCounter);
ProcessTypeStart(p:CT_Slide, p_CT_Slide_context aContext)
{
    // C++ code to import a single slide
    if (aContext.show)
        <do-something>
    ++aContext.nSlideCounter;
}
ProcessTypeEnd(p:CT_Slide, p_CT_Slide_context aContext)
{
    cout << aContext.nSlideCounter << endl;
}


It centers on the CT_Slide complex type that is started by the top-level 'sld' element in namespace http://schemas.openxmlformats.org/presentationml/2006/main, which is typically abbreviated as 'p'. It defines a context class p_CT_Slide_context that contains two attributes, show and showMasterSp, and an additional variable nSlideCounter. The attributes are filled automatically with values when the 'sld' start tag is seen. Two code snippets are defined to handle the 'sld' start and end tags. Both are provided with an object of the p_CT_Slide_context class and can read and write its values.
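For illustration only, this is roughly what the C++ side of the DSL snippet could expand to. Since the DSL syntax and the generator output are not fixed, everything here is an assumption: the function names createSlideContext, processTypeStart and processTypeEnd, the AttributeMap type, and the default of true for both attributes when they are absent. The end handler writes to a caller-supplied stream instead of cout so that the output can be captured.

```cpp
#include <cassert>
#include <map>
#include <ostream>
#include <sstream>
#include <string>

// Raw attributes of the 'sld' start tag: name -> unparsed string.
using AttributeMap = std::map<std::string, std::string>;

// Context class as it might be generated from the DefineContext line.
struct p_CT_Slide_context
{
    bool show = true;          // assumed default when the attribute is absent
    bool showMasterSp = true;  // assumed default when the attribute is absent
    int nSlideCounter = 0;     // plain variable, not filled from the XML
};

// Generated part (hypothetical): fill the declared attributes when the
// 'sld' start tag is seen.
p_CT_Slide_context createSlideContext(const AttributeMap& rAttributes)
{
    p_CT_Slide_context aContext;
    const auto fill = [&rAttributes](const char* pName, bool& rTarget)
    {
        const auto it = rAttributes.find(pName);
        if (it != rAttributes.end())
            rTarget = (it->second == "true" || it->second == "1");
    };
    fill("show", aContext.show);
    fill("showMasterSp", aContext.showMasterSp);
    return aContext;
}

// Developer part: body of ProcessTypeStart from the DSL snippet.
void processTypeStart(p_CT_Slide_context& rContext)
{
    if (rContext.show)
    {
        // the actual slide import would go here
    }
    ++rContext.nSlideCounter;
}

// Developer part: body of ProcessTypeEnd from the DSL snippet.
void processTypeEnd(const p_CT_Slide_context& rContext, std::ostream& rOut)
{
    assert(rContext.nSlideCounter > 0 && "end tag handled before start tag");
    rOut << rContext.nSlideCounter << '\n';
}
```

The split between createSlideContext() and the two process functions mirrors the intended division of labour: the first would come out of the generator, the other two would be copied from the DSL file.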

I have made several experiments regarding the reading of the specification and generation of parsers and am confident that the outlined approach will work. The details, like syntax of the DSL, are not yet fixed.

This may sound like a fixed concept that just needs implementation. It is not. Many details have yet to be figured out. Help on all levels (design, implementation, testing, documentation) is needed and welcome.


Best regards,
Andre
