[PROPOSAL] New OOXML import framework

Andre Fischer Mon, 19 May 2014 06:40:28 -0700

As one of the first tasks in the OOXML area I would like to propose to redesign and re-implement the OOXML parser.

At the moment each application has its own OOXML import design. Those of Impress and Calc are basically classic hand-written push parser designs, while that of Writer is semi-automatically derived from the WordprocessingML specification. For all three designs there is hardly any documentation, and their implementations are hard to understand and hard to maintain. All that means that you have to work hard to obtain a working knowledge of the OOXML parser for one application and then, once you have it, cannot transfer it to the other applications.

I propose a new and unified approach that will essentially replace the current design and implementation. Using the same framework in all applications has several advantages:

- You only have to learn how to use one well-documented framework instead of three different and badly documented XML import techniques.

- It exploits the information given by the OOXML schema to automatically produce some of the code that has to be hand-written today.

- It allows automatic analysis of the coverage of the OOXML specification so that we can easily see which parts have already been implemented and which are still missing.

- It will be much easier to understand than the current OOXML import (especially that of Writer).

The one big downside is that the new design basically requires a reimplementation of the OOXML import. But anyone who has seen the current implementation might not see that as a downside at all :-)




Development and migration

I propose to do the implementation in a new module (possibly called main/ooxml/) with the goal to eventually (i.e. in a couple of releases) replace main/oox/ and other places that contain OOXML import code. It will not be active by default until everyone agrees that it is release-ready. Of course, there will be switches to easily (but not accidentally) activate it for development builds.

I also propose to focus first on Impress. Its complexity regarding OOXML is less than that of Writer and Calc, and the still existing expertise in this area of OpenOffice is probably larger than in Writer and definitely larger than in Calc.

Development will start with the implementation of the new framework that is hinted at above and explained in more detail below. Then the existing Impress import will be migrated to the new design by copying and adapting the code. The existing import in main/oox/ remains unchanged.




The new framework

The design of the new framework is based on exploiting the OOXML specifications (plural because there are different versions, migration addendums and MS Office specific extensions). A parser generator reads the specs and creates the actual OOXML parser from them. The generated parser will basically be a (nested) stack automaton where each state corresponds roughly to a complex type as defined by the spec. Transitions from one state to another correspond to start and end tags that move from one complex type to another.
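
To make the stack automaton idea more concrete, here is a minimal hand-written sketch. It is not generated code and not the proposed implementation. The class name StackAutomaton, the State enumeration and the transition table layout are all assumptions made for illustration; a real generator would derive the states and transitions from the schema.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical state ids. In the proposal, one state per complex type
// would be generated from the OOXML schema.
enum class State { Document, CT_Slide, CT_CommonSlideData, CT_GroupShape };

// Illustrative stack automaton: start tags push a state, end tags pop it.
class StackAutomaton {
public:
    StackAutomaton() { states_.push_back(State::Document); }

    // Registers a transition: in state `from`, start tag `element` enters `to`.
    void addTransition(State from, const std::string& element, State to) {
        transitions_[std::make_pair(from, element)] = to;
    }

    // Handles a start tag. Returns false if the element is not allowed
    // in the current state, i.e. no transition is registered for it.
    bool startElement(const std::string& element) {
        const auto it = transitions_.find(std::make_pair(states_.back(), element));
        if (it == transitions_.end())
            return false;
        states_.push_back(it->second);
        return true;
    }

    // Handles an end tag by returning to the enclosing state.
    void endElement() {
        assert(states_.size() > 1 && "end tag without matching start tag");
        states_.pop_back();
    }

    State current() const { return states_.back(); }
    std::size_t depth() const { return states_.size(); }

private:
    std::map<std::pair<State, std::string>, State> transitions_;
    std::vector<State> states_;
};
```

A caller would register transitions such as (Document, "sld") to CT_Slide and then forward the start and end tags reported by the XML parser to startElement() and endElement(). The import actions would hang off these two calls.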

The actions that are executed on transitions, and which do the actual import work, still have to be provided manually. With an intermediate DSL (domain specific language) that represents the interface between OOXML parser and developer, even this step will become easier and more robust.

The use of an intermediate DSL also allows tweaking of the rules derived from the OOXML specification should the need arise (e.g. to cope with OOXML files that are not 100% conformant to the specs).

The compile-time part of the framework is to be implemented in Java to allow an efficient and fast development process. The runtime part of the framework, including the generated parser, will be implemented in C++ and be an integral part of OpenOffice.




Details

At the moment we are using a bare-bones XML push parser for reading OOXML files. That means that as the XML parser reads the stream of XML elements it asks the OOXML import code to handle start tags, end tags, and the text in between. It is the task of these callbacks to provide so-called contexts for each element. These contexts can then be used to make information like attribute values (which the parser only provides to start tags) accessible to the callbacks of text and end tags.

The creation of contexts and persistence of intermediate data is done manually in the existing import code. The new import framework, however, will create them automatically based on the OOXML specifications, and semi-automatically based on DSL requests. The automatic part is extracted from the specs and is responsible for preprocessing attribute values (e.g. conversion from string to boolean, integer, float/double or enumerations). The semi-automatic part is driven by developer-supplied information in DSL files and defines the subset of attributes that are really evaluated by the import code.
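
As a rough illustration of the attribute preprocessing step, the following sketch converts raw attribute strings into typed values. The function names parseBool, parseInteger and parseDirection and the Direction enumeration are hypothetical and only stand in for whatever the generator would emit. The sketch assumes C++17 for std::optional.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <optional>
#include <string>

// xsd:boolean accepts the lexical forms "true", "false", "1" and "0".
std::optional<bool> parseBool(const std::string& text) {
    if (text == "true" || text == "1") return true;
    if (text == "false" || text == "0") return false;
    return std::nullopt;  // not a valid boolean
}

// Converts a decimal integer; rejects empty input, trailing garbage
// and values that do not fit into a long.
std::optional<long> parseInteger(const std::string& text) {
    if (text.empty()) return std::nullopt;
    char* end = nullptr;
    errno = 0;
    const long value = std::strtol(text.c_str(), &end, 10);
    if (errno != 0 || *end != '\0') return std::nullopt;
    return value;
}

// Hypothetical enumeration, standing in for a simple type from the schema.
enum class Direction { Horizontal, Vertical };

// Maps the assumed string tokens of the enumeration to its values.
std::optional<Direction> parseDirection(const std::string& text) {
    if (text == "horz") return Direction::Horizontal;
    if (text == "vert") return Direction::Vertical;
    return std::nullopt;  // unknown token
}
```

Returning an empty optional instead of throwing leaves it to the caller to decide whether a malformed attribute falls back to the schema default or is reported as an error.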


An example of a DSL file snippet could look like this:

DefineContext(p:CT_Slide, p_CT_Slide_context, attribute bool show, attribute bool showMasterSp, int nSlideCounter);

ProcessTypeStart(p:CT_Slide, p_CT_Slide_context aContext)
{
    // C++ code to import a single slide
    if (aContext.show)
       <do-something>
    ++aContext.nSlideCounter;
}
ProcessTypeEnd(p:CT_Slide, p_CT_Slide_context aContext)
{
    cout << aContext.nSlideCounter << endl;
}

It centers on the CT_Slide complex type that is started by the top level 'sld' element in namespace http://schemas.openxmlformats.org/presentationml/2006/main, which is typically abbreviated as 'p'. It defines a context class p_CT_Slide_context that contains two attributes, show and showMasterSp, and an additional variable nSlideCounter. The attributes are filled automatically with values when the 'sld' start tag is seen. Two code snippets are defined to handle the 'sld' start and end tags. Both are provided with an object of type p_CT_Slide_context and can read and write its values.
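
One possible shape of the C++ code that could be generated from this snippet is sketched below. Everything beyond the names taken from the DSL example is an assumption: the AttributeMap type, the fillFromAttributes() member, the handler signatures, and the default value of true for both attributes. Since the DSL syntax is not yet fixed, the real output may look quite different.

```cpp
#include <cassert>
#include <map>
#include <ostream>
#include <sstream>
#include <string>

// Assumed representation of the attributes that the XML parser hands
// to the start tag callback.
using AttributeMap = std::map<std::string, std::string>;

// Context class as requested by the DefineContext statement.
struct p_CT_Slide_context {
    bool show = true;          // assumed default when the attribute is absent
    bool showMasterSp = true;  // assumed default when the attribute is absent
    int nSlideCounter = 0;     // developer-defined variable, not an attribute

    // Would be called by the framework when the 'sld' start tag is seen.
    void fillFromAttributes(const AttributeMap& attributes) {
        assignBool(attributes, "show", show);
        assignBool(attributes, "showMasterSp", showMasterSp);
    }

private:
    // Overwrites `target` only if the attribute is present.
    static void assignBool(const AttributeMap& attributes,
                           const std::string& name, bool& target) {
        const auto it = attributes.find(name);
        if (it == attributes.end()) return;
        target = (it->second == "true" || it->second == "1");
    }
};

// Body of ProcessTypeStart: the actual slide import would go here.
void processTypeStart(p_CT_Slide_context& context) {
    ++context.nSlideCounter;
}

// Body of ProcessTypeEnd: writes the counter to the given stream.
void processTypeEnd(const p_CT_Slide_context& context, std::ostream& out) {
    out << context.nSlideCounter << '\n';
}
```

The handlers receive the context by reference, so values written while handling the start tag are still available when the end tag is handled.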

I have made several experiments regarding the reading of the specification and generation of parsers and am confident that the outlined approach will work. The details, like the syntax of the DSL, are not yet fixed.

This may sound like a fixed concept that just needs implementation. It is not. Many details have yet to be figured out. Help on all levels (design, implementation, testing, documentation) is needed and welcome.



Best regards,
Andre
