Re: [PROPOSAL] New OOXML import framework
On 20.05.2014 23:38, Andrea Pescetti wrote:
> On 19/05/2014 Andre Fischer wrote:
>> As one of the first tasks in the OOXML area I would like to propose to redesign and re-implement the OOXML parser.
>
> I can only agree with this one. We've already discussed it many times, but even the many users who prefer ODF need good support for OOXML for interoperability, and better support for the Microsoft Office native formats is consistently in the top requests.
>
>> I propose a new and unified approach that will essentially replace the current design and implementation.
>
> Sounds good. Especially the idea to be able to automatically know how much of the specification is covered will be helpful.
>
>> I also propose to focus first on Impress. Its complexity regarding OOXML is less than that of Writer and Calc
>
> And this is probably good for users too. In my experience, the import of .PPTX files is the most unsatisfactory one at the moment, with many obvious deficiencies. Improving this one first would already give good results for users.
>
>> I have made several experiments regarding the reading of the specification and generation of parsers and am confident that the outlined approach will work.
>
> A not-so-original question: we have another Apache project, POI, http://poi.apache.org/ that among other things has an OOXML parser. If we are starting from scratch, why not reuse their code? And, if there are reasons for not reusing it, could we validate this roadmap with the POI developers, who are probably more familiar with OOXML parsing than the average reader of this list?

First, we are not really starting from scratch. There are several components to importing OOXML files. Two important ones are the parser that reads (OO)XML streams and turns them into events for start tags, end tags, text, etc., and the callbacks that are called for each of these events. This second part is the larger and more important part.
I want to replace the parser but would like to migrate as much as possible of the second part, the callbacks.

Most of the work in the OOXML import/export project, however, will be spent in other areas:

- Implementing features that exist in MS Office but not in OpenOffice. Examples are SmartArt shapes (for all applications).
- Improving features in OpenOffice that are not working as well as they should/could. Examples are pivot tables in Calc or the slide show in Impress.
- Supporting existing features in OpenOffice that are just not handled by the OOXML importer.

Regarding POI, there are several reasons not to use it:

- As said above, the existing import code is to be migrated to the new framework. The new framework should offer an interface that supports this migration.
- POI is implemented in Java.
- As far as I understand POI (I don't find its documentation very helpful), it is more like a DOM tree with better access to its nodes than a streaming parser. That would result in lower execution speed and larger memory consumption.
- OOXML / MS Office is supported up to 2007. That seems like an undesirable restriction.
- The original naming (see http://en.wikipedia.org/wiki/Apache_POI) does not imply professional development of the POI project.

Regards,
Andre

> Regards,
> Andrea.

- To unsubscribe, e-mail: dev-unsubscr...@openoffice.apache.org For additional commands, e-mail: dev-h...@openoffice.apache.org
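For readers unfamiliar with the split between the event-producing parser and the callbacks discussed in the message above, here is a minimal sketch in Python using the standard `xml.sax` push parser. It is only an illustration, not the OpenOffice code: the element names `slide` and `shape`, the `ShapeHandler` class, and the `parse_shapes` helper are all hypothetical. The sketch shows the parser calling back for start tags, text, and end tags, with a hand-maintained context stack that keeps the start-tag attributes available until the end tag arrives.

```python
import xml.sax


class ShapeHandler(xml.sax.ContentHandler):
    """Hand-written callbacks for a push parser (hypothetical example)."""

    def __init__(self):
        super().__init__()
        self.contexts = []  # one context per currently open element
        self.results = []   # (id attribute, text) pairs collected at end tags

    def startElement(self, name, attrs):
        # Attributes are only handed to the start-tag callback, so they are
        # copied into a context that later callbacks can still reach.
        self.contexts.append(
            {"name": name, "attrs": dict(attrs.items()), "text": []}
        )

    def characters(self, content):
        # Text may arrive in several chunks; collect it in the open context.
        if self.contexts:
            self.contexts[-1]["text"].append(content)

    def endElement(self, name):
        context = self.contexts.pop()
        if name == "shape":
            text = "".join(context["text"]).strip()
            self.results.append((context["attrs"].get("id"), text))


def parse_shapes(xml_text):
    """Stream through xml_text and return the collected (id, text) pairs."""
    handler = ShapeHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.results
```

Nothing except the stack of open elements is kept in memory while the stream is read, which is the property that distinguishes this style from building a DOM tree first.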
Re: [PROPOSAL] New OOXML import framework
On 19/05/2014 Andre Fischer wrote:
> As one of the first tasks in the OOXML area I would like to propose to redesign and re-implement the OOXML parser.

I can only agree with this one. We've already discussed it many times, but even the many users who prefer ODF need good support for OOXML for interoperability, and better support for the Microsoft Office native formats is consistently in the top requests.

> I propose a new and unified approach that will essentially replace the current design and implementation.

Sounds good. Especially the idea to be able to automatically know how much of the specification is covered will be helpful.

> I also propose to focus first on Impress. Its complexity regarding OOXML is less than that of Writer and Calc

And this is probably good for users too. In my experience, the import of .PPTX files is the most unsatisfactory one at the moment, with many obvious deficiencies. Improving this one first would already give good results for users.

> I have made several experiments regarding the reading of the specification and generation of parsers and am confident that the outlined approach will work.

A not-so-original question: we have another Apache project, POI, http://poi.apache.org/ that among other things has an OOXML parser. If we are starting from scratch, why not reuse their code? And, if there are reasons for not reusing it, could we validate this roadmap with the POI developers, who are probably more familiar with OOXML parsing than the average reader of this list?

Regards,
Andrea.
Re: [PROPOSAL] New OOXML import framework
On 05/19/2014 11:57 PM, Andre Fischer wrote:
> On 20.05.2014 00:28, Kay Schenk wrote:
>> [top posting for a moment]
>>
>> Thank you for this initial introduction to planning better support for OOXML. The reality is this is necessary, and I would imagine most involved in the project realize this. OK, just a bit more below.
>>
>> On 05/19/2014 06:39 AM, Andre Fischer wrote:
>>> The compile time part of the framework is to be implemented in Java to allow an efficient and fast development process.
>> Does this basically mean that we will need to use both Java and C++ for future builds?
>
> We already need Java and C++ for builds. This does not change.
>
> -Andre

OK, right.

--
MzK

"Life is either a daring adventure, or nothing."
 -- Helen Keller
Re: [PROPOSAL] New OOXML import framework
On 20.05.2014 00:28, Kay Schenk wrote:
> [top posting for a moment]
>
> Thank you for this initial introduction to planning better support for OOXML. The reality is this is necessary, and I would imagine most involved in the project realize this. OK, just a bit more below.
>
> On 05/19/2014 06:39 AM, Andre Fischer wrote:
>> The compile time part of the framework is to be implemented in Java to allow an efficient and fast development process.
> Does this basically mean that we will need to use both Java and C++ for future builds?

We already need Java and C++ for builds. This does not change.

-Andre
Re: [PROPOSAL] New OOXML import framework
[top posting for a moment]

Thank you for this initial introduction to planning better support for OOXML. The reality is this is necessary, and I would imagine most involved in the project realize this. OK, just a bit more below.

On 05/19/2014 06:39 AM, Andre Fischer wrote:
> As one of the first tasks in the OOXML area I would like to propose to redesign and re-implement the OOXML parser.
>
> At the moment each application has its own OOXML import design. Those of Impress and Calc are basically classic handwritten push parser designs while that of Writer is semi-automatically derived from the WordprocessingML specification. For all three designs there is hardly any documentation and their implementation is hard to understand and hard to maintain. All that means that you have to work hard to obtain a working knowledge about the OOXML parser for one application and then, once you have it, cannot transfer it to the other applications.
>
> I propose a new and unified approach that will essentially replace the current design and implementation. Using the same framework in all applications has several advantages:
>
> - You only have to learn how to use one well-documented framework instead of three different and badly documented XML import techniques.
>
> - It exploits the information given by the OOXML schema to produce automatically some of the code that has to be handwritten today.
>
> - It allows automatic analysis of the coverage of the OOXML specification so that we can easily see which parts have already been implemented and which are still missing.
>
> - It will be much more easily understandable than the current OOXML import (especially that of Writer).
>
> The one big downside is that the new design requires basically a reimplementation of the OOXML import. But anyone who has seen the current implementation might not see that as a downside at all :-)
>
>
> Development and migration
>
> I propose to do the implementation in a new module (possibly called main/ooxml/) with the goal to eventually (i.e. in a couple of releases) replace main/oox/ and other places that contain OOXML import code. It will not be active by default until everyone agrees that it is release ready. Of course, there will be switches to easily (but not accidentally) activate it for development builds.
>
> I also propose to focus first on Impress. Its complexity regarding OOXML is less than that of Writer and Calc and the still existing expertise in this area of OpenOffice is probably larger than in Writer and definitely larger than in Calc.
>
> Development will start with implementation of the new framework that is hinted at above and explained in more detail below. Then the existing Impress import is migrated to the new design by copying and adapting the code. The existing import in main/oox/ remains unchanged.
>
>
> The new framework
>
> The design of the new framework is based on exploiting the OOXML specification (plural because there are different versions, migration addendums and MS Office specific extensions). A parser generator reads the specs and creates the actual OOXML parser from that. The generated parser will basically be a (nested) stack automaton where each state corresponds roughly to a complex type as defined by the spec. Transitions from one state to another correspond to start and end tags that move from one complex type to another.
>
> The actions that are executed on transitions and which do the actual import work still have to be provided manually. With an intermediate DSL (domain specific language) that represents the interface between OOXML parser and developer, even this step will become easier and more robust.
>
> The use of an intermediate DSL also allows tweaking of the rules derived from the OOXML specification should the need arise (e.g. to cope with OOXML files that are not 100% conformant to the specs).
>
> The compile time part of the framework is to be implemented in Java to allow an efficient and fast development process.

Does this basically mean that we will need to use both Java and C++ for future builds?

> The runtime part of the framework, including the generated parser will be implemented in C++ and be an integral part of OpenOffice.
>
>
> Details
>
> At the moment we are using a bare bones XML push parser for reading OOXML files. That means that as the XML parser reads the stream of XML elements it asks the OOXML import code to handle start tags, end tags, and the text in between. It is the task of these callbacks to provide so-called contexts for each element. These contexts can then be used to make information like attribute values (which the parser only provides to start tags) accessible to the callbacks of text and end tags.
> The creation of contexts and persistence of intermediate data is done manually in the existing import cod
[PROPOSAL] New OOXML import framework
As one of the first tasks in the OOXML area I would like to propose to redesign and re-implement the OOXML parser.

At the moment each application has its own OOXML import design. Those of Impress and Calc are basically classic handwritten push parser designs while that of Writer is semi-automatically derived from the WordprocessingML specification. For all three designs there is hardly any documentation and their implementation is hard to understand and hard to maintain. All that means that you have to work hard to obtain a working knowledge about the OOXML parser for one application and then, once you have it, cannot transfer it to the other applications.

I propose a new and unified approach that will essentially replace the current design and implementation. Using the same framework in all applications has several advantages:

- You only have to learn how to use one well-documented framework instead of three different and badly documented XML import techniques.

- It exploits the information given by the OOXML schema to produce automatically some of the code that has to be handwritten today.

- It allows automatic analysis of the coverage of the OOXML specification so that we can easily see which parts have already been implemented and which are still missing.

- It will be much more easily understandable than the current OOXML import (especially that of Writer).

The one big downside is that the new design requires basically a reimplementation of the OOXML import. But anyone who has seen the current implementation might not see that as a downside at all :-)


Development and migration

I propose to do the implementation in a new module (possibly called main/ooxml/) with the goal to eventually (i.e. in a couple of releases) replace main/oox/ and other places that contain OOXML import code. It will not be active by default until everyone agrees that it is release ready. Of course, there will be switches to easily (but not accidentally) activate it for development builds.

I also propose to focus first on Impress. Its complexity regarding OOXML is less than that of Writer and Calc and the still existing expertise in this area of OpenOffice is probably larger than in Writer and definitely larger than in Calc.

Development will start with implementation of the new framework that is hinted at above and explained in more detail below. Then the existing Impress import is migrated to the new design by copying and adapting the code. The existing import in main/oox/ remains unchanged.


The new framework

The design of the new framework is based on exploiting the OOXML specification (plural because there are different versions, migration addendums and MS Office specific extensions). A parser generator reads the specs and creates the actual OOXML parser from that. The generated parser will basically be a (nested) stack automaton where each state corresponds roughly to a complex type as defined by the spec. Transitions from one state to another correspond to start and end tags that move from one complex type to another.

The actions that are executed on transitions and which do the actual import work still have to be provided manually. With an intermediate DSL (domain specific language) that represents the interface between OOXML parser and developer, even this step will become easier and more robust.

The use of an intermediate DSL also allows tweaking of the rules derived from the OOXML specification should the need arise (e.g. to cope with OOXML files that are not 100% conformant to the specs).

The compile time part of the framework is to be implemented in Java to allow an efficient and fast development process. The runtime part of the framework, including the generated parser will be implemented in C++ and be an integral part of OpenOffice.


Details

At the moment we are using a bare bones XML push parser for reading OOXML files. That means that as the XML parser reads the stream of XML elements it asks the OOXML import code to handle start tags, end tags, and the text in between. It is the task of these callbacks to provide so-called contexts for each element. These contexts can then be used to make information like attribute values (which the parser only provides to start tags) accessible to the callbacks of text and end tags.

The creation of contexts and persistence of intermediate data is done manually in the existing import code. The new import framework, however, will create it automatically, based on the OOXML specifications, and semi-automatically, based on DSL requests. The automatic part is extracted from the specs and is responsible for preprocessing attribute values (e.g. conversion from string to boolean, integer, float/double or enumerations). The semi-automatic part is driven by developer supplied information in DSL files and defines the subset of attributes that are really evaluated by the impor
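
The stack automaton and the automatic attribute preprocessing described in the proposal can be sketched roughly as follows. This is a minimal illustration in Python for brevity, not the proposed design (which targets Java at compile time and C++ at runtime). The tables `TRANSITIONS` and `ATTRIBUTE_TYPES` are written by hand here as a stand-in for what a parser generator might derive from the schema, and the state names `CT_Slide` and `CT_Shape`, the element names, and the `run_import` helper are all hypothetical.

```python
import xml.sax

# Assumed stand-in for generated tables: state -> {child tag -> next state}.
TRANSITIONS = {
    "root": {"slide": "CT_Slide"},
    "CT_Slide": {"shape": "CT_Shape"},
    "CT_Shape": {},
}

# Assumed stand-in for schema-derived attribute types, per state.
ATTRIBUTE_TYPES = {
    "CT_Shape": {"id": int, "hidden": lambda s: s in ("1", "true")},
}


class TableDrivenParser(xml.sax.ContentHandler):
    """Stack automaton: each open element pushes one state (or None)."""

    def __init__(self, actions):
        super().__init__()
        self.actions = actions  # state name -> callable(typed attributes)
        self.stack = ["root"]

    def startElement(self, name, attrs):
        current = self.stack[-1]
        # Elements without a transition push None, so their whole subtree
        # is skipped instead of aborting the import.
        next_state = TRANSITIONS.get(current, {}).get(name)
        self.stack.append(next_state)
        if next_state is None:
            return
        # Automatic part: convert attribute strings to typed values.
        converters = ATTRIBUTE_TYPES.get(next_state, {})
        typed = {
            key: converters[key](value)
            for key, value in attrs.items()
            if key in converters
        }
        # Manual part: the developer-supplied action for this state.
        action = self.actions.get(next_state)
        if action is not None:
            action(typed)

    def endElement(self, name):
        self.stack.pop()


def run_import(xml_text, actions):
    """Parse xml_text, invoking the supplied actions on state entry."""
    xml.sax.parseString(xml_text.encode("utf-8"), TableDrivenParser(actions))
```

For example, `run_import('<slide><shape id="7" hidden="true"/></slide>', {"CT_Shape": print})` hands the action a dictionary with an integer `id` and a boolean `hidden` instead of raw strings. Because the transitions live in a table, checking which states have an action registered is a simple lookup, which is one way the coverage analysis mentioned in the proposal could be approached.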