Re: [PROPOSAL] New OOXML import framework

2014-05-21 Thread Andre Fischer

On 20.05.2014 23:38, Andrea Pescetti wrote:

On 19/05/2014 Andre Fischer wrote:

As one of the first tasks in the OOXML area I would like to propose to
redesign and re-implement the OOXML parser.


I can only agree with this one. We've already discussed it many times, 
but even the many users who prefer ODF need good support for OOXML 
for interoperability, and better support for the Microsoft Office 
native formats is consistently among the top requests.



I propose a new and unified approach that will essentially replace the
current design and implementation.


Sounds good. Especially the idea to be able to automatically know how 
much of the specification is covered will be helpful.



I also propose to focus first on Impress. Its complexity regarding
OOXML is less than that of Writer and Calc


And this is probably good for users too. In my experience, the import 
of .PPTX files is the most unsatisfactory one at the moment, with many 
obvious deficiencies. Improving this one first would already give good 
results for users.



I have made several experiments regarding the reading of the
specification and generation of parsers and am confident that the
outlined approach will work.


A not-so-original question: we have another Apache project, POI, 
http://poi.apache.org/ that, among other things, has an OOXML 
parser. If we are starting from scratch, why not reuse their code? 
And, if there are reasons for not reusing it, could we validate this 
roadmap with the POI developers, who are probably more familiar with 
OOXML parsing than the average reader of this list?


First, we are not really starting from scratch.  There are several 
components to importing OOXML files.  Two are important.  The first is 
the parser that reads (OO)XML streams and turns them into events for 
start tags, end tags, text, etc.  The second is the set of callbacks that 
are called for each of these events.  This second part is the larger and 
more important one.  I want to replace the parser but would like to 
migrate as much of the callback code as possible.
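The split between parser and callbacks described above can be sketched with Python's standard xml.sax module. This is only an illustration of the push-parser idea, not the OpenOffice code: the class name, the sample markup and the element names are made up for the example, and real OOXML streams use namespaces that are ignored here.

```python
import xml.sax


class RecordingCallbacks(xml.sax.ContentHandler):
    """The callback side: the parser pushes events, we react to them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # Attribute values are only handed over together with the start tag.
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Text may arrive in several chunks; whitespace-only chunks are skipped.
        if content.strip():
            self.events.append(("text", content))


def parse(xml_bytes):
    """The parser side: read the stream and fire the callbacks."""
    handler = RecordingCallbacks()
    xml.sax.parseString(xml_bytes, handler)
    return handler.events


# Hypothetical, namespace-free sample input.
SAMPLE = b'<p><r lang="en-US"><t>Hello</t></r></p>'
EVENTS = parse(SAMPLE)
```

Replacing the parser means replacing what `parse` does internally, while the callback class is the part that would be migrated.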


Most of the work in the OOXML import/export project, however, will be 
spent in other areas:


- Implementing features that exist in MS Office but not in OpenOffice.  
Examples are SmartArt shapes (for all applications).


- Improving features in OpenOffice that are not working as well as they 
should/could.  Examples are pivot tables in Calc or the slide show in 
Impress.


- Supporting existing features in OpenOffice that are just not handled by 
the OOXML importer.



Regarding POI, there are several reasons not to use it:

- As said above, the existing import code is to be migrated to the new 
framework.   The new framework should offer an interface that supports 
this migration.


- POI is implemented in Java.

- As far as I understand it (I don't find its documentation very 
helpful), POI is more like a DOM tree with better access to its nodes 
than a streaming parser.  That would result in lower execution speed and 
larger memory consumption.


- POI supports OOXML / MS Office only up to 2007.   That seems like an 
undesirable restriction.


- The original naming (see http://en.wikipedia.org/wiki/Apache_POI) does 
not imply professional development of the POI project.



Regards,
Andre




Regards,
  Andrea.

-
To unsubscribe, e-mail: dev-unsubscr...@openoffice.apache.org
For additional commands, e-mail: dev-h...@openoffice.apache.org







Re: [PROPOSAL] New OOXML import framework

2014-05-20 Thread Andrea Pescetti

On 19/05/2014 Andre Fischer wrote:

As one of the first tasks in the OOXML area I would like to propose to
redesign and re-implement the OOXML parser.


I can only agree with this one. We've already discussed it many times, 
but even the many users who prefer ODF need good support for OOXML for 
interoperability, and better support for the Microsoft Office native 
formats is consistently among the top requests.



I propose a new and unified approach that will essentially replace the
current design and implementation.


Sounds good. Especially the idea to be able to automatically know how 
much of the specification is covered will be helpful.



I also propose to focus first on Impress.  Its complexity regarding
OOXML is less than that of Writer and Calc


And this is probably good for users too. In my experience, the import of 
.PPTX files is the most unsatisfactory one at the moment, with many 
obvious deficiencies. Improving this one first would already give good 
results for users.



I have made several experiments regarding the reading of the
specification and generation of parsers and am confident that the
outlined approach will work.


A not-so-original question: we have another Apache project, POI, 
http://poi.apache.org/ that, among other things, has an OOXML parser. 
If we are starting from scratch, why not reuse their code? And, if 
there are reasons for not reusing it, could we validate this roadmap 
with the POI developers, who are probably more familiar with OOXML 
parsing than the average reader of this list?


Regards,
  Andrea.




Re: [PROPOSAL] New OOXML import framework

2014-05-20 Thread Kay Schenk


On 05/19/2014 11:57 PM, Andre Fischer wrote:
> On 20.05.2014 00:28, Kay Schenk wrote:
>> [top posting for a moment]
>>
>> Thank you for this initial introduction to planning better support for
>> OOXML. The reality is this is necessary, and I would imagine most
>> involved in the project realize this.  OK, just a bit more below.
>>
>> On 05/19/2014 06:39 AM, Andre Fischer wrote:
>>> The compile time part of the framework is to be implemented in Java to
>>> allow an efficient and fast development process.
>> Does this basically mean that we will need to use both Java and C++ for
>> future builds?
> 
> We already need Java and C++ for builds.  This does not change.
> 
> -Andre
> 
> 

OK, right.


-- 
-
MzK

"Life is either a daring adventure, or nothing."
   -- Helen Keller





Re: [PROPOSAL] New OOXML import framework

2014-05-19 Thread Andre Fischer

On 20.05.2014 00:28, Kay Schenk wrote:

[top posting for a moment]

Thank you for this initial introduction to planning better support for
OOXML. The reality is this is necessary, and I would imagine most
involved in the project realize this.  OK, just a bit more below.

On 05/19/2014 06:39 AM, Andre Fischer wrote:

The compile time part of the framework is to be implemented in Java to
allow an efficient and fast development process.

Does this basically mean that we will need to use both Java and C++ for
future builds?


We already need Java and C++ for builds.  This does not change.

-Andre





Re: [PROPOSAL] New OOXML import framework

2014-05-19 Thread Kay Schenk
[top posting for a moment]

Thank you for this initial introduction to planning better support for
OOXML. The reality is this is necessary, and I would imagine most
involved in the project realize this.  OK, just a bit more below.

On 05/19/2014 06:39 AM, Andre Fischer wrote:
> As one of the first tasks in the OOXML area I would like to propose to
> redesign and re-implement the OOXML parser.
> 
> At the moment each application has its own OOXML import design. Those of
> Impress and Calc are basically classic hand written push parser designs
> while that of Writer is semi-automatically derived from the
> WordprocessingML specification.  For all three designs there is hardly
> any documentation and their implementation is hard to understand and
> hard to maintain. All that means that you have to work hard to obtain a
> working knowledge about the OOXML parser for one application and then,
> once you have it, cannot transfer it to the other applications.
> 
> I propose a new and unified approach that will essentially replace the
> current design and implementation.  Using the same framework in all
> applications has several advantages:
> 
> - You only have to learn how to use one well documented framework
> instead of three different and badly documented XML import techniques.
> 
> - It exploits the information given by the OOXML schema to produce
> automatically some of the code that has to be hand written today.
> 
> - It allows automatic analysis of the coverage of the OOXML
> specification so that we can easily see which parts have already been
> implemented and which are still missing.
> 
> - It will be much more easily understandable than the current OOXML
> import (especially that of Writer).
> 
> The one big downside is that the new design basically requires a
> reimplementation of the OOXML import.  But everyone who has seen the
> current implementation might not see that as a downside at all :-)
> 
> 
> 
> Development and migration
> 
> I propose to do the implementation in a new module (possibly called
> main/ooxml/) with the goal to eventually (i.e. in a couple of releases)
> replace main/oox/ and other places that contain OOXML import code.  It
> will not be active by default until everyone agrees that it is
> release-ready.  Of course, there will be switches to easily (but not
> accidentally) activate it for development builds.
> 
> I also propose to focus first on Impress.  Its complexity regarding
> OOXML is less than that of Writer and Calc and the still existing
> expertise in this area of OpenOffice is probably larger than in Writer
> and definitely larger than in Calc.
> 
> Development will start with implementation of the new framework that is
> hinted at above and explained in more detail below.  Then the existing
> Impress import is migrated to the new design by copying and adapting the
> code.  The existing import in main/oox/ remains unchanged.
> 
> 
> 
> The new framework
> 
> The design of the new framework is based on exploiting the OOXML
> specifications (plural because there are different versions, migration
> addendums and MS Office-specific extensions).  A parser generator reads
> the specs and creates the actual OOXML parser from them.  The generated
> parser will basically be a (nested) stack automaton where each state
> corresponds roughly to a complex type as defined by the spec.
> Transitions from one state to another correspond to start and end tags
> that move from one complex type to another.
> 
> The actions that are executed on transitions, and which do the actual
> import work, still have to be provided manually.  With an intermediate
> DSL (domain-specific language) that represents the interface between
> OOXML parser and developer, even this step will become easier and
> more robust.
> 
> The use of an intermediate DSL also allows tweaking of the rules derived
> from the OOXML specification should the need arise (to e.g. cope with
> OOXML files that are not 100% conformant to the specs).
> 
> The compile time part of the framework is to be implemented in Java to
> allow an efficient and fast development process. 

Does this basically mean that we will need to use both Java and C++ for
future builds?

> The runtime part of
> the framework, including the generated parser will be implemented in C++
> and be an integral part of OpenOffice.
> 
> 
> 
> Details
> 
> At the moment we are using a bare bones XML push parser for reading
> OOXML files.  That means that as the XML parser reads the stream of XML
> elements it asks the OOXML import code to handle start tags, end tags,
> and the text in between.  It is the task of these callbacks to provide
> so called contexts for each element. These contexts can then be used to
> make information like attribute values (which the parser only provides
> to start tags) accessible to the callbacks of text and end tags.
> The creation of contexts and persistence of intermediate data is done
> manually in the existing import cod

[PROPOSAL] New OOXML import framework

2014-05-19 Thread Andre Fischer
As one of the first tasks in the OOXML area I would like to propose to 
redesign and re-implement the OOXML parser.


At the moment each application has its own OOXML import design. Those of 
Impress and Calc are basically classic hand-written push parser designs, 
while that of Writer is semi-automatically derived from the 
WordprocessingML specification.  For all three designs there is hardly 
any documentation, and their implementations are hard to understand and 
hard to maintain. All that means that you have to work hard to obtain a 
working knowledge of the OOXML parser for one application and then, 
once you have it, cannot transfer it to the other applications.


I propose a new and unified approach that will essentially replace the 
current design and implementation.  Using the same framework in all 
applications has several advantages:


- You only have to learn how to use one well documented framework 
instead of three different and badly documented XML import techniques.


- It exploits the information given by the OOXML schema to automatically 
produce some of the code that has to be written by hand today.


- It allows automatic analysis of the coverage of the OOXML 
specification so that we can easily see which parts have already been 
implemented and which are still missing.


- It will be much more easily understandable than the current OOXML 
import (especially that of Writer).
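The coverage analysis mentioned in the list above boils down to comparing what the schema defines with what the import handles. A minimal sketch follows, assuming both sides are available as plain sets of element names; the function name and the sample names are placeholders for illustration, not output of any real tool.

```python
def coverage(spec_elements, handled_elements):
    """Return (ratio of spec elements that are handled, sorted missing ones)."""
    spec = set(spec_elements)
    handled = spec & set(handled_elements)
    missing = sorted(spec - handled)
    ratio = len(handled) / len(spec) if spec else 1.0
    return ratio, missing


# Illustrative input only: a tiny made-up subset of element names.
SPEC = {"sld", "cSld", "spTree", "sp", "pic", "graphicFrame"}
HANDLED = {"sld", "cSld", "spTree", "sp", "notInSpec"}

RATIO, MISSING = coverage(SPEC, HANDLED)
print(f"{RATIO:.0%} covered, missing: {MISSING}")
```

Names that are handled but do not appear in the spec set are ignored by the ratio; a real report would probably list them separately.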


The one big downside is that the new design basically requires a 
reimplementation of the OOXML import.  But everyone who has seen the 
current implementation might not see that as a downside at all :-)




Development and migration

I propose to do the implementation in a new module (possibly called 
main/ooxml/) with the goal to eventually (i.e. in a couple of releases) 
replace main/oox/ and other places that contain OOXML import code.  It 
will not be active by default until everyone agrees that it is 
release-ready.  Of course, there will be switches to easily (but not 
accidentally) activate it for development builds.


I also propose to focus first on Impress.  Its complexity regarding 
OOXML is less than that of Writer and Calc and the still existing 
expertise in this area of OpenOffice is probably larger than in Writer 
and definitely larger than in Calc.


Development will start with implementation of the new framework that is 
hinted at above and explained in more detail below.  Then the existing 
Impress import is migrated to the new design by copying and adapting the 
code.  The existing import in main/oox/ remains unchanged.




The new framework

The design of the new framework is based on exploiting the OOXML 
specifications (plural because there are different versions, migration 
addendums and MS Office-specific extensions).  A parser generator reads 
the specs and creates the actual OOXML parser from them.  The generated 
parser will basically be a (nested) stack automaton where each state 
corresponds roughly to a complex type as defined by the spec.  
Transitions from one state to another correspond to start and end tags 
that move from one complex type to another.
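To make the stack automaton idea more concrete, here is a small hand-written sketch of what a generated parser core could look like. Everything in it is an assumption made for illustration: the transition table would in reality be generated from the specs, and the state and tag names below are placeholders in the style of the spec's naming, not a verified excerpt.

```python
# Transition table: (current state, start tag) -> next state.
# Hypothetical entries; a parser generator would derive these from the schema.
TRANSITIONS = {
    ("Document", "sld"): "CT_Slide",
    ("CT_Slide", "cSld"): "CT_CommonSlideData",
    ("CT_CommonSlideData", "spTree"): "CT_GroupShape",
    ("CT_GroupShape", "sp"): "CT_Shape",
}


class StackAutomaton:
    def __init__(self, transitions, start_state="Document"):
        self.transitions = transitions
        # Each stack entry remembers the state and the tag that opened it.
        self.stack = [(start_state, None)]
        self.trace = []

    @property
    def state(self):
        return self.stack[-1][0]

    def start_tag(self, tag):
        """A start tag pushes the state of the complex type it opens."""
        key = (self.state, tag)
        if key not in self.transitions:
            raise ValueError(f"<{tag}> is not allowed in state {self.state}")
        self.stack.append((self.transitions[key], tag))
        self.trace.append(("enter", self.state))

    def end_tag(self, tag):
        """An end tag pops back to the enclosing complex type."""
        state, open_tag = self.stack[-1]
        if open_tag != tag:
            raise ValueError(f"</{tag}> does not close <{open_tag}>")
        self.stack.pop()
        self.trace.append(("leave", state))


def run_sample():
    automaton = StackAutomaton(TRANSITIONS)
    for tag in ("sld", "cSld", "spTree", "sp"):
        automaton.start_tag(tag)
    deepest = automaton.state
    for tag in ("sp", "spTree", "cSld", "sld"):
        automaton.end_tag(tag)
    return deepest, automaton.state, automaton.trace
```

The "enter" and "leave" entries in the trace mark the places where the manually provided import actions would be called.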


The actions that are executed on transitions, and which do the actual 
import work, still have to be provided manually.  With an intermediate 
DSL (domain-specific language) that represents the interface between 
OOXML parser and developer, even this step will become easier and 
more robust.


The use of an intermediate DSL also allows tweaking of the rules derived 
from the OOXML specification should the need arise (to e.g. cope with 
OOXML files that are not 100% conformant to the specs).


The compile time part of the framework is to be implemented in Java to 
allow an efficient and fast development process.  The runtime part of 
the framework, including the generated parser will be implemented in C++ 
and be an integral part of OpenOffice.




Details

At the moment we are using a bare bones XML push parser for reading 
OOXML files.  That means that as the XML parser reads the stream of XML 
elements it asks the OOXML import code to handle start tags, end tags, 
and the text in between.  It is the task of these callbacks to provide 
so called contexts for each element. These contexts can then be used to 
make information like attribute values (which the parser only provides 
to start tags) accessible to the callbacks of text and end tags.
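A minimal sketch of such a context, again using Python's xml.sax purely for illustration: the context is created in the start-tag callback, keeps the attribute values, and is still available when the text and end-tag callbacks run. The class names, the "r" element and its "lang" attribute are assumptions for the example, not the existing import code.

```python
import xml.sax


class Context:
    """Per-element data that has to survive from the start tag to the end tag."""

    def __init__(self, name, attrs):
        self.name = name
        self.attrs = attrs
        self.text = []


class ContextHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.stack = []  # one context per currently open element
        self.runs = []   # (language, text) pairs collected from <r> elements

    def startElement(self, name, attrs):
        # The only place where the parser hands over attribute values.
        self.stack.append(Context(name, dict(attrs.items())))

    def characters(self, content):
        if self.stack:
            self.stack[-1].text.append(content)

    def endElement(self, name):
        ctx = self.stack.pop()
        if name == "r":
            # Attributes seen at the start tag are used at the end tag.
            self.runs.append((ctx.attrs.get("lang"), "".join(ctx.text)))


def collect_runs(xml_bytes):
    handler = ContextHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.runs


# Hypothetical, namespace-free sample input.
RUNS = collect_runs(b'<p><r lang="en-US">Hello</r><r lang="de-DE">Welt</r></p>')
```

This bookkeeping is the part that is written by hand today.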
The creation of contexts and persistence of intermediate data is done 
manually in the existing import code.  The new import framework, 
however, will do this automatically, based on the OOXML specifications, 
and semi-automatically, based on DSL requests.  The automatic part is 
extracted from the specs and is responsible for preprocessing attribute 
values (e.g. conversion from string to boolean, integer, float/double or 
enumerations). The semi-automatic part is driven by developer-supplied 
information in DSL files and defines the subset of attributes that are 
really evaluated by the impor