RE: OOXML

Dennis E. Hamilton Mon, 04 Aug 2014 09:02:57 -0700

It is important to understand that an XML DOM does not capture all of the 
constraints and referential requirements within an ODF document.  In 
particular, content.xml does not contain everything: there are references, 
using XLink (relative hrefs) and also special identifiers (not IDREFs), to 
other files, whether binary attachments or other defined parts (styles.xml 
and meta.xml, to name two).
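
As a rough illustration of the point about XLink references, here is a minimal 
Python sketch that pulls the package-relative hrefs out of a content.xml 
fragment.  The sample fragment and the function name package_references are 
made up for illustration, and the "no URI scheme means package-relative" rule 
is a simplifying assumption, not what the ODF specification says in full.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical, heavily trimmed content.xml fragment (illustration only).
SAMPLE_CONTENT = """\
<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <office:body>
    <office:text>
      <text:p>
        <draw:frame>
          <draw:image xlink:href="Pictures/figure1.png"/>
        </draw:frame>
        <text:a xlink:href="http://example.org/">a link</text:a>
      </text:p>
    </office:text>
  </office:body>
</office:document-content>
"""


def package_references(content_xml: str) -> list[str]:
    """Return XLink hrefs that look like references to other package files."""
    root = ET.fromstring(content_xml)
    refs = []
    for elem in root.iter():
        href = elem.get(XLINK_HREF)
        if href is None:
            continue
        # Assumption: an href with no URI scheme and no leading '#'
        # is a relative reference to another file in the package.
        if not urlsplit(href).scheme and not href.startswith("#"):
            refs.append(href)
    return refs


print(package_references(SAMPLE_CONTENT))
```

To the DOM these hrefs are ordinary attribute strings; nothing in the tree 
itself says that Pictures/figure1.png must exist elsewhere in the package.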


There is also considerable internal structuring that is off-hierarchy.  Some of 
the connections are via fragment IDs (xml:id) and IDREFs, others are by 
identifiers (not IDs and IDREFs) that are introduced in the ODF specification 
but which are not modelled in the Relax NG Schema (beyond saying they have 
string values, for example).

This sort of thing also happens rather heavily in OOXML, where communication 
among parts uses a unique cross-part relationship model.  There are also many 
cross references to named components by other than XML IDs and IDREFs, whether 
or not the components and the references occur in the same part of the OPC 
package.
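
A small Python sketch of the OOXML case: a part refers to another part only by 
a relationship Id, and the Id is resolved through a separate relationships 
(.rels) part.  Both sample strings and the function names are invented for 
illustration; a real package has more relationship attributes (Type, 
TargetMode) and path resolution rules that are ignored here.

```python
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

# Hypothetical relationships part (illustration only).
SAMPLE_RELS = f"""\
<Relationships xmlns="{REL_NS}">
  <Relationship Id="rId1" Type="{R_NS}/image" Target="media/image1.png"/>
  <Relationship Id="rId2" Type="{R_NS}/styles" Target="styles.xml"/>
</Relationships>
"""

# Hypothetical, heavily trimmed document part (illustration only).
SAMPLE_PART = f"""\
<w:document
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"
    xmlns:r="{R_NS}">
  <a:blip r:embed="rId1"/>
</w:document>
"""


def load_relationships(rels_xml: str) -> dict[str, str]:
    """Map each relationship Id to its Target, as written in the .rels part."""
    root = ET.fromstring(rels_xml)
    return {
        rel.get("Id"): rel.get("Target")
        for rel in root.findall(f"{{{REL_NS}}}Relationship")
    }


def resolve_embeds(part_xml: str, rels: dict[str, str]) -> list[str]:
    """Resolve every r:embed attribute in a part through the relationship map."""
    attr = f"{{{R_NS}}}embed"
    root = ET.fromstring(part_xml)
    return [rels[e.get(attr)] for e in root.iter() if e.get(attr) is not None]


print(resolve_embeds(SAMPLE_PART, load_relationships(SAMPLE_RELS)))
```

The value "rId1" means nothing within the DOM of the document part alone; the 
connection exists only once the relationships part is consulted as well.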

One could continue the kind of hack that plants that information as benign 
markers into an internal form of the XML parts (even as a single XML document, 
although that is tricky when ODF documents are nested as subdocuments of 
another), so long as they are replaced when the XML document is committed to a 
saved ODF document file format.

In terms of having a DOM that maps to the external file form and a different 
internal model, the only time that the internal model needs to update the 
externally-oriented DOM is as part of a Save operation.  There might be more 
coupling, but performance and storage issues will doubtless impact the 
engineering outcome, especially for handling large documents with alacrity.  
Copy and paste and undo management will also be factors, along with maintaining 
pagination, word counts, and such.

On the other hand, it is convenient (practically necessary) to specify the 
semantics of ODF, or some profile of ODF, as if operations are on the format 
itself, since it is only the format that is more-or-less well-specified.  It 
would be interesting to know how much this could be taken literally in an 
application.  I think forensic tools for ODF documents might be able to 
operate that way.  I'm not at all certain about production WYSIWYG 
consumers and producers, especially ones implemented to harmonize between 
OOXML, ODF and other interesting formats (EPUB coming to mind).

I will watch Peter Kelly's efforts with great interest to see how much the 
boundaries can be moved in this area.


 -- Dennis E. Hamilton
    dennis.hamil...@acm.org    +1-206-779-9430
    https://keybase.io/orcmid  PGP F96E 89FF D456 628A
    X.509 certs used and requested for signed e-mail


 -----Original Message-----
From: Peter Kelly [mailto:kelly...@gmail.com] 
Sent: Monday, August 4, 2014 01:27
To: dev@openoffice.apache.org
Subject: Re: OOXML

On 4 Aug 2014, at 12:16 am, jan i <j...@apache.org> wrote:


[ ... ]

It's possible in theory, though I'm not familiar enough with the OO codebase to 
say whether it would work in practice.

The key idea is to maintain two separate data structures - one which is the ODF 
XML trees, and another which is the internal representation. Any time a change 
gets made to the former, the implementation must update the latter to reflect 
the change. Modification operations on the latter would need to go in the other 
direction.
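
A toy Python sketch of that two-structure idea, under strong simplifying 
assumptions: the "XML tree" is a flat list of <p> elements and the "internal 
representation" is a list of strings.  The class and method names are 
hypothetical and say nothing about how the OO codebase or UX Write actually 
do this.

```python
import xml.etree.ElementTree as ET


class SyncedDocument:
    """Keeps an XML tree and a simple internal model in step with each other."""

    def __init__(self, xml_text: str):
        self.tree = ET.fromstring(xml_text)
        # Internal model: one string per <p> child of the root.
        self.paragraphs = [p.text or "" for p in self.tree.findall("p")]

    def set_paragraph_in_tree(self, index: int, text: str) -> None:
        # The change originates on the XML side ...
        self.tree.findall("p")[index].text = text
        # ... and is mirrored into the internal model.
        self.paragraphs[index] = text

    def append_paragraph_in_model(self, text: str) -> None:
        # The change originates on the internal side ...
        self.paragraphs.append(text)
        # ... and is mirrored back into the XML tree.
        ET.SubElement(self.tree, "p").text = text

    def in_sync(self) -> bool:
        """Check that both structures describe the same paragraphs."""
        return self.paragraphs == [p.text or "" for p in self.tree.findall("p")]


doc = SyncedDocument("<doc><p>one</p><p>two</p></doc>")
doc.set_paragraph_in_tree(0, "uno")
doc.append_paragraph_in_model("three")
print(doc.paragraphs, doc.in_sync())
```

Every mutating operation has to touch both structures, which is where the 
cost of this approach shows up for anything richer than paragraphs.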

[ ... ]

In the case of UX Write, there are a few instances where I've used custom 
extensions to handle certain things. The main ones are:

1. Table of contents/list of tables/list of figures.

When you insert one of these into your document, it inserts a <nav> element 
with a CSS class name of "tableofcontents", "listoffigures", or "listoftables", 
which were chosen as these are the same keywords that LaTeX uses for these 
features. UX Write treats these as having special meaning, in the sense that 
when opening a document (and when the document is modified), it updates the 
content of these <nav> elements based on the set of all heading, figure, or 
table elements in the document (including numbering/captions).

2. OOXML-specific features.

When converting from .docx to .html during the process of opening a document, 
it assigns certain pre-defined CSS class names to particular types of HTML 
elements to indicate their purpose. For example, a cross-reference whose 
display format is supposed to include both the label and caption of a figure 
will be translated as:

[ ... ]
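
For item 1 above (the <nav class="tableofcontents"> element), the regeneration 
step might look roughly like the following Python sketch.  It assumes 
well-formed XHTML, handles only headings, and the entry markup (a <div> with a 
"toc-h1"-style class) is invented for illustration; it is not UX Write's 
actual output.

```python
import xml.etree.ElementTree as ET

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

# Hypothetical document body (illustration only).
SAMPLE_BODY = """\
<body>
  <nav class="tableofcontents"/>
  <h1>Introduction</h1>
  <p>Some text.</p>
  <h2>Scope</h2>
  <h1>Conclusion</h1>
</body>
"""


def refresh_toc(body: ET.Element) -> None:
    """Rebuild each table-of-contents <nav> from the headings in the body."""
    # Collect headings in document order before modifying anything.
    headings = [el for el in body.iter() if el.tag in HEADING_TAGS]
    for nav in list(body.iter("nav")):
        if nav.get("class") != "tableofcontents":
            continue
        # Discard the stale entries, then regenerate them.
        for child in list(nav):
            nav.remove(child)
        nav.text = None
        for heading in headings:
            entry = ET.SubElement(nav, "div", {"class": "toc-" + heading.tag})
            entry.text = "".join(heading.itertext())


body = ET.fromstring(SAMPLE_BODY)
refresh_toc(body)
print([entry.text for entry in body.find("nav")])
```

Calling refresh_toc again after an edit replaces the entries rather than 
appending to them, which matches the "update on open and on modification" 
behaviour described in item 1.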


