ODF filter

Peter Kelly Wed, 07 Jan 2015 05:59:37 -0800

I mentioned in my last mail the topic of writing an ODF filter. I realise the 
codebase is pretty difficult to navigate right now due to lack of 
documentation, so I thought I’d get the discussion started by outlining how I 
would suggest we proceed with this, based on my experience writing the Word 
filter (I tend to use the term “Word” rather than OOXML, since the current 
implementation deals only with the word processing subset of the spec; 
similarly for ODF for now).


At a high level, each filter needs to provide three operations: get, put, and 
create. These operate on “abstract” and “concrete” documents - an abstract 
document is in HTML format (our common intermediate representation) and the 
concrete document is in the format which the filter is implementing (in this 
case, .odt).

The get operation will need to convert from ODT to HTML, and include id 
attributes in the HTML file that allow elements in the latter to be correlated 
with elements in the former. In the Word filter, the ids are based on the index 
of the node in a pre-order traversal of the tree. These are used to look up 
elements during the put operation, so we know which element to update.
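
To make the id scheme concrete, here is a minimal sketch in C. The Node type, 
the function names, and the "n" prefix on the id are all made up for this 
example; they are not the types or names used in the codebase, which has its 
own DOM structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical minimal tree node, standing in for an element of the
   concrete document. */
typedef struct Node {
    const char *name;
    struct Node *first_child;
    struct Node *next_sibling;
    int seq; /* index in a pre-order traversal; set by assign_seq() */
} Node;

/* Number `node` and its descendants in pre-order, starting at `next`.
   Returns the first unused index. */
static int assign_seq(Node *node, int next)
{
    assert(node != NULL);
    node->seq = next++;
    for (Node *child = node->first_child; child != NULL; child = child->next_sibling)
        next = assign_seq(child, next);
    return next;
}

/* Write the id attribute value for the HTML element generated from `node`.
   The "n" prefix is an arbitrary choice for this sketch. */
static void format_id(const Node *node, char *buf, size_t size)
{
    snprintf(buf, size, "n%d", node->seq);
}

/* During put: find the concrete node that an HTML id refers to, or NULL
   if no node in the tree has that id. */
static Node *find_by_id(Node *node, const char *id)
{
    char buf[32];
    format_id(node, buf, sizeof buf);
    if (strcmp(buf, id) == 0)
        return node;
    for (Node *child = node->first_child; child != NULL; child = child->next_sibling) {
        Node *found = find_by_id(child, id);
        if (found != NULL)
            return found;
    }
    return NULL;
}
```

The get side would call assign_seq() once on the root and then format_id() for 
each element it emits; the put side would call find_by_id() with the id it 
reads back from the HTML. A linear search is used here only to keep the sketch 
short.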

The put operation will need to accept an existing ODT document, and update it 
based on a modified version of the HTML file that was previously obtained from 
the get operation. The way I did this in the Word filter was to traverse both 
trees in “parallel”, determining what had changed (and using the element 
mappings based on id attributes), making changes to the original document as 
appropriate. In the case of formatting attributes, this involved re-generating 
the CSS from the concrete document, comparing which attributes had changed, and 
then applying the necessary changes to the formatting elements in the concrete 
document. Content was handled differently, generally by simply overwriting 
it.
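
The "compare which attributes have changed" step can be sketched as a diff 
between two lists of CSS properties: one re-generated from the concrete 
document and one taken from the modified HTML. This is only an illustration 
under assumptions; CSSProperty, CSSChange, and css_diff are hypothetical names, 
and property values are assumed to be non-NULL strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *name;
    const char *value;
} CSSProperty;

typedef struct {
    const char *name;
    const char *old_value; /* NULL if the property was added */
    const char *new_value; /* NULL if the property was removed */
} CSSChange;

/* Return the value of `name` in `props`, or NULL if it is not present. */
static const char *css_lookup(const CSSProperty *props, size_t count, const char *name)
{
    for (size_t i = 0; i < count; i++)
        if (strcmp(props[i].name, name) == 0)
            return props[i].value;
    return NULL;
}

/* Compare old and new property lists. Stores up to `out_cap` changes in
   `out` and returns the total number of changes found. */
static size_t css_diff(const CSSProperty *old_props, size_t old_count,
                       const CSSProperty *new_props, size_t new_count,
                       CSSChange *out, size_t out_cap)
{
    assert(out != NULL || out_cap == 0);
    size_t found = 0;

    /* Properties that were changed or removed. */
    for (size_t i = 0; i < old_count; i++) {
        const char *new_value = css_lookup(new_props, new_count, old_props[i].name);
        if (new_value != NULL && strcmp(new_value, old_props[i].value) == 0)
            continue; /* unchanged */
        if (found < out_cap)
            out[found] = (CSSChange){ old_props[i].name, old_props[i].value, new_value };
        found++;
    }

    /* Properties that were added. */
    for (size_t i = 0; i < new_count; i++) {
        if (css_lookup(old_props, old_count, new_props[i].name) != NULL)
            continue; /* already handled above */
        if (found < out_cap)
            out[found] = (CSSChange){ new_props[i].name, NULL, new_props[i].value };
        found++;
    }
    return found;
}
```

Only the properties reported by such a diff would then need to be written back 
to the formatting elements of the concrete document; everything else is left 
alone.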

During traversal, the functions in DFBDT.c can be used to handle the case where 
the children of a given element have been re-ordered (e.g. someone moved a 
paragraph to a different position in the document). This uses the id mappings in 
the HTML to figure out what elements in the concrete document they correspond 
to, and when it sees them in a different order, it moves some of them so that 
they come to match the order in which the corresponding HTML elements appear. 
Unsupported elements are left untouched by this process.
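
The following is a simplified sketch of the re-ordering idea, not the 
algorithm in DFBDT.c. It assumes each concrete child either carries the id of 
its HTML counterpart or -1 if it is unsupported, that ids are unique and 
non-negative, and that no elements were added or deleted in the HTML. 
ConcreteChild and reorder_children are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int id;            /* id shared with an HTML element, or -1 if unsupported */
    const char *label; /* only here to make the example readable */
} ConcreteChild;

/* Reorder `children` in place so that the supported ones appear in the
   order given by `html_order`. Unsupported children (id < 0) keep their
   positions. Returns 0 on success, or -1 if the two lists do not describe
   the same set of ids (the array may be partially reordered in that case). */
static int reorder_children(ConcreteChild *children, size_t count,
                            const int *html_order, size_t html_count)
{
    assert(children != NULL && html_order != NULL);
    size_t next = 0; /* index of the next id wanted from html_order */

    for (size_t slot = 0; slot < count; slot++) {
        if (children[slot].id < 0)
            continue; /* unsupported: leave untouched */
        if (next == html_count || html_order[next] < 0)
            return -1;

        /* Earlier slots already hold earlier ids, so search from here on. */
        size_t from = slot;
        while (from < count && children[from].id != html_order[next])
            from++;
        if (from == count)
            return -1;

        /* Both positions hold supported children, so swapping them never
           moves an unsupported one. */
        ConcreteChild tmp = children[slot];
        children[slot] = children[from];
        children[from] = tmp;
        next++;
    }
    return next == html_count ? 0 : -1;
}
```

For example, with concrete children A(1), X(unsupported), B(2), C(3) and an 
HTML order of 3, 1, 2, the result is C, X, A, B: the unsupported element stays 
in the second position while the others move around it.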

The create operation will need to produce a brand new ODT file based on an HTML 
file. This can simply be implemented by creating an empty ODT file, and then 
doing a put operation - it’s essentially “updating” an empty document to which 
new content has been added.
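
In code, the relationship between create and put might look roughly like the 
toy below. ConcreteDoc is a hypothetical stand-in that holds a string rather 
than a real ODT package, and concrete_put just copies text; the point is only 
that create is "make an empty document, then put".

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a concrete document. */
typedef struct {
    char body[256];
} ConcreteDoc;

/* Produce a document with no content. */
static void concrete_init_empty(ConcreteDoc *doc)
{
    doc->body[0] = '\0';
}

/* Stand-in for put: a real filter would update the document based on the
   HTML; this toy simply stores the text. Returns 1 on success, 0 if the
   text does not fit. */
static int concrete_put(ConcreteDoc *doc, const char *html)
{
    assert(doc != NULL && html != NULL);
    if (strlen(html) >= sizeof doc->body)
        return 0;
    strcpy(doc->body, html);
    return 1;
}

/* create = start from an empty document, then run put against it. */
static int concrete_create(ConcreteDoc *doc, const char *html)
{
    concrete_init_empty(doc);
    return concrete_put(doc, html);
}
```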

The entry points for these three functions are DFGet, DFPut, and DFCreate in 
api/src/Operations.c. These each have a switch statement which looks at the 
file type and calls through to a function in the appropriate filter to do the 
conversion. In the future we may need a more generic/pluggable way of doing 
this, but for the time being, defining three functions ODTGet, ODTPut, and 
ODTCreate (corresponding to the existing WordGet, WordPut, and WordCreate 
functions) and adding cases to the switch statements for these will be 
sufficient.
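
As a rough picture of that dispatch, here is a self-contained sketch. It does 
not reproduce the real signatures of DFGet or WordGet: the FileType enum, the 
extension check, and the two stub functions are assumptions made so that the 
example compiles on its own.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical file type tags; the real code may identify types differently. */
typedef enum { FileTypeUnknown, FileTypeDocx, FileTypeOdt } FileType;

/* Guess the file type from the extension (case-sensitive, for brevity). */
static FileType file_type_from_path(const char *path)
{
    assert(path != NULL);
    const char *dot = strrchr(path, '.');
    if (dot == NULL)
        return FileTypeUnknown;
    if (strcmp(dot, ".docx") == 0)
        return FileTypeDocx;
    if (strcmp(dot, ".odt") == 0)
        return FileTypeOdt;
    return FileTypeUnknown;
}

/* Stubs standing in for the per-filter get functions. */
static int word_get_stub(const char *concrete_path, const char *abstract_path)
{
    printf("Word filter: %s -> %s\n", concrete_path, abstract_path);
    return 1;
}

static int odt_get_stub(const char *concrete_path, const char *abstract_path)
{
    printf("ODT filter: %s -> %s\n", concrete_path, abstract_path);
    return 1;
}

/* Look at the file type and call through to the matching filter.
   Returns 1 on success, 0 if the file type is not supported. */
static int dispatch_get(const char *concrete_path, const char *abstract_path)
{
    switch (file_type_from_path(concrete_path)) {
        case FileTypeDocx:
            return word_get_stub(concrete_path, abstract_path);
        case FileTypeOdt:
            return odt_get_stub(concrete_path, abstract_path);
        case FileTypeUnknown:
            break;
    }
    fprintf(stderr, "Unsupported file type: %s\n", concrete_path);
    return 0;
}
```

Adding the ODF filter then amounts to one new case per operation, with put and 
create following the same pattern as get.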

It’s probably best to start off by having a look at these functions in 
filters/ooxml/src/word/Word.c and following the code through there. If you’re 
using Xcode, you can easily jump through the function call graph to go to the 
implementation of a called function; I expect Visual Studio probably has 
something similar. At any rate, I’ve mostly chosen function names that are not 
prefixes of other function names, so it should be fairly easy to find the 
function you’re looking for with grep if you don’t know what file it’s in (this 
is something I love about C, which you can’t do so easily using object-oriented 
languages).

The Word filter has two core classes used during conversion - WordPackage and 
WordConverter (defined in their respective .h and .c files). A WordPackage 
encapsulates a .docx file, and contains data structures loaded from the XML 
files stored within the .docx package (which is actually a zip file). There are 
classes for things like the stylesheet, numbering information, the set of 
footnotes/endnotes, and so forth. For ODF, I already did a little bit of work 
a while back defining skeleton versions of the corresponding classes 
(ODFPackage, ODFManifest, and ODFSheet). The file ODF.c is empty but would be a 
suitable place to put the get/put/create functions.

Data structures used in ODF differ somewhat from those of Word documents, 
though there is a lot of conceptual similarity. The most significant difference 
I can think of is the way that direct formatting is handled - ODF treats 
*everything* as a style; if you apply direct formatting to a run of text, then 
it creates what’s called an “automatic style” and references that from the 
content. So styles, formatting, numbering, and numerous other things will have 
to be represented differently, but many of the strategies used in the Word 
filter should carry across fairly easily. I need to document these better, but 
perhaps it’s easiest if you get stuck to ask me questions, and then we can put 
these on the wiki or in the source documentation.
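
To illustrate the automatic style idea: in ODF, a run with direct formatting 
is written as a text:span whose text:style-name refers to a style:style 
declared under office:automatic-styles. A filter therefore needs some way to 
find or create an automatic style for a given set of properties. The sketch 
below shows one possible approach; AutoStyleTable, auto_style_for, the 
fixed-size table, and the "T1", "T2", ... naming are assumptions for the 
example, and the properties are represented as a single pre-normalised string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MAX_AUTO_STYLES 32

typedef struct {
    char name[16];        /* generated style name, e.g. "T1" */
    char properties[128]; /* normalised formatting, e.g. "font-weight:bold" */
} AutoStyle;

typedef struct {
    AutoStyle styles[MAX_AUTO_STYLES];
    size_t count;
} AutoStyleTable;

/* Return the name of the automatic style with exactly these properties,
   creating it if it does not exist yet. Returns NULL if the table is full
   or the property string is too long. */
static const char *auto_style_for(AutoStyleTable *table, const char *properties)
{
    assert(table != NULL && properties != NULL);

    /* Reuse an existing style with identical formatting. */
    for (size_t i = 0; i < table->count; i++)
        if (strcmp(table->styles[i].properties, properties) == 0)
            return table->styles[i].name;

    if (table->count == MAX_AUTO_STYLES ||
        strlen(properties) >= sizeof table->styles[0].properties)
        return NULL;

    /* Otherwise allocate the next name in sequence. */
    AutoStyle *style = &table->styles[table->count];
    snprintf(style->name, sizeof style->name, "T%zu", table->count + 1);
    strcpy(style->properties, properties);
    table->count++;
    return style->name;
}
```

Two bold runs would then share one automatic style, and the content would 
reference it by name instead of carrying the formatting inline as Word does.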

Anyway, this is just a braindump of what I think are the most relevant things 
someone implementing an ODF filter will need to know. I’d love to be 
pestered with more questions about this, as I think getting started on this 
important task would be a good step forward for the project, and demonstrate 
our commitment to making interoperability easier for people.

—
Dr Peter M. Kelly
[email protected]

PGP key: http://www.kellypmk.net/pgp-key
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)
