I mentioned in my last mail the topic of writing an ODF filter. I realise the codebase is pretty difficult to navigate right now due to lack of documentation, so I thought I’d get the discussion started by outlining how I would suggest we proceed with this, based on my experience writing the Word filter. (I tend to use the term “Word” rather than OOXML, since the current implementation deals only with the word processing subset of the spec; similarly for ODF for now.)
At a high level, each filter needs to provide three operations: get, put, and create. These operate on “abstract” and “concrete” documents - an abstract document is in HTML format (our common intermediate representation) and the concrete document is in the format which the filter is implementing (in this case, .odt).

The get operation will need to convert from ODT to HTML, and include id attributes in the HTML file that allow elements in the HTML to be correlated with elements in the ODT document. In the Word filter, the ids are based on the index of the node in a pre-order traversal of the tree. These are used to look up elements during the put operation, so we know which element to update.

The put operation will need to accept an existing ODT document, and update it based on a modified version of the HTML file that was previously obtained from the get operation. The way I did this in the Word filter was to traverse both trees in “parallel”, determining what had changed (using the element mappings based on id attributes), and making changes to the original document as appropriate. In the case of formatting attributes, this involved re-generating the CSS from the concrete document, comparing which attributes had changed, and then applying the necessary changes to the formatting elements in the concrete document. Content was handled differently, generally by simply overwriting it.

During traversal, the functions in DFBDT.c can be used to handle the case where the children of a given element have been re-ordered (e.g. someone moved a paragraph to a different position in the document). This uses the id mappings in the HTML to figure out which elements in the concrete document they correspond to, and when it sees them in a different order, it moves some of them so that they match the order in which the corresponding HTML elements appear. Unsupported elements are left untouched by this process.
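To make the id scheme described above a bit more concrete, here is a minimal sketch of numbering nodes by their position in a pre-order traversal, and then looking a node up again by that number (as put would need to do when it encounters an id in the HTML). The Node struct and the names assignSeqNos and findBySeqNo are made up for illustration; they are not the DOM types or functions the Word filter actually uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified tree node; NOT the project's real DOM type. */
typedef struct Node {
    const char *name;           /* element name, e.g. "text:p" */
    struct Node *firstChild;
    struct Node *nextSibling;
    int seqNo;                  /* index of this node in a pre-order traversal */
} Node;

/* Number `node` and all of its descendants in pre-order, starting at `next`.
   Returns the first index that was not used. */
int assignSeqNos(Node *node, int next)
{
    assert(node != NULL);
    node->seqNo = next++;       /* visit the node itself before its children */
    for (Node *child = node->firstChild; child != NULL; child = child->nextSibling)
        next = assignSeqNos(child, next);
    return next;
}

/* Return the node in the subtree rooted at `node` that carries index `seqNo`,
   or NULL if there is none. */
Node *findBySeqNo(Node *node, int seqNo)
{
    if (node == NULL)
        return NULL;
    if (node->seqNo == seqNo)
        return node;
    for (Node *child = node->firstChild; child != NULL; child = child->nextSibling) {
        Node *found = findBySeqNo(child, seqNo);
        if (found != NULL)
            return found;
    }
    return NULL;
}
```

In this sketch, get would call assignSeqNos on the root of the concrete document and write each seqNo into the id attribute of the HTML element generated from that node; put would parse the id back out and call findBySeqNo to locate the element to update. A linear search is used only to keep the example short; a real filter would more likely build a lookup table once.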
The create operation will need to produce a brand new ODT file based on an HTML file. This can simply be implemented by creating an empty ODT file, and then doing a put operation - it’s essentially “updating” an empty document to which new content has been added.

The entry points for these three functions are DFGet, DFPut, and DFCreate in api/src/Operations.c. These each have a switch statement which looks at the file type and calls through to a function in the appropriate filter to do the conversion. In the future we may need a more generic/pluggable way of doing this, but for the time being, defining three functions ODTGet, ODTPut, and ODTCreate (corresponding to the existing WordGet, WordPut, and WordCreate functions) and adding cases to the switch statements for these will be sufficient.

It’s probably best to start off by having a look at these functions in filters/ooxml/src/word/Word.c and following the code through there. If you’re using Xcode, you can easily jump through the function call graph to go to the implementation of a called function; I expect Visual Studio has something similar. At any rate, I’ve mostly chosen function names that are not prefixes of other function names, so it should be fairly easy to find the function you’re looking for with grep if you don’t know what file it’s in (this is something I love about C, which you can’t do so easily using object-oriented languages).

The Word filter has two core classes used during conversion - WordPackage and WordConverter (defined in their respective .h and .c files). A Word package encapsulates a .docx file, and contains data structures loaded from the XML files stored within the .docx package (which is actually a zip file). There are classes for things like the stylesheet, numbering information, the set of footnotes/endnotes, and so forth. For ODF, I already did a little bit of work a while back defining skeleton versions of the corresponding classes (ODFPackage, ODFManifest, and ODFSheet).
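The switch-based dispatch in DFGet, DFPut, and DFCreate mentioned earlier can be pictured roughly as follows. This is only a sketch of the pattern: the FileFormat enum, the extension check, and the simplified int-returning signatures are all assumptions made for the example, and the functions carry a “Sketch” suffix to make clear they are not the real ones in api/src/Operations.c or Word.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical file-type tag; the real code may classify files differently. */
typedef enum { FormatUnknown, FormatDocx, FormatOdt } FileFormat;

/* Assumption: the file type is derived from the filename extension. */
FileFormat formatFromFilename(const char *filename)
{
    assert(filename != NULL);
    const char *dot = strrchr(filename, '.');
    if (dot == NULL)
        return FormatUnknown;
    if (strcmp(dot, ".docx") == 0)
        return FormatDocx;
    if (strcmp(dot, ".odt") == 0)
        return FormatOdt;
    return FormatUnknown;
}

/* Stand-ins for the per-filter entry points. Each one just returns a code
   identifying which filter was called, so the dispatch can be observed. */
int WordGetSketch(const char *concretePath) { (void)concretePath; return 1; }
int ODTGetSketch(const char *concretePath)  { (void)concretePath; return 2; }

/* Dispatcher: one case per supported format. Adding the ODF filter means
   adding the FormatOdt case here (and likewise for put and create). */
int DFGetSketch(const char *concretePath)
{
    switch (formatFromFilename(concretePath)) {
        case FormatDocx:
            return WordGetSketch(concretePath);
        case FormatOdt:
            return ODTGetSketch(concretePath);
        default:
            return 0;   /* unsupported format */
    }
}
```

With these stubs, DFGetSketch("report.odt") reaches the ODT stand-in and DFGetSketch("report.docx") reaches the Word one; anything else falls through to the default case. The put and create dispatchers would follow the same shape.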
The file ODF.c is empty but would be a suitable place to put the get/put/create functions.

Data structures used in ODF differ somewhat from those of Word documents, though there is a lot of conceptual similarity. The most significant difference I can think of is the way that direct formatting is handled - ODF treats *everything* as a style; if you apply direct formatting to a run of text, then it creates what’s called an “automatic style” and references that from the content. So styles, formatting, numbering, and numerous other things will have to be represented differently, but many of the strategies used in the Word filter should carry across fairly easily. I need to document these better, but perhaps it’s easiest if you ask me questions when you get stuck, and then we can put the answers on the wiki or in the source documentation.

Anyway, this is just a braindump of what I think are the most relevant things someone implementing an ODF filter will need to know. I’d love to be pestered with more questions about this, as I think getting started on this important task would be a good step forward for the project, and demonstrate our commitment to making interoperability easier for people.

--
Dr Peter M. Kelly
[email protected]
PGP key: http://www.kellypmk.net/pgp-key
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)
