Hi Peter,

This is a helpful email. From your concrete discussion I can better understand
the mapping between the abstract (HTML) model and the concrete (DOCX, ODT)
formats.

You mention differences in the style runs for Word and ODT, which I am familiar
with from the OOXML side. Does the abstract (HTML) model take a particular
approach towards style runs? Is there a concrete version of the HTML model? Is
there a specification or plan for the abstract model?

I also think an interesting approach to other file format filters would be to
focus on PUT functionality before GET. Understanding how to write a proper
document is the first step towards reading documents in all of their historical
variations. PDF is a classic example of this: Adobe has always done well at
defining what a valid PDF document looks like, but after 24 years there are
myriad variants that are valid.

Regards,
Dave

On Jan 7, 2015, at 5:57 AM, Peter Kelly wrote:

> I mentioned in my last mail the topic of writing an ODF filter. I realise the
> codebase is pretty difficult to navigate right now due to lack of
> documentation, so I thought I’d get the discussion started by outlining how I
> would suggest we proceed with this, based on my experience writing the Word
> filter (I tend to use the term “Word” rather than OOXML, since the current
> implementation deals only with the word processing subset of the spec;
> similarly for ODF for now).
>
> At a high level, each filter needs to provide three operations: get, put, and
> create. These operate on “abstract” and “concrete” documents - an abstract
> document is in HTML format (our common intermediate representation) and the
> concrete document is in the format which the filter is implementing (in this
> case, .odt).
>
> The get operation will need to convert from ODT to HTML, and include id
> attributes in the HTML file that allow elements in the latter to be
> correlated with elements in the former. In the Word filter, the ids are based
> on the index of the node in a pre-order traversal of the tree. These are used
> to look up elements during the put operation, so we know which element to
> update.
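To check my understanding of the id scheme described above, here is a minimal
sketch in C. It is only an illustration under my own assumptions: the Node type
and the functions node_add_child, assign_preorder_ids, and find_by_id are
hypothetical and are not taken from the codebase. It numbers the nodes of a
small tree in pre-order and then looks a node up again by that number, which is
roughly what I imagine put doing when it maps an HTML element back to its
concrete counterpart.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 8

/* Hypothetical node type, for illustration only (not the project's DOM). */
typedef struct Node {
    const char *name;                    /* element name, e.g. "p" */
    int id;                              /* pre-order index, assigned below */
    size_t child_count;
    struct Node *children[MAX_CHILDREN];
} Node;

/* Append child as the last child of parent. */
static void node_add_child(Node *parent, Node *child)
{
    assert(parent->child_count < MAX_CHILDREN);
    parent->children[parent->child_count++] = child;
}

/* Visit a node before its children, handing out consecutive ids.
   Returns the next unused id. */
static int assign_preorder_ids(Node *node, int next_id)
{
    node->id = next_id++;
    for (size_t i = 0; i < node->child_count; i++)
        next_id = assign_preorder_ids(node->children[i], next_id);
    return next_id;
}

/* Find the node carrying the given id, or NULL if there is none. */
static Node *find_by_id(Node *node, int id)
{
    if (node->id == id)
        return node;
    for (size_t i = 0; i < node->child_count; i++) {
        Node *found = find_by_id(node->children[i], id);
        if (found != NULL)
            return found;
    }
    return NULL;
}
```

With a body containing two paragraphs, where the first paragraph holds one run,
calling assign_preorder_ids(&body, 0) would number the nodes body, first
paragraph, run, second paragraph in that order, and find_by_id would then
retrieve any of them from its id. A real filter would presumably keep a lookup
table rather than searching the tree each time.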
>
> The put operation will need to accept an existing ODT document, and update it
> based on a modified version of the HTML file that was previously obtained
> from the get operation. The way I did this in the Word filter was to traverse
> both trees in “parallel”, determining what had changed (and using the element
> mappings based on id attributes), making changes to the original document as
> appropriate. In the case of formatting attributes, this involved
> re-generating the CSS from the concrete document, comparing which attributes
> had changed, and then applying the necessary changes to the formatting
> elements in the concrete document. Content was handled differently,
> generally by simply overwriting it.
>
> During traversal, the functions in DFBDT.c can be used to handle the case
> where the children of a given element have been re-ordered (e.g. someone
> moved a paragraph to a different position in the document). This uses the id
> mappings in the HTML to figure out what elements in the concrete document
> they correspond to, and when it sees them in a different order, it moves some
> of them so that they come to match the order in which the corresponding HTML
> elements appear. Unsupported elements are left untouched by this process.
>
> The create operation will need to produce a brand new ODT file based on an
> HTML file. This can simply be implemented by creating an empty ODT file, and
> then doing a put operation - it’s essentially “updating” an empty document to
> which new content has been added.
>
> The entry points for these three functions are DFGet, DFPut, and DFCreate in
> api/src/Operations.c. These each have a switch statement which looks at the
> file type and calls through to a function in the appropriate filter to do the
> conversion.
> In the future we may need a more generic/pluggable way of doing
> this, but for the time being, defining three functions ODTGet, ODTPut, and
> ODTCreate (corresponding to the existing WordGet, WordPut, and WordCreate
> functions) and adding cases to the switch statements for these will be
> sufficient.
>
> It’s probably best to start off by having a look at these functions in
> filters/ooxml/src/word/Word.c and following the code through there. If you’re
> using Xcode, you can easily jump through the function call graph to go to the
> implementation of a called function; I expect Visual Studio probably has
> something similar. At any rate, I’ve mostly chosen function names that are
> not prefixes of other function names, so it should be fairly easy to find the
> function you’re looking for with grep if you don’t know what file it’s in
> (this is something I love about C, which you can’t do so easily using
> object-oriented languages).
>
> The Word filter has two core classes used during conversion - WordPackage and
> WordConverter (defined in their respective .h and .c files). A Word package
> encapsulates a .docx file, and contains data structures loaded from the XML
> files stored within the .docx package (which is actually a zip file). There
> are classes for things like the stylesheet, numbering information, the set of
> footnotes/endnotes, and so forth. For ODF, I already did a little bit of
> work a while back defining skeleton versions of the corresponding classes
> (ODFPackage, ODFManifest, and ODFSheet). The file ODF.c is empty but would be
> a suitable place to put the get/put/create functions.
>
> Data structures used in ODF differ somewhat from those of Word documents,
> though there is a lot of conceptual similarity.
> The most significant
> difference I can think of is the way that direct formatting is handled - ODF
> treats *everything* as a style; if you apply direct formatting to a run of
> text, then it creates what’s called an “automatic style” and references that
> from the content. So styles, formatting, numbering, and numerous other things
> will have to be represented differently, but many of the strategies used in
> the Word filter should carry across fairly easily. I need to document these
> better, but perhaps it’s easiest to ask me questions if you get stuck, and
> then we can put these on the wiki or in the source documentation.
>
> Anyway, this is just a braindump of what I think are the most relevant things
> someone implementing an ODF filter will need to know. I’d love to be
> pestered with more questions about this, as I think getting started on this
> important task would be a good step forward for the project, and demonstrate
> our commitment to making interoperability easier for people.
>
> --
> Dr Peter M. Kelly
> [email protected]
>
> PGP key: http://www.kellypmk.net/pgp-key
> (fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)
