Hi Peter,

This is a helpful email. From your concrete discussion I can better understand 
the mapping between the abstract model (HTML) and the concrete formats (DOCX, 
ODT).

You mention differences in the style runs for Word and ODT, with which I am 
familiar from the OOXML side. Does the abstract model / HTML take a particular 
approach towards style runs? Is there a concrete version of the HTML model? Is 
there a specification or plan for the abstract model?

I also think that one approach towards other file format filters that could be 
interesting would be to focus on PUT functionality before GET. Understanding 
how to write a proper document is the first step towards reading documents in 
all of the historical variations. PDF is a classic example of this. Adobe has 
always done well defining what a valid PDF document looks like, but after 24 
years there are myriad variants that are valid.

Regards,
Dave

On Jan 7, 2015, at 5:57 AM, Peter Kelly wrote:

> I mentioned in my last mail the topic of writing an ODF filter. I realise the 
> codebase is pretty difficult to navigate right now due to lack of 
> documentation, so I thought I’d get the discussion started by outlining how I 
> would suggest we proceed with this, based on my experience writing the Word 
> filter (I tend to use the term “Word” rather than OOXML, since the current 
> implementation deals only with the word processing subset of the spec; 
> similarly for ODF for now).
> 
> At a high level, each filter needs to provide three operations: get, put, and 
> create. These operate on “abstract” and “concrete” documents - an abstract 
> document is in HTML format (our common intermediate representation) and the 
> concrete document is in the format which the filter is implementing (in this 
> case, .odt).
> 
> The get operation will need to convert from ODT to HTML, and include id 
> attributes in the HTML file that allow elements in the latter to be 
> correlated with elements in the former. In the Word filter, the ids are based 
> on the index of the node in a pre-order traversal of the tree. These are used 
> to look up elements during the put operation, so we know which element to 
> update.
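
A minimal sketch in C of the id scheme described above: number the nodes of
the concrete tree in pre-order, derive an id string from that number, and find
the node again later from the number. The Node type, the function names, and
the "odt" id prefix are assumptions made for illustration; they are not taken
from the Word filter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical minimal tree node; not the project's actual DOM type. */
typedef struct Node {
    const char *name;
    struct Node *firstChild;
    struct Node *nextSibling;
    int seqNo;                 /* index of the node in a pre-order traversal */
} Node;

/* Visit the node first, then its children left to right (pre-order),
   handing out consecutive numbers starting from *next. */
static void assignSeqNos(Node *node, int *next)
{
    assert(node != NULL && next != NULL);
    node->seqNo = (*next)++;
    for (Node *child = node->firstChild; child != NULL; child = child->nextSibling)
        assignSeqNos(child, next);
}

/* Build the id attribute value an HTML element would carry.
   The "odt" prefix is an assumption. */
static void formatId(const Node *node, char *buf, size_t size)
{
    snprintf(buf, size, "odt%d", node->seqNo);
}

/* Find the concrete node for a given number, as put would need to do.
   Returns NULL if no node in the tree has that number. */
static Node *findBySeqNo(Node *node, int seqNo)
{
    if (node->seqNo == seqNo)
        return node;
    for (Node *child = node->firstChild; child != NULL; child = child->nextSibling) {
        Node *found = findBySeqNo(child, seqNo);
        if (found != NULL)
            return found;
    }
    return NULL;
}
```

A linear search is used for the lookup only to keep the sketch short; a real
filter would more likely keep a table from number to node.
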
> 
> The put operation will need to accept an existing ODT document, and update it 
> based on a modified version of the HTML file that was previously obtained 
> from the get operation. The way I did this in the Word filter was to traverse 
> both trees in “parallel”, determining what had changed (and using the element 
> mappings based on id attributes), making changes to the original document as 
> appropriate. In the case of formatting attributes, this involved 
> re-generating the CSS from the concrete document, comparing which attributes 
> had changed, and then applying the necessary changes to the formatting 
> elements in the concrete document. Content was handled differently, 
> generally by simply overwriting it.
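
The formatting comparison described above can be sketched as a diff between
two property lists: one regenerated from the concrete document and one taken
from the edited HTML. The CSSProp and PropChange types and the diffProps
function are hypothetical names for illustration, not the Word filter's own
data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flat representation of a set of CSS properties. */
typedef struct {
    const char *name;
    const char *value;
} CSSProp;

typedef enum { PROP_SET, PROP_REMOVE } ChangeKind;

typedef struct {
    ChangeKind kind;
    const char *name;
    const char *value;   /* NULL for PROP_REMOVE */
} PropChange;

/* Return the value of the named property, or NULL if it is absent. */
static const char *lookupProp(const CSSProp *props, size_t count, const char *name)
{
    for (size_t i = 0; i < count; i++)
        if (strcmp(props[i].name, name) == 0)
            return props[i].value;
    return NULL;
}

/* Record which properties must be set or removed in the concrete document
   to get from oldProps to newProps. The caller supplies room for maxOut
   changes; the number of changes written is returned. */
static size_t diffProps(const CSSProp *oldProps, size_t oldCount,
                        const CSSProp *newProps, size_t newCount,
                        PropChange *out, size_t maxOut)
{
    size_t n = 0;

    /* Properties that are new or whose value differs. */
    for (size_t i = 0; i < newCount; i++) {
        const char *oldValue = lookupProp(oldProps, oldCount, newProps[i].name);
        if (oldValue == NULL || strcmp(oldValue, newProps[i].value) != 0) {
            assert(n < maxOut);
            out[n++] = (PropChange){PROP_SET, newProps[i].name, newProps[i].value};
        }
    }

    /* Properties that existed before but are gone now. */
    for (size_t i = 0; i < oldCount; i++) {
        if (lookupProp(newProps, newCount, oldProps[i].name) == NULL) {
            assert(n < maxOut);
            out[n++] = (PropChange){PROP_REMOVE, oldProps[i].name, NULL};
        }
    }
    return n;
}
```

Only the resulting changes would then be applied to the formatting elements,
so anything the HTML did not touch stays exactly as it was in the original
file.
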
> 
> During traversal, the functions in DFBDT.c can be used to handle the case 
> where the children of a given element have been re-ordered (e.g. someone 
> moved a paragraph to a different position in the document). This uses the id 
> mappings 
> in the HTML to figure out what elements in the concrete document they 
> correspond to, and when it sees them in a different order, it moves some of 
> them so that they come to match the order in which the corresponding HTML 
> elements appear. Unsupported elements are left untouched by this process.
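
A simplified sketch of the re-ordering idea, not the algorithm in DFBDT.c:
supported children are rearranged to follow the order of their HTML
counterparts, while unsupported children keep their original slots. The Child
type and reorderChildren are hypothetical, and the sketch assumes that ids are
non-negative and that htmlOrder lists each supported id at most once.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a child of a concrete element.
   id < 0 marks an unsupported element with no HTML counterpart. */
typedef struct {
    int id;
    const char *name;
} Child;

/* Rearrange the supported children so that they follow htmlOrder, leaving
   every unsupported child in its original slot.
   Returns 1 on success, 0 if memory allocation fails. */
static int reorderChildren(Child *children, size_t count,
                           const int *htmlOrder, size_t orderCount)
{
    if (count == 0)
        return 1;

    Child *copy = malloc(count * sizeof *copy);
    if (copy == NULL)
        return 0;
    memcpy(copy, children, count * sizeof *copy);

    size_t slot = 0;
    for (size_t k = 0; k < orderCount; k++) {
        /* Find the concrete child this HTML element maps to. */
        size_t j = 0;
        while (j < count && copy[j].id != htmlOrder[k])
            j++;
        if (j == count)
            continue;   /* no concrete counterpart (a new element); skipped here */

        /* Advance to the next slot that originally held a supported child. */
        while (slot < count && copy[slot].id < 0)
            slot++;
        assert(slot < count);
        children[slot++] = copy[j];
    }

    free(copy);
    return 1;
}
```

For example, with concrete children [1, unsupported, 2, 3] and an HTML order
of 3, 1, 2, the result is [3, unsupported, 1, 2].
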
> 
> The create operation will need to produce a brand new ODT file based on an 
> HTML file. This can simply be implemented by creating an empty ODT file, and 
> then doing a put operation - it’s essentially “updating” an empty document to 
> which new content has been added.
> 
> The entry points for these three functions are DFGet, DFPut, and DFCreate in 
> api/src/Operations.c. These each have a switch statement which looks at the 
> file type and calls through to a function in the appropriate filter to do the 
> conversion. In the future we may need a more generic/pluggable way of doing 
> this, but for the time being, defining three functions ODTGet, ODTPut, and 
> ODTCreate (corresponding to the existing WordGet, WordPut, and WordCreate 
> functions) and adding cases to the switch statements for these will be 
> sufficient.
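
A rough sketch of what the dispatch described above might look like once ODT
cases are added. Every type and signature here is invented for illustration
(FileType, Doc, fileTypeFromName, and the parameter lists of all the
functions); the real declarations in Operations.c will differ. The filter
functions are stubs, and ODTCreate follows the "empty document plus put" idea
from the previous paragraph. DFGet would follow the same pattern.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* All types and signatures below are hypothetical. */
typedef enum { FILE_TYPE_UNKNOWN, FILE_TYPE_DOCX, FILE_TYPE_ODT } FileType;

typedef struct {
    char content[256];   /* stand-in for a real package or DOM */
} Doc;

/* Guess the file type from the filename extension. */
static FileType fileTypeFromName(const char *filename)
{
    const char *dot = strrchr(filename, '.');
    if (dot == NULL)
        return FILE_TYPE_UNKNOWN;
    if (strcmp(dot, ".docx") == 0)
        return FILE_TYPE_DOCX;
    if (strcmp(dot, ".odt") == 0)
        return FILE_TYPE_ODT;
    return FILE_TYPE_UNKNOWN;
}

/* Stub filter functions: return 1 on success, 0 on failure. */
static int WordPut(Doc *concrete, const Doc *abstract)
{
    snprintf(concrete->content, sizeof concrete->content, "docx:%s", abstract->content);
    return 1;
}

static int ODTPut(Doc *concrete, const Doc *abstract)
{
    snprintf(concrete->content, sizeof concrete->content, "odt:%s", abstract->content);
    return 1;
}

static int WordCreate(Doc *concrete, const Doc *abstract)
{
    concrete->content[0] = '\0';          /* start from an empty document */
    return WordPut(concrete, abstract);   /* then "update" it */
}

static int ODTCreate(Doc *concrete, const Doc *abstract)
{
    concrete->content[0] = '\0';
    return ODTPut(concrete, abstract);
}

/* Entry points: pick the filter based on the file type. */
static int DFPut(FileType type, Doc *concrete, const Doc *abstract)
{
    switch (type) {
    case FILE_TYPE_DOCX: return WordPut(concrete, abstract);
    case FILE_TYPE_ODT:  return ODTPut(concrete, abstract);
    default:             return 0;   /* unsupported format */
    }
}

static int DFCreate(FileType type, Doc *concrete, const Doc *abstract)
{
    switch (type) {
    case FILE_TYPE_DOCX: return WordCreate(concrete, abstract);
    case FILE_TYPE_ODT:  return ODTCreate(concrete, abstract);
    default:             return 0;
    }
}
```
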
> 
> It’s probably best to start off by having a look at these functions in 
> filters/ooxml/src/word/Word.c and following the code through there. If you’re 
> using Xcode, you can easily jump through the function call graph to go to the 
> implementation of a called function; I expect Visual Studio probably has 
> something similar. At any rate, I’ve mostly chosen function names that are 
> not prefixes of other function names, so it should be fairly easy to find the 
> function you’re looking for with grep if you don’t know what file it’s in 
> (this is something I love about C, which you can’t do so easily using 
> object-oriented languages).
> 
> The Word filter has two core classes used during conversion - WordPackage and 
> WordConverter (defined in their respective .h and .c files). A Word package 
> encapsulates a .docx file, and contains data structures loaded from the XML 
> files stored within the .docx package (which is actually a zip file). There 
> are classes for things like the stylesheet, numbering information, the set of 
> footnotes/endnotes, and so forth. For ODF, I already did a little bit of 
> work a while back defining skeleton versions of the corresponding classes 
> (ODFPackage, ODFManifest, and ODFSheet). The file ODF.c is empty but would be 
> a suitable place to put the get/put/create functions.
> 
> Data structures used in ODF differ somewhat from those of Word documents, 
> though there is a lot of conceptual similarity. The most significant 
> difference I can think of is the way that direct formatting is handled - ODF 
> treats *everything* as a style; if you apply direct formatting to a run of 
> text, then it creates what’s called an “automatic style” and references that 
> from the content. So styles, formatting, numbering, and numerous other things 
> will have to be represented differently, but many of the strategies used in 
> the Word filter should carry across fairly easily. I need to document these 
> better, but perhaps it’s easiest if you get stuck to ask me questions, and 
> then we can put these on the wiki or in the source documentation.
> 
> Anyway, this is just a braindump of what I think are the most relevant things 
> someone implementing an ODF filter will need to know. I’d love to be 
> pestered with more questions about this, as I think getting started on this 
> important task would be a good step forward for the project, and demonstrate 
> our commitment to making interoperability easier for people.
> 
> —
> Dr Peter M. Kelly
> [email protected]
> 
> PGP key: http://www.kellypmk.net/pgp-key
> (fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)
> 
