On 8 January 2015 at 16:59, Peter Kelly <[email protected]> wrote:

> > On 8 Jan 2015, at 10:16 am, Dave Fisher <[email protected]> wrote:
> >
> > Hi Peter,
> >
> > This is a helpful email from your concrete discussion I can better understand the mapping between the abstract / HTML model and the concrete / DOCX, ODT.
> >
> > You mention differences in the style runs for Word and ODT of which I am familiar from the OOXML side. Does the abstract model / HTML take a particular approach towards style runs? Is there a concrete version of the HTML model? Is there a specification or plan for the abstract model?
>
> As a general principle, no - a given filter is expected to handle arbitrary HTML.
>
> However, there is a function for “normalising” a HTML document to change nested sets of inline elements (span, b, i, etc.) into a flat sequence of runs (each represented as a span element). The Word filter uses this, due to Word’s flat model of inline runs.
>
> ODF text documents, on the other hand, *do* support nested formatting runs, so when writing this filter it may make sense not to apply the normalisation process used in the word filter. This should be done if there is information that could not be represented in HTML and would be lost by flattening the structure like we do for word.
>
> There’s been a few times where the topic of what internal representation we should use has been raised - whether we should stick with HTML, come up with our own entirely different model, or something else. I personally think HTML is a good choice, but perhaps for those who have raised the issue of an alternate intermediate form, this might be a good time to start that discussion ;)
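
To make the normalisation step described above concrete, here is a minimal sketch of the idea in Python. It is not the project's actual function: the names RunFlattener and flatten_runs and the INLINE_TAGS subset are all assumptions made for illustration, and it returns (formatting, text) pairs rather than writing span elements.

```python
from html.parser import HTMLParser

# Assumption: an illustrative subset of inline elements, not a project-defined list.
INLINE_TAGS = {"b", "i", "u", "span"}


class RunFlattener(HTMLParser):
    """Collects text as flat runs, each tagged with the formatting active at that point."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open inline elements, outermost first
        self.runs = []   # list of (formatting, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in INLINE_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Close the innermost open element with this name, if there is one.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i] == tag:
                del self.stack[i]
                break

    def handle_data(self, data):
        if not data:
            return
        # In this sketch a bare span carries no formatting of its own.
        fmt = tuple(sorted(set(self.stack) - {"span"}))
        if self.runs and self.runs[-1][0] == fmt:
            # Merge with the previous run when the formatting is identical.
            self.runs[-1] = (fmt, self.runs[-1][1] + data)
        else:
            self.runs.append((fmt, data))


def flatten_runs(html):
    """Return the flat list of (formatting, text) runs for an HTML fragment."""
    parser = RunFlattener()
    parser.feed(html)
    parser.close()  # flush any trailing text
    return parser.runs
```

For example, flatten_runs("<b>bold <i>both</i></b> plain") yields three runs: "bold " with ("b",), "both" with ("b", "i"), and " plain" with no formatting. A filter for a format with nested runs could simply skip this step and walk the tree as it is.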

Point taken; I assume I am the first who questioned it.

But just to be precise: I am happy having HTML as the internal structure, but I am unhappy that filters can do what they like with the HTML. My goal is to define a set of access functions that filters should use to navigate/insert/delete tags, together with restrictions on what can be put in the tags.

Just imagine one filter needs to identify some tags and therefore uses id=, while another filter needs to name some tags and therefore uses name=. If we are not careful here it will explode, and reading HTML becomes nearly as complicated as reading the formats directly.

We should have one and only one HTML definition, which the filters can use.

rgds
jan I.

> --
> Dr Peter M. Kelly
> [email protected]
>
> PGP key: http://www.kellypmk.net/pgp-key
> (fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)
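
To illustrate the kind of access functions meant above, here is a rough sketch in Python. Everything in it is hypothetical: the function names create_element, set_attribute and find_by_id, and the two whitelists, are placeholders for whatever we would actually agree on, and the standard xml.etree.ElementTree module stands in for our own DOM.

```python
import xml.etree.ElementTree as ET

# Assumption: hypothetical whitelists; the real sets would be agreed by the project.
ALLOWED_TAGS = {"p", "span", "b", "i", "table", "tr", "td"}
ALLOWED_ATTRS = {"id", "class", "style"}


def create_element(parent, tag):
    """Append a child element, refusing tags outside the shared definition."""
    if tag not in ALLOWED_TAGS:
        raise ValueError("tag not in shared HTML definition: " + tag)
    return ET.SubElement(parent, tag)


def set_attribute(element, name, value):
    """Set an attribute, refusing names outside the shared definition."""
    if name not in ALLOWED_ATTRS:
        raise ValueError("attribute not in shared HTML definition: " + name)
    element.set(name, value)


def find_by_id(root, ident):
    """Return the first element whose id matches, or None if there is none."""
    for element in root.iter():
        if element.get("id") == ident:
            return element
    return None
```

With something like this, a filter that tried set_attribute(p, "name", "x") would fail immediately instead of quietly creating a second way of labelling elements, so every filter reading the HTML only has to look for id=.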
