> On 8 Jan 2015, at 10:59 pm, Peter Kelly <[email protected]> wrote:
> 
>> On 8 Jan 2015, at 10:16 am, Dave Fisher <[email protected]> wrote:
>> 
>> Hi Peter,
>> 
>> This is a helpful email; from your concrete discussion I can better 
>> understand the mapping between the abstract / HTML model and the concrete / 
>> DOCX, ODT.
>> 
>> You mention differences in the style runs for Word and ODT, with which I am 
>> familiar from the OOXML side. Does the abstract model / HTML take a 
>> particular approach towards style runs? Is there a concrete version of the 
>> HTML model? Is there a specification or plan for the abstract model?
> 
> As a general principle, no - a given filter is expected to handle arbitrary 
> HTML.
> 
> However, there is a function for “normalising” an HTML document to change 
> nested sets of inline elements (span, b, i, etc.) into a flat sequence of 
> runs (each represented as a span element). The Word filter uses this, due to 
> Word’s flat model of inline runs.

Just thought I’d add a bit more detail on this, for anyone interested in 
exploring the implementation:

For .docx files, DFPut (api/src/Operations.c) calls WordPut 
(filters/ooxml/src/word/Word.c), which in turn creates a WordPackage object and 
then calls WordPackageUpdateFromHTML (filters/ooxml/src/word/WordPackage.c). 
The very first thing this does is to call HTML_normalizeDocument and 
HTML_pushDownInlineProperties (both in core/src/html/DFHTMLNormalization.c).

HTML_normalizeDocument merges adjacent text nodes (which in theory shouldn’t be 
necessary, but I found that sometimes libxml’s parser produces two or more in a 
row), and then goes through all the block-level elements, flattening any inline 
elements such that the resulting block node contains a series of spans, each 
with a style attribute set with the appropriate CSS formatting properties. For 
example, if you start with this:

<p>
    Here
    <b>
        is
        <i>
            some
        </i>
        text
    </b>
</p>

then you’ll end up with this:

<p>
    <span>Here</span>
    <span style="font-weight: bold">
        is
    </span>
    <span style="font-weight: bold; font-style: italic">
        some
    </span>
    <span style="font-weight: bold">
        text
    </span>
</p>
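
For anyone who wants to experiment with the idea outside the C code, here is a 
minimal sketch of the flattening step in Python. It is not a port of 
HTML_normalizeDocument; the tag-to-CSS table and the function names are 
assumptions made for illustration, it assumes the block contains only inline 
content, and it strips whitespace around each run to keep the output readable 
(a real implementation would have to preserve significant whitespace).

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from inline tags to the CSS they imply (small subset).
INLINE_CSS = {
    "b": {"font-weight": "bold"},
    "i": {"font-style": "italic"},
    "u": {"text-decoration": "underline"},
}

def parse_style(value):
    """Turn 'a: b; c: d' into an ordered dict of CSS declarations."""
    props = {}
    for decl in (value or "").split(";"):
        if ":" in decl:
            name, val = decl.split(":", 1)
            props[name.strip()] = val.strip()
    return props

def format_style(props):
    return "; ".join(f"{name}: {val}" for name, val in props.items())

def collect_runs(node, inherited, runs):
    """Walk an inline element, recording (properties, text) for each run."""
    props = dict(inherited)
    props.update(INLINE_CSS.get(node.tag, {}))
    props.update(parse_style(node.get("style")))
    if node.text and node.text.strip():
        runs.append((props, node.text.strip()))
    for child in node:
        collect_runs(child, props, runs)
        # Text after a child belongs to this element, not to the child.
        if child.tail and child.tail.strip():
            runs.append((props, child.tail.strip()))

def flatten_block(block):
    """Replace nested inline children of a block with a flat list of spans."""
    runs = []
    if block.text and block.text.strip():
        runs.append(({}, block.text.strip()))
    for child in list(block):
        collect_runs(child, {}, runs)
        if child.tail and child.tail.strip():
            runs.append(({}, child.tail.strip()))
    block.text = None
    for child in list(block):
        block.remove(child)
    for props, text in runs:
        span = ET.SubElement(block, "span")
        if props:
            span.set("style", format_style(props))
        span.text = text
    return block

p = ET.fromstring("<p>Here<b>is<i>some</i>text</b></p>")
print(ET.tostring(flatten_block(p), encoding="unicode"))
```

The key point is that each run carries the accumulated properties of all its 
ancestors, so the nesting can be discarded without losing formatting.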

HTML_pushDownInlineProperties checks block elements for any CSS properties that 
can be applied to inline formatting (such as font family, font size, text 
color) and moves them to the style attributes of the span elements within the 
block element. For example, the following:

<p style="border: 1px solid black; font-size: 18">
    <span>Some text</span>
</p>

would become this:

<p style="border: 1px solid black">
    <span style="font-size: 18">Some text</span>
</p>
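
The same transformation can be sketched in a few lines of Python. Again, this 
is only an illustration and not the code in DFHTMLNormalization.c: the set of 
properties treated as inline, the function names, and the rule that a span's 
own value takes precedence over one pushed down from the block are all 
assumptions on my part.

```python
import xml.etree.ElementTree as ET

# Assumed set of properties that belong on runs rather than on the block.
INLINE_PROPERTIES = {"font-family", "font-size", "color",
                     "font-weight", "font-style"}

def parse_style(value):
    """Turn 'a: b; c: d' into an ordered dict of CSS declarations."""
    props = {}
    for decl in (value or "").split(";"):
        if ":" in decl:
            name, val = decl.split(":", 1)
            props[name.strip()] = val.strip()
    return props

def format_style(props):
    return "; ".join(f"{name}: {val}" for name, val in props.items())

def push_down_inline_properties(block):
    """Move inline-only CSS properties from a block onto its child spans."""
    block_props = parse_style(block.get("style"))
    inline = {k: v for k, v in block_props.items() if k in INLINE_PROPERTIES}
    if not inline:
        return block
    remaining = {k: v for k, v in block_props.items()
                 if k not in INLINE_PROPERTIES}
    if remaining:
        block.set("style", format_style(remaining))
    else:
        block.attrib.pop("style", None)
    for span in block.findall("span"):
        merged = dict(inline)
        # Assumption: a value already set on the span overrides the block's.
        merged.update(parse_style(span.get("style")))
        span.set("style", format_style(merged))
    return block

p = ET.fromstring('<p style="border: 1px solid black; font-size: 18">'
                  '<span>Some text</span></p>')
print(ET.tostring(push_down_inline_properties(p), encoding="unicode"))
```

This assumes the block has already been flattened, so that every piece of text 
sits inside a direct child span.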

Both of these are pre-processing stages that happen before the primary 
traversal of the document tree begins, and the later code in the Word filter 
expects the HTML documents to conform to this more restrictive “dialect”. In 
the case of the inline properties, it’s because these settings have to go on 
the rPr elements in a Word document, and are not allowed on the pPr elements 
(that is, Word is more strict in terms of which formatting properties can be 
set where; HTML allows you to set “inline” formatting properties on any element 
using a style attribute). So this pre-processing is largely to match the needs 
of the Word filter, but it’s likely that an ODF text document filter will need 
some pre-processing as well.

As we add more formats, I expect we’ll discover some common places where the 
HTML input needs to be normalised to a certain form, and also places where 
it is better to leave it as-is. The ability to have nested inline elements in 
ODF is an example of the latter; we can probably avoid HTML_normalizeDocument 
in that case by having a direct relationship between HTML inline elements and 
ODF text-span elements. Depending on the situation, retaining such structure 
may be important - but that’s something I expect we’ll discover as we proceed 
with implementation.
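
To make the contrast concrete, here is a rough Python sketch of what a direct, 
structure-preserving mapping might look like. No such ODF filter code exists 
yet, so everything here is hypothetical: the function name, the style names 
"Bold" and "Italic", and the choice to map every inline element to a text:span. 
Only the ODF text namespace URI is taken from the ODF specification.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
ET.register_namespace("text", TEXT_NS)

# Hypothetical style names; a real filter would generate or look these up.
STYLE_NAMES = {"b": "Bold", "i": "Italic"}

def convert(node, is_block=True):
    """Map an HTML block to text:p and each nested inline to a text:span."""
    tag = "p" if is_block else "span"
    out = ET.Element(f"{{{TEXT_NS}}}{tag}")
    if not is_block:
        name = STYLE_NAMES.get(node.tag)
        if name:
            out.set(f"{{{TEXT_NS}}}style-name", name)
    out.text = node.text
    for child in node:
        converted = convert(child, is_block=False)
        converted.tail = child.tail  # keep text that follows the child
        out.append(converted)
    return out

html = ET.fromstring("<p>Here <b>is <i>some</i> text</b></p>")
print(ET.tostring(convert(html), encoding="unicode"))
```

Here the nesting of the input survives into the output, which is exactly what 
the flattening pass would throw away.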

—
Dr Peter M. Kelly
[email protected]

PGP key: http://www.kellypmk.net/pgp-key
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)
