Damian C. wrote:

Thanks for your recent support. It's all been very helpful and we've got
our docx import working almost exactly the way that we want it.

However, as you know, we sometimes receive client documents with strange
style choices and the one that I've attached is a good example.

For some reason there are paragraph titles which are styled bold and/or
italic in Word and yet when converted to dita these are imported as
plain paragraphs with no styling. In other words we're not getting the
semantic bold/italic tags that we'd expect within the paragraph.

Now I can see that the styles being applied (such as p-BodyTextBold) are
based on a style called s123456basebodytext which seems a bit strange.

Not a problem.



However I can't tell if this is the problem.

No.




So I was wondering if you can shed some light on this?

<b> and <i> elements are generated only for MS-Word *character* styles, not for MS-Word *paragraph* styles like those found in "Project foo.docx" (see them listed below).

When an MS-Word paragraph style specifies that all the text contained in the paragraph is by default bold and/or italic, this conveys no semantic meaning. If needed, you are supposed to map this MS-Word paragraph style to a DITA semantic element generally rendered as bold and/or italic (e.g. <title>).

In a nutshell, out of the box, w2x has no way to generate a <p> having all its text wrapped in a single <b> and/or <i>.



--> What follows is a relatively easy way to implement what you want (as far as I understand it).

w2x -f foo.options "Project foo.docx"  out.dita

where attached "foo.options" is:
---
-o topic
-p edit.remove-styles.preserved-classes "/^p-BodyTextBold/"
-t foo.xslt
---
(If you want a map, simply replace "-o topic" with "-o map")

-p edit.remove-styles.preserved-classes "/^p-BodyTextBold/" means: keep all classes whose names start with "p-BodyTextBold" in the intermediate semantic XHTML, which is then converted to DITA by means of W2X_install_dir/xslt/topic.xslt.

Attached "foo.xslt" is:
---
<xsl:stylesheet version="1.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:h="http://www.w3.org/1999/xhtml"
   exclude-result-prefixes="h">

<xsl:import href="w2x:xslt/topic.xslt"/>

<xsl:template match="h:p[@class = 'p-BodyTextBold']">
   <p>
     <xsl:call-template name="processCommonAttributes"/>
     <b><xsl:apply-templates/></b>
   </p>
</xsl:template>

<xsl:template match="h:p[@class = 'p-BodyTextBoldItalics']">
   <p>
     <xsl:call-template name="processCommonAttributes"/>
     <b><i><xsl:apply-templates/></i></b>
   </p>
</xsl:template>

</xsl:stylesheet>
---





PS Do you know if there's a way to output the intermediate xhtml/css
files which are used when converting from Word to Dita? I'd quite like
to look at these for debugging purposes but I can't find a parameter
which lets me keep these files.


--> If you want to look at the MS-Word styles converted to CSS styles, please convert your file to the "xhtml_css" format (the default output format) and then look for <style> inside the generated ".html" file.

Example:

w2x "Project foo.docx"  out.html

out.html contains:
---
...

.p-BodyTextBold {
     font-family: Arial;
     font-weight: bold;
     ...
}

.p-BodyTextBoldItalics {
     font-family: Arial;
     font-style: italic;
     font-weight: bold;
     ...
}

...
---
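
If you prefer not to open out.html in an editor, a small shell helper can print just the style sheet. This is only a sketch: the function name show_styles is made up for this example, and it assumes that the opening <style> tag and the closing </style> tag each sit on their own line in the generated file, which may not hold for every output.

```shell
# show_styles FILE (hypothetical helper, not part of w2x):
# print every line from the one containing "<style" up to and
# including the one containing "</style>".
# Assumption: the opening and closing tags are on separate lines.
show_styles() {
  sed -n '/<style/,/<\/style>/p' "$1"
}
```

Once out.html has been generated, "show_styles out.html" should list rules such as .p-BodyTextBold without the rest of the page.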



--> If you want to look at the intermediate semantic XHTML, which is then converted to DITA by means of W2X_install_dir/xslt/topic.xslt, please convert your file to the "xhtml_loose" format:

Example:

w2x -o xhtml_loose "Project foo.docx"  out.xhtml
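
To see at a glance which class attributes are present in that intermediate file, something like the following may help. It is a rough sketch, not a w2x feature: count_classes is a name invented for this example, and it relies on "grep -o", which GNU and BSD grep provide but POSIX does not require.

```shell
# count_classes FILE (hypothetical helper, not part of w2x):
# list each distinct class="..." attribute found in FILE,
# preceded by the number of times it occurs.
count_classes() {
  grep -o 'class="[^"]*"' "$1" | sort | uniq -c
}
```

Running "count_classes out.xhtml" after the conversion shows whether classes such as p-BodyTextBold are still there. A custom template matching on @class can only fire if the class it tests for appears in this list.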



Attachment: foo.xslt
Description: application/xslt

--
XMLmind Word To XML Support List
w2x-support@xmlmind.com
https://www.xmlmind.com/mailman/listinfo/w2x-support
