[dev] Improving HTML-output after import from MS-Word

larrydlefever Mon, 21 Sep 2009 13:24:43 -0700

per
 
http://svn.services.openoffice.org/opengrok/xref/DEV300_m59/sw/source/filter/html/htmlatr.cxx#1013


there seem to be certain rules regarding when a <P> tag is output when using
"Save as ..." in Writer (I tried to translate the German here into English
-- I need help with that, incidentally):

Ein <P> wird nur geschrieben, wenn
- wir in keiner OL/UL/DL sind, oder
- der Absatz einer OL/UL nicht numeriert ist, oder
- keine Styles exportiert werden und
     - ein unterer Abstand oder
     - eine Absatz-Ausrichtung existiert, oder
- Styles exportiert werden und,
     - die Textkoerper-Vorlage geaendert wurde, oder
     - ein Benutzer-Format exportiert wird, oder
     - Absatz-Attribute existieren


A <P> is written only if:
 - we are not inside any OL/UL/DL; or
 - the paragraph in an OL/UL is not numbered; or
 - no styles are being exported and
      - a bottom spacing exists, or
      - a paragraph alignment exists; or
 - styles are being exported and
      - the "Text body" paragraph style was changed, or
      - a user-defined format is being exported, or
      - paragraph attributes exist
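
To check that I'm reading the nesting of that comment correctly, here is a
minimal Java sketch of the rule as a single predicate.  It reflects only my
reading of the German comment, not the actual logic in htmlatr.cxx; the class
name, method name and every flag are hypothetical, and I've simplified by
treating OL/UL/DL as one "in a list" flag.

```java
final class ParagraphTagRule {
    /**
     * Returns true if, by my reading of the comment, a <P> would be written.
     * All parameters are made-up flags; none of them exist in htmlatr.cxx.
     */
    static boolean shouldWriteP(boolean inList,
                                boolean numbered,
                                boolean exportingStyles,
                                boolean hasBottomSpacing,
                                boolean hasAlignment,
                                boolean textBodyStyleChanged,
                                boolean userFormatExported,
                                boolean hasParagraphAttributes) {
        if (!inList) {
            return true;                 // not inside any OL/UL/DL
        }
        if (!numbered) {
            return true;                 // list paragraph without a number
        }
        if (!exportingStyles) {
            // no styles exported: spacing or alignment forces a <P>
            return hasBottomSpacing || hasAlignment;
        }
        // styles exported: any of these three forces a <P>
        return textBodyStyleChanged || userFormatExported || hasParagraphAttributes;
    }

    public static void main(String[] args) {
        // numbered list item, no styles exported, but with a bottom spacing
        System.out.println(
            shouldWriteP(true, true, false, true, false, false, false, false));
    }
}
```

If that reading is right, a numbered list item gets a <P> as soon as it
carries any spacing, alignment or paragraph attribute, which would fit the
<p>-inside-list cases I describe below.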

I want to know if I'd need to hack that native code there, in order to get
cleaner HTML-output than I'm currently getting from OpenOffice.

Incidentally, I've also tried Exporting as XHTML, but the resultant output
is even worse than that from "Save as ...": stuff that should not appear in
a list does so, etc.

I've tweaked the Java-example servlet for document-conversion, so it takes
an MS-Word doc as upload and returns (really just the file:/// URL of) an
HTML-document.

In my code, I set the filter like this:

                // Setting the filter name
                propertyvalue[1] = new PropertyValue();
                propertyvalue[1].Name = "FilterName";
                propertyvalue[1].Value = "HTML (StarWriter)";

... which I believe means, effectively, "Save as ...", rather than "Export",
the latter involving a different area of the OpenOffice codebase, if I'm not
mistaken.

I've seen some documentation on using XSLT to configure or customize the
Export process, but, as I've just noted, the Export output seems worse than
the output I'm getting (which I believe is from "Save as ..." instead of
Export).

The problem is that the result (which is, at this point, a resume) comes out
looking double-spaced.  Also, there are two or three cases of another
formatting-issue that seem to have to do with <p>-tags (or divs) within one
or another type of HTML-list.

So, what's the best way to make the desired improvements in the HTML-output?

Should I just do some quick-and-dirty post-processing in my Java-code
(which, however, means processing the same file twice, essentially)?  Or
should I go deep into that native code to try to fix the relevant filter? 
Or is there a way to use XSLT in this case that I'm missing?
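
To make the quick-and-dirty option concrete, here is a minimal sketch of the
kind of post-processing I have in mind.  It only handles one case: a list
item whose entire content is a single <P> wrapper.  The class and method
names are hypothetical, the sample markup in main() is made up rather than
copied from my actual output, and a regex is obviously not a real HTML
parser, so treat this as an assumption-laden illustration only.

```java
import java.util.regex.Pattern;

final class HtmlListCleanup {
    // <LI ...>, then one <P ...>...</P> with no nested <P>, then </LI>.
    // Case-insensitive because the tags may be upper- or lower-case.
    private static final Pattern LI_WRAPPING_P = Pattern.compile(
        "(<li\\b[^>]*>)\\s*<p\\b[^>]*>((?:(?!</?p\\b).)*)</p>\\s*(</li>)",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Drops the <P> wrapper from list items that contain exactly one paragraph. */
    static String unwrapListParagraphs(String html) {
        return LI_WRAPPING_P.matcher(html).replaceAll("$1$2$3");
    }

    public static void main(String[] args) {
        // Made-up sample input, not real converter output.
        String in = "<UL><LI><P STYLE=\"margin-bottom: 0.5cm\">Java</P></LI></UL>";
        System.out.println(unwrapListParagraphs(in));
    }
}
```

List items that hold more than one paragraph are deliberately left alone,
since unwrapping them would merge the paragraphs.  This does nothing about
the double-spaced look, which I'd have to handle separately.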



-- 
View this message in context: 
http://www.nabble.com/Improving-HTML-output-after-import-from-MS-Word-tp25530467p25530467.html
Sent from the openoffice - dev mailing list archive at Nabble.com.

