[dev] Improving HTML-output after import from MS-Word (again -- the sequel)
(The system ostensibly accepted two previous messages, which then remained pending indefinitely, rather than explicitly checking them against the list subscription, e.g. "you're not subscribed"; it said only that I MAY need to subscribe. Anyway, I guess this try might hang indefinitely too. Fingers crossed that it won't.)

Per http://svn.services.openoffice.org/opengrok/xref/DEV300_m59/sw/source/filter/html/htmlatr.cxx#1013 there seem to be certain rules regarding when a P tag is output when using Save as ... in Writer. (I tried to translate the German here into English; I need help with that, incidentally.)

  Ein P wird nur geschrieben, wenn
  - wir in keiner OL/UL/DL sind, oder
  - der Absatz einer OL/UL nicht numeriert ist, oder
  - keine Styles exportiert werden und
      - ein unterer Abstand oder
      - eine Absatz-Ausrichtung existiert, oder
  - Styles exportiert werden und,
      - die Textkoerper-Vorlage geaendert wurde, oder
      - ein Benutzer-Format exportiert wird, oder
      - Absatz-Attribute existieren

  A P is written only if:
  - we are not inside any OL/UL/DL; or
  - the paragraph of an OL/UL is not numbered; or
  - no styles are being exported and
      - a bottom spacing (space below the paragraph) exists, or
      - a paragraph alignment exists; or
  - styles are being exported and
      - the "Text body" paragraph style was changed, or
      - a user-defined format is being exported, or
      - paragraph attributes exist

I want to know if I'd need to hack that native code there in order to get cleaner HTML output than I'm currently getting from OpenOffice.

Incidentally, I've also tried exporting as XHTML, but the resulting output is even worse than that from Save as ...: stuff that should not appear in a list does so, etc.

I've tweaked the Java example servlet for document conversion so that it takes an MS Word doc as upload and returns (really just the file:/// URL of) an HTML document. I do it like so in my code:

    // Setting the filter name
    propertyvalue[1] = new PropertyValue();
    propertyvalue[1].Name = "FilterName";
    propertyvalue[1].Value = "HTML (StarWriter)";
    ...

which I believe means, effectively, Save as ..., rather than Export, the latter involving a different area of the OpenOffice codebase, if I'm not mistaken.

I've seen some documentation on using XSLT to configure or customize the Export process but, as I've just noted, the Export output seems worse than the output I'm getting (which I believe is from Save as ... instead of Export).

The problem is that the result (which is, at this point, a resume) comes out looking double-spaced. Also, there are two or three cases of another formatting issue that seems to have to do with p tags (or divs) within one or another type of HTML list.

So, what's the best way to make the desired improvements in the HTML output? Should I just do some quick-and-dirty post-processing in my Java code (which, however, essentially means processing the same file twice)? Or should I go deep into that native code to try to fix the relevant filter? Or is there a way to use XSLT in this case that I'm missing?

--
View this message in context: http://www.nabble.com/Improving-HTML-output-after-import-from-MS-Word--%28againthe-sequel%29-tp25531251p25531251.html
Sent from the openoffice - dev mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@openoffice.org
For additional commands, e-mail: dev-h...@openoffice.org
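For the quick-and-dirty post-processing option raised in the message above, a minimal sketch might look like the following. It is not taken from the thread or from OpenOffice: the class name `HtmlCleanup`, the method `tidy`, and both regular expressions are hypothetical. It also assumes, without confirmation from the thread, that the unwanted spacing comes from single P elements wrapped inside list items and from empty paragraphs such as `<P><BR></P>`. If the real output differs, the patterns would need adjusting.

```java
import java.util.regex.Pattern;

/** Hypothetical helper; the class and method names are illustrative only. */
final class HtmlCleanup {

    // <li><p ...>text</p></li> becomes <li>text</li>, but only when the item
    // holds exactly one paragraph: the inner group refuses to cross another
    // <p> or </p>, so multi-paragraph items are left untouched.
    // Intended for short list items, not for very large blocks of markup.
    private static final Pattern SINGLE_P_IN_LI = Pattern.compile(
            "(?is)(<li\\b[^>]*>)\\s*<p\\b[^>]*>((?:(?!</?p\\b).)*)</p>\\s*(</li>)");

    // Paragraphs containing nothing but whitespace, one <br>, or one &nbsp;.
    private static final Pattern EMPTY_P = Pattern.compile(
            "(?is)<p\\b[^>]*>\\s*(?:<br\\s*/?>|&nbsp;)?\\s*</p>");

    /** Applies both clean-up steps to an HTML string and returns the result. */
    static String tidy(String html) {
        String out = SINGLE_P_IN_LI.matcher(html).replaceAll("$1$2$3");
        return EMPTY_P.matcher(out).replaceAll("");
    }

    public static void main(String[] args) {
        String sample = "<UL><LI><P STYLE=\"margin-bottom: 0in\">Java</P></LI></UL>"
                + "<P><BR></P><P>Next section</P>";
        System.out.println(tidy(sample));
    }
}
```

Regex-based clean-up like this is fragile against arbitrary HTML, so it is only a stopgap; it does, however, avoid touching either the native filter or the XSLT, at the cost of the second pass over the file mentioned above.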
Re: [dev] Improving HTML-output after import from MS-Word (again)
Larry,

Sorry you are having trouble with the ml. If you can't get it straightened out yourself, ask for help on dev-web. Meanwhile, here's the response you got to your first try.

HTH,
/tj/

Holger Meyer wrote:
> I noticed this reply to your message on the list (from mba). Seems like you did not get it?

--
/tj/ T. J. Frazier
Melbourne, FL
(TJFrazier on OO.o)
Re: [dev] Improving HTML-output after import from MS-Word
larrydlefever wrote:
> per http://svn.services.openoffice.org/opengrok/xref/DEV300_m59/sw/source/filter/html/htmlatr.cxx#1013 there seem to be certain rules regarding when a P tag is output when using Save as ... in Writer
> [...]
> I want to know if I'd need to hack that native code there, in order to get cleaner HTML-output than I'm currently getting from OpenOffice.

Yes.

> Incidentally, I've also tried Exporting as XHTML, but the resultant output is even worse than that from Save as ...: stuff that should not appear in a list does so, etc.

Could you create an issue with a sample document showing the problem and assign it to sus?

> I've tweaked the Java-example servlet for document-conversion, so it takes an MS-Word doc as upload and returns (really just the file:/// URL of) an HTML-document. I do like so in my code:
>
>     // Setting the filter name
>     propertyvalue[1] = new PropertyValue();
>     propertyvalue[1].Name = "FilterName";
>     propertyvalue[1].Value = "HTML (StarWriter)";
>     ...
>
> which I believe means, effectively, Save as ..., rather than Export, the latter involving a different area of the OpenOffice codebase, if I'm not mistaken.

Whether SaveAs or Export is chosen just depends on whether you use storeAsURL or storeToURL. The only difference is that in one case (storeAsURL) the document takes over the new location, while in the other (storeToURL) it doesn't. The GUI code around these two functions also offers different filters in each case, but that's a limitation you don't have when using the API. All filters suitable for SaveAs can also be used for Export (but not the other way around, as storeAsURL accepts only filters for formats that OOo can load).

> So, what's the best way to make the desired improvements in the HTML-output?

As both filters (the C++ one for HTML as well as the XSLT-based one for XHTML) seem to fail for you, the best way is probably the one you are more familiar with. If you know something about XSLT, hacking the XSLT for XHTML is perhaps better, because the native filter requires not only good C++ knowledge but also getting familiar with an unpredictable amount of OOo code (what exactly you will need to know depends on where your journey takes you).

Regards,
Mathias

--
Mathias Bauer (mba) - Project Lead OpenOffice.org Writer
OpenOffice.org Engineering at Sun: http://blogs.sun.com/GullFOSS
Please don't reply to nospamfor...@gmx.de. I use it for the OOo lists and only rarely read other mails sent to it.
[dev] Improving HTML-output after import from MS-Word (again)
(My last try with this post was left pending for nearly 24 hours, without explanation; sorry if redundant.)

Per http://svn.services.openoffice.org/opengrok/xref/DEV300_m59/sw/source/filter/html/htmlatr.cxx#1013 there seem to be certain rules regarding when a P tag is output when using Save as ... in Writer. (I tried to translate the German here into English; I need help with that, incidentally.)

  Ein P wird nur geschrieben, wenn
  - wir in keiner OL/UL/DL sind, oder
  - der Absatz einer OL/UL nicht numeriert ist, oder
  - keine Styles exportiert werden und
      - ein unterer Abstand oder
      - eine Absatz-Ausrichtung existiert, oder
  - Styles exportiert werden und,
      - die Textkoerper-Vorlage geaendert wurde, oder
      - ein Benutzer-Format exportiert wird, oder
      - Absatz-Attribute existieren

  A P is written only if:
  - we are not inside any OL/UL/DL; or
  - the paragraph of an OL/UL is not numbered; or
  - no styles are being exported and
      - a bottom spacing (space below the paragraph) exists, or
      - a paragraph alignment exists; or
  - styles are being exported and
      - the "Text body" paragraph style was changed, or
      - a user-defined format is being exported, or
      - paragraph attributes exist

I want to know if I'd need to hack that native code there in order to get cleaner HTML output than I'm currently getting from OpenOffice.

The problem is double-spaced output where it should be single-spaced, plus the occasional other glitch seemingly having to do with p tags within certain HTML lists.

--
View this message in context: http://www.nabble.com/Improving-HTML-output-after-import-from-MS-Word--%28again%29-tp25530876p25530876.html
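To make the htmlatr.cxx comment quoted in the message above easier to reason about, its conditions can be modeled as a small predicate. This is only a sketch of the comment as translated, not of the actual C++ code, which may well behave differently. Every name here (`ParagraphTagRule`, `writesP`, and all parameters) is hypothetical, and the grouping of the last three conditions under "styles are being exported" is an assumption based on the wording of the German comment.

```java
/** Hypothetical model of the comment; none of these names come from the OOo sources. */
final class ParagraphTagRule {

    /** Returns true when, according to the comment, a P tag would be written. */
    static boolean writesP(boolean inList,
                           boolean numbered,
                           boolean exportingStyles,
                           boolean hasBottomSpacing,
                           boolean hasAlignment,
                           boolean textBodyStyleChanged,
                           boolean userFormatExported,
                           boolean hasParagraphAttributes) {
        if (!inList) {
            return true;   // not inside any OL/UL/DL
        }
        if (!numbered) {
            return true;   // paragraph of an OL/UL that carries no number
        }
        if (!exportingStyles) {
            // Assumed grouping: these two conditions apply only without styles.
            return hasBottomSpacing || hasAlignment;
        }
        // Assumed grouping: these three conditions apply only with styles.
        return textBodyStyleChanged || userFormatExported || hasParagraphAttributes;
    }

    public static void main(String[] args) {
        // A numbered list item with bottom spacing, styles not exported.
        System.out.println(writesP(true, true, false, true, false, false, false, false));
    }
}
```

Read this way, a numbered list item imported from Word with any spacing below the paragraph would still get a P tag, which might be one source of P elements inside lists; that is a guess to check against real output, not something the thread confirms.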
Re: [dev] Improving HTML-output after import from MS-Word (again)
I noticed this reply to your message on the list (from mba). Seems like you did not get it?