[dev] Improving HTML-output after import from MS-Word (again -- the sequel)

2009-09-23 Thread larrydlefever

(The system ostensibly accepted my two previous messages, but they stayed
pending indefinitely instead of being rejected explicitly on
list-subscription grounds, e.g., "you're not subscribed; you may need to
subscribe". I guess this try might hang indefinitely too; fingers crossed
that it won't.)

per
 
http://svn.services.openoffice.org/opengrok/xref/DEV300_m59/sw/source/filter/html/htmlatr.cxx#1013

there seem to be certain rules regarding when a P tag is output when using
Save as ... in Writer (I tried to translate the German here into English
-- I need help with that, incidentally):

Ein P wird nur geschrieben, wenn
- wir in keiner OL/UL/DL sind, oder
- der Absatz einer OL/UL nicht numeriert ist, oder
- keine Styles exportiert werden und
 - ein unterer Abstand oder
 - eine Absatz-Ausrichtung existiert, ode
- Styles exportiert werden und,
 - die Textkoerper-Vorlage geaendert wurde, oder
 - ein Benutzer-Format exportiert wird, oder
 - Absatz-Attribute existieren


A P is written only if:
 - we're not inside any OL/UL/DL; or
 - the paragraph in an OL/UL is not numbered; or
 - no Styles are being exported and
   - a bottom spacing (space below the paragraph) exists, or
   - a paragraph alignment exists; or
 - Styles are being exported and
   - the "Text body" paragraph style was changed, or
   - a User-defined format is being exported, or
   - paragraph-attributes exist
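
Read as a predicate, the German comment could be sketched in Java roughly as
below. This is only a paraphrase of the comment, not the C++ code in
htmlatr.cxx; the class name, the enum and every parameter name are made up
for illustration, and the and/or grouping follows the indentation of the
comment, which is itself an assumption.

```java
final class ParagraphTagRule {
    // Hypothetical model of "which kind of HTML list are we in".
    enum ListKind { NONE, OL, UL, DL }

    // Returns true when, per the comment, a P tag would be written.
    static boolean shouldWriteP(ListKind list, boolean numbered, boolean exportStyles,
                                boolean bottomSpacing, boolean alignment,
                                boolean textBodyStyleChanged, boolean userFormat,
                                boolean paragraphAttributes) {
        // "wir in keiner OL/UL/DL sind"
        if (list == ListKind.NONE) return true;
        // "der Absatz einer OL/UL nicht numeriert ist"
        if ((list == ListKind.OL || list == ListKind.UL) && !numbered) return true;
        // "keine Styles exportiert werden und ..."
        if (!exportStyles) return bottomSpacing || alignment;
        // "Styles exportiert werden und ..."
        return textBodyStyleChanged || userFormat || paragraphAttributes;
    }

    public static void main(String[] args) {
        // A numbered list item with bottom spacing, styles not exported.
        System.out.println(shouldWriteP(ListKind.OL, true, false,
                                        true, false, false, false, false));
    }
}
```

On this reading, a numbered list paragraph that carries bottom spacing still
gets a P when Styles are not exported, which might account for some of the
P tags inside list items; that is a guess from the comment, not something
checked against the code.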

I want to know if I'd need to hack that native code there, in order to get
cleaner HTML-output than I'm currently getting from OpenOffice.

Incidentally, I've also tried Exporting as XHTML, but the resultant output
is even worse than that from Save as ...: stuff that should not appear in
a list does so, etc.

I've tweaked the Java-example servlet for document-conversion, so it takes
an MS-Word doc as upload and returns (really just the file:/// URL of) an
HTML-document.

I do like so in my code:

// Setting the filter name
propertyvalue[1] = new PropertyValue();
propertyvalue[1].Name = "FilterName";
propertyvalue[1].Value = "HTML (StarWriter)";

... which I believe means, effectively, Save as ..., rather than Export,
the latter involving a different area of the OpenOffice codebase, if I'm not
mistaken.

I've seen some documentation on using XSLT to configure or customize the
Export process, but, as I've just noted, the Export output seems worse than
the output I'm getting (which I believe is from Save as ... instead of
Export).

The problem is that the result (which is, at this point, a resume) comes out
looking double-spaced.  Also, there are two or three cases of another
formatting-issue that seem to have to do with p-tags (or divs) within one
or another type of HTML-list.

So, what's the best way to make the desired improvements in the HTML-output?

Should I just do some quick-and-dirty post-processing in my Java-code
(which, however, means processing the same file twice, essentially)?  Or
should I go deep into that native code to try to fix the relevant filter? 
Or is there a way to use XSLT in this case that I'm missing? 
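
For the quick-and-dirty post-processing option, a minimal sketch might look
like the following. It covers only the case of a single P wrapped directly
inside an LI and leaves everything else alone; the class and method names
are made up, and a regex is used only because the pass is meant to be quick,
not because it is a robust way to handle HTML.

```java
import java.util.regex.Pattern;

final class HtmlPostProcessor {
    // An <li>, then exactly one <p ...>...</p> with no nested <p>, then </li>.
    // "<p\b" does not match "<pre" because there is no word boundary after the p.
    private static final Pattern P_IN_LI = Pattern.compile(
        "(<li\\b[^>]*>)\\s*<p\\b[^>]*>((?:(?!</?p\\b).)*?)</p>\\s*(</li>)",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Drops the <p> wrapper (and its attributes) but keeps its content.
    static String unwrapParagraphsInListItems(String html) {
        return P_IN_LI.matcher(html).replaceAll("$1$2$3");
    }

    public static void main(String[] args) {
        String html = "<ul><li><p style=\"margin-bottom: 0in\">Java</p></li></ul>";
        System.out.println(unwrapParagraphsInListItems(html));
    }
}
```

List items that hold two or more paragraphs do not match the pattern and are
passed through unchanged, so the pass cannot merge paragraphs by accident.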
-- 
View this message in context: 
http://www.nabble.com/Improving-HTML-output-after-import-from-MS-Word--%28againthe-sequel%29-tp25531251p25531251.html
Sent from the openoffice - dev mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: dev-unsubscr...@openoffice.org
For additional commands, e-mail: dev-h...@openoffice.org



Re: [dev] Improving HTML-output after import from MS-Word (again)

2009-09-23 Thread T. J. Frazier

Larry,
Sorry you are having trouble with the ml. If you can't get it 
straightened out yourself, ask for help on dev-web.

Meanwhile, here's the response you got to your first try.
HTH, /tj/

Holger Meyer wrote:

I noticed this reply to your message on the list (from mba). Seems like
you did not get it?


larrydlefever wrote:



per
 
http://svn.services.openoffice.org/opengrok/xref/DEV300_m59/sw/source/filter/html/htmlatr.cxx#1013


there seem to be certain rules regarding when a P tag is output when using
Save as ... in Writer (I tried to translate the German here into English
-- I need help with that, incidentally):

Ein P wird nur geschrieben, wenn
- wir in keiner OL/UL/DL sind, oder
- der Absatz einer OL/UL nicht numeriert ist, oder
- keine Styles exportiert werden und
 - ein unterer Abstand oder
 - eine Absatz-Ausrichtung existiert, ode
- Styles exportiert werden und,
 - die Textkoerper-Vorlage geaendert wurde, oder
 - ein Benutzer-Format exportiert wird, oder
 - Absatz-Attribute existieren


A P is written only if:
 - we're not inside any OL/UL/DL; or
 - the paragraph in an OL/UL is not numbered; or
 - no Styles are being exported and
   - a bottom spacing (space below the paragraph) exists, or
   - a paragraph alignment exists; or
 - Styles are being exported and
   - the "Text body" paragraph style was changed, or
   - a User-defined format is being exported, or
   - paragraph-attributes exist

I want to know if I'd need to hack that native code there, in order to get
cleaner HTML-output than I'm currently getting from OpenOffice.
  


Yes.



Incidentally, I've also tried Exporting as XHTML, but the resultant output
is even worse than that from Save as ...: stuff that should not appear in
a list does so, etc.
  


Could you create an issue with a sample document showing the problem and
assign it to sus?



I've tweaked the Java-example servlet for document-conversion, so it takes
an MS-Word doc as upload and returns (really just the file:/// URL of) an
HTML-document.

I do like so in my code:

// Setting the filter name
propertyvalue[1] = new PropertyValue();
propertyvalue[1].Name = "FilterName";
propertyvalue[1].Value = "HTML (StarWriter)";

... which I believe means, effectively, Save as ..., rather than Export,
the latter involving a different area of the OpenOffice codebase, if I'm not
mistaken.
  


Whether SaveAs or Export is chosen just depends on whether you use
storeAsURL or storeToURL. The difference is only that in one case
the document takes over the new location while in the other it doesn't.
The GUI stuff around these two functions also uses different filters in
both areas, but that's a limitation you don't have when using the API.
All filters suitable for SaveAs can be used for Export also (but not
the other way around as only filters for formats that OOo can load will
be accepted in storeAsURL).



So, what's the best way to make the desired improvements in the HTML-output?
  


As both filters (the C++ one for HTML as well as the xslt based one for
XHTML) seem to fail for you, the best way probably is the one you are
more familiar with. If you know something about xslt, perhaps hacking
the xslt for XHTML is better, because the native filter not only
requires good C++ knowledge but also getting familiar with an
unpredictable amount of OOo code (what exactly you will need to know
depends on where your journey will take you).

Regards,
Mathias

-- 
Mathias Bauer (mba) - Project Lead OpenOffice.org Writer
OpenOffice.org Engineering at Sun: http://blogs.sun.com/GullFOSS
Please don't reply to nospamfor...@gmx.de.
I use it for the OOo lists and only rarely read other mails sent to it.




--
/tj/

T. J. Frazier
Melbourne, FL

(TJFrazier on OO.o)





Re: [dev] Improving HTML-output after import from MS-Word

2009-09-22 Thread Mathias Bauer
larrydlefever wrote:

 per
  
 http://svn.services.openoffice.org/opengrok/xref/DEV300_m59/sw/source/filter/html/htmlatr.cxx#1013
 
 there seem to be certain rules regarding when a P tag is output when using
 Save as ... in Writer (I tried to translate the German here into English
 -- I need help with that, incidentally):
 
 Ein P wird nur geschrieben, wenn
 - wir in keiner OL/UL/DL sind, oder
 - der Absatz einer OL/UL nicht numeriert ist, oder
 - keine Styles exportiert werden und
  - ein unterer Abstand oder
  - eine Absatz-Ausrichtung existiert, ode
 - Styles exportiert werden und,
  - die Textkoerper-Vorlage geaendert wurde, oder
  - ein Benutzer-Format exportiert wird, oder
  - Absatz-Attribute existieren
 
 
 A P is written only if:
  - we're not inside any OL/UL/DL; or
  - the paragraph in an OL/UL is not numbered; or
  - no Styles are being exported and
    - a bottom spacing (space below the paragraph) exists, or
    - a paragraph alignment exists; or
  - Styles are being exported and
    - the "Text body" paragraph style was changed, or
    - a User-defined format is being exported, or
    - paragraph-attributes exist
 
 I want to know if I'd need to hack that native code there, in order to get
 cleaner HTML-output than I'm currently getting from OpenOffice.

Yes.

 Incidentally, I've also tried Exporting as XHTML, but the resultant output
 is even worse than that from Save as ...: stuff that should not appear in
 a list does so, etc.

Could you create an issue with a sample document showing the problem and
assign it to sus?

 I've tweaked the Java-example servlet for document-conversion, so it takes
 an MS-Word doc as upload and returns (really just the file:/// URL of) an
 HTML-document.
 
 I do like so in my code:
 
   // Setting the filter name
   propertyvalue[1] = new PropertyValue();
   propertyvalue[1].Name = "FilterName";
   propertyvalue[1].Value = "HTML (StarWriter)";
 
 ... which I believe means, effectively, Save as ..., rather than Export,
 the latter involving a different area of the OpenOffice codebase, if I'm not
 mistaken.

Whether SaveAs or Export is chosen just depends on whether you use
storeAsURL or storeToURL. The difference is only that in one case
the document takes over the new location while in the other it doesn't.
The GUI stuff around these two functions also uses different filters in
both areas, but that's a limitation you don't have when using the API.
All filters suitable for SaveAs can be used for Export also (but not
the other way around as only filters for formats that OOo can load will
be accepted in storeAsURL).
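
The difference between the two calls can be illustrated with a toy model
like the one below. Nothing here is the real UNO API: PropertyValue is a
local stand-in for com.sun.star.beans.PropertyValue (so the sketch compiles
without the OOo jars), and ToyDocument only mimics the storeAsURL and
storeToURL methods of com.sun.star.frame.XStorable. The restriction that
storeAsURL accepts only filters for loadable formats is not modelled.

```java
import java.util.ArrayList;
import java.util.List;

// Local stand-in with the same public Name/Value fields as the UNO struct.
final class PropertyValue {
    public String Name;
    public Object Value;
}

// Toy document: records what was written and where the document "lives".
final class ToyDocument {
    private String location;
    private final List<String> written = new ArrayList<>();

    ToyDocument(String location) { this.location = location; }

    // "Save as": write the file and take over the new location.
    void storeAsURL(String url, PropertyValue[] args) {
        written.add(url + " via " + filterName(args));
        location = url;
    }

    // "Export": write the file; the document keeps its old location.
    void storeToURL(String url, PropertyValue[] args) {
        written.add(url + " via " + filterName(args));
    }

    String getLocation() { return location; }
    List<String> getWritten() { return written; }

    private static String filterName(PropertyValue[] args) {
        for (PropertyValue p : args) {
            if ("FilterName".equals(p.Name)) return String.valueOf(p.Value);
        }
        return "(default filter)";
    }
}

final class StoreDemo {
    // The same filter argument works for either call.
    static PropertyValue[] htmlFilterArgs() {
        PropertyValue filter = new PropertyValue();
        filter.Name = "FilterName";
        filter.Value = "HTML (StarWriter)";
        return new PropertyValue[] { filter };
    }
}
```

In the servlet case, where only the converted file matters and the loaded
document is discarded afterwards, either call would presumably do; the
filter name, not the method, selects the HTML writer.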

 So, what's the best way to make the desired improvements in the HTML-output?

As both filters (the C++ one for HTML as well as the xslt based one for
XHTML) seem to fail for you, the best way probably is the one you are
more familiar with. If you know something about xslt, perhaps hacking
the xslt for XHTML is better, because the native filter not only
requires good C++ knowledge but also getting familiar with an
unpredictable amount of OOo code (what exactly you will need to know
depends on where your journey will take you).
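
To give an idea of the kind of XSLT rule involved, here is a small sketch
that applies a stylesheet from Java with javax.xml.transform. The stylesheet
is not the one shipped with OOo for XHTML export; it is a generic identity
transform with one extra template, and the class name is made up. The sample
input has no namespace; elements in real XHTML output live in the XHTML
namespace, so the match pattern would need a prefix bound to it.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class XsltCleanup {
    // Identity transform, plus: a <p> that is the only element child of an
    // <li> is replaced by its own content.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='li/p[count(../*) = 1]'>"
      + "<xsl:apply-templates select='node()'/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Runs the stylesheet over a well-formed XML string and returns the result.
    static String apply(String xhtml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT cleanup failed", e);
        }
    }
}
```

The more specific template takes precedence over the identity template for
the matching P elements, so all other content is copied through as is.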

Regards,
Mathias

-- 
Mathias Bauer (mba) - Project Lead OpenOffice.org Writer
OpenOffice.org Engineering at Sun: http://blogs.sun.com/GullFOSS
Please don't reply to nospamfor...@gmx.de.
I use it for the OOo lists and only rarely read other mails sent to it.





[dev] Improving HTML-output after import from MS-Word (again)

2009-09-22 Thread larrydlefever

(my last try with this post was left pending for nearly 24 hours, without
explanation; sorry if redundant)

per
 
http://svn.services.openoffice.org/opengrok/xref/DEV300_m59/sw/source/filter/html/htmlatr.cxx#1013

there seem to be certain rules regarding when a P tag is output when using
Save as ... in Writer (I tried to translate the German here into English
-- I need help with that, incidentally):

Ein P wird nur geschrieben, wenn
- wir in keiner OL/UL/DL sind, oder
- der Absatz einer OL/UL nicht numeriert ist, oder
- keine Styles exportiert werden und
 - ein unterer Abstand oder
 - eine Absatz-Ausrichtung existiert, ode
- Styles exportiert werden und,
 - die Textkoerper-Vorlage geaendert wurde, oder
 - ein Benutzer-Format exportiert wird, oder
 - Absatz-Attribute existieren


A P is written only if:
 - we're not inside any OL/UL/DL; or
 - the paragraph in an OL/UL is not numbered; or
 - no Styles are being exported and
   - a bottom spacing (space below the paragraph) exists, or
   - a paragraph alignment exists; or
 - Styles are being exported and
   - the "Text body" paragraph style was changed, or
   - a User-defined format is being exported, or
   - paragraph-attributes exist

I want to know if I'd need to hack that native code there, in order to get
cleaner HTML-output than I'm currently getting from OpenOffice. 

The problem is double-spaced output where it should be single-spaced, plus
the occasional other glitch that seems to involve p tags within certain
HTML lists.
-- 
View this message in context: 
http://www.nabble.com/Improving-HTML-output-after-import-from-MS-Word--%28again%29-tp25530876p25530876.html
Sent from the openoffice - dev mailing list archive at Nabble.com.





Re: [dev] Improving HTML-output after import from MS-Word (again)

2009-09-22 Thread Holger Meyer
I noticed this reply to your message on the list (from mba). Seems like
you did not get it?


larrydlefever wrote:


  per
   
  http://svn.services.openoffice.org/opengrok/xref/DEV300_m59/sw/source/filter/html/htmlatr.cxx#1013
  
  there seem to be certain rules regarding when a P tag is output when using
  Save as ... in Writer (I tried to translate the German here into English
  -- I need help with that, incidentally):
  
  Ein P wird nur geschrieben, wenn
  - wir in keiner OL/UL/DL sind, oder
  - der Absatz einer OL/UL nicht numeriert ist, oder
  - keine Styles exportiert werden und
   - ein unterer Abstand oder
   - eine Absatz-Ausrichtung existiert, ode
  - Styles exportiert werden und,
   - die Textkoerper-Vorlage geaendert wurde, oder
   - ein Benutzer-Format exportiert wird, oder
   - Absatz-Attribute existieren
  
  
  A P is written only if:
   - we're not inside any OL/UL/DL; or
   - the paragraph in an OL/UL is not numbered; or
   - no Styles are being exported and
     - a bottom spacing (space below the paragraph) exists, or
     - a paragraph alignment exists; or
   - Styles are being exported and
     - the "Text body" paragraph style was changed, or
     - a User-defined format is being exported, or
     - paragraph-attributes exist
  
  I want to know if I'd need to hack that native code there, in order to get
  cleaner HTML-output than I'm currently getting from OpenOffice.
   

Yes.


  Incidentally, I've also tried Exporting as XHTML, but the resultant output
  is even worse than that from Save as ...: stuff that should not appear in
  a list does so, etc.
   

Could you create an issue with a sample document showing the problem and
assign it to sus?


  I've tweaked the Java-example servlet for document-conversion, so it takes
  an MS-Word doc as upload and returns (really just the file:/// URL of) an
  HTML-document.
  
  I do like so in my code:
  
  // Setting the filter name
  propertyvalue[1] = new PropertyValue();
  propertyvalue[1].Name = "FilterName";
  propertyvalue[1].Value = "HTML (StarWriter)";
  
  ... which I believe means, effectively, Save as ..., rather than Export,
  the latter involving a different area of the OpenOffice codebase, if I'm not
  mistaken.
   

Whether SaveAs or Export is chosen just depends on whether you use
storeAsURL or storeToURL. The difference is only that in one case
the document takes over the new location while in the other it doesn't.
The GUI stuff around these two functions also uses different filters in
both areas, but that's a limitation you don't have when using the API.
All filters suitable for SaveAs can be used for Export also (but not
the other way around as only filters for formats that OOo can load will
be accepted in storeAsURL).


  So, what's the best way to make the desired improvements in the HTML-output?
   

As both filters (the C++ one for HTML as well as the xslt based one for
XHTML) seem to fail for you, the best way probably is the one you are
more familiar with. If you know something about xslt, perhaps hacking
the xslt for XHTML is better, because the native filter not only
requires good C++ knowledge but also getting familiar with an
unpredictable amount of OOo code (what exactly you will need to know
depends on where your journey will take you).

Regards,
Mathias

-- 
Mathias Bauer (mba) - Project Lead OpenOffice.org Writer
OpenOffice.org Engineering at Sun: http://blogs.sun.com/GullFOSS
Please don't reply to nospamfor...@gmx.de.
I use it for the OOo lists and only rarely read other mails sent to it.



[dev] Improving HTML-output after import from MS-Word

2009-09-21 Thread larrydlefever

per
 
http://svn.services.openoffice.org/opengrok/xref/DEV300_m59/sw/source/filter/html/htmlatr.cxx#1013

there seem to be certain rules regarding when a P tag is output when using
Save as ... in Writer (I tried to translate the German here into English
-- I need help with that, incidentally):

Ein P wird nur geschrieben, wenn
- wir in keiner OL/UL/DL sind, oder
- der Absatz einer OL/UL nicht numeriert ist, oder
- keine Styles exportiert werden und
 - ein unterer Abstand oder
 - eine Absatz-Ausrichtung existiert, ode
- Styles exportiert werden und,
 - die Textkoerper-Vorlage geaendert wurde, oder
 - ein Benutzer-Format exportiert wird, oder
 - Absatz-Attribute existieren


A P is written only if:
 - we're not inside any OL/UL/DL; or
 - the paragraph in an OL/UL is not numbered; or
 - no Styles are being exported and
   - a bottom spacing (space below the paragraph) exists, or
   - a paragraph alignment exists; or
 - Styles are being exported and
   - the "Text body" paragraph style was changed, or
   - a User-defined format is being exported, or
   - paragraph-attributes exist

I want to know if I'd need to hack that native code there, in order to get
cleaner HTML-output than I'm currently getting from OpenOffice.

Incidentally, I've also tried Exporting as XHTML, but the resultant output
is even worse than that from Save as ...: stuff that should not appear in
a list does so, etc.

I've tweaked the Java-example servlet for document-conversion, so it takes
an MS-Word doc as upload and returns (really just the file:/// URL of) an
HTML-document.

I do like so in my code:

// Setting the filter name
propertyvalue[1] = new PropertyValue();
propertyvalue[1].Name = "FilterName";
propertyvalue[1].Value = "HTML (StarWriter)";

... which I believe means, effectively, Save as ..., rather than Export,
the latter involving a different area of the OpenOffice codebase, if I'm not
mistaken.

I've seen some documentation on using XSLT to configure or customize the
Export process, but, as I've just noted, the Export output seems worse than
the output I'm getting (which I believe is from Save as ... instead of
Export).

The problem is that the result (which is, at this point, a resume) comes out
looking double-spaced.  Also, there are two or three cases of another
formatting-issue that seem to have to do with p-tags (or divs) within one
or another type of HTML-list.

So, what's the best way to make the desired improvements in the HTML-output?

Should I just do some quick-and-dirty post-processing in my Java-code
(which, however, means processing the same file twice, essentially)?  Or
should I go deep into that native code to try to fix the relevant filter? 
Or is there a way to use XSLT in this case that I'm missing?



-- 
View this message in context: 
http://www.nabble.com/Improving-HTML-output-after-import-from-MS-Word-tp25530467p25530467.html
Sent from the openoffice - dev mailing list archive at Nabble.com.

