Re: Fw: Trouble exporting HTML from a DOM in memory

Jenny Brown Thu, 17 Apr 2008 15:32:19 -0700

Aha.  The final solution to this was reconfiguring JTidy (the first
step in my processing pipeline) to say:


                tidy.setXHTML(false);
                tidy.setXmlOut(false);

instead of saying:

                tidy.setXHTML(true);

Fixing that means JTidy no longer "pretty-prints" with an inserted
namespace, which means NekoHTML doesn't get a wrong namespace, which
means I avoid the eventual output problems.  And now my tags look
right:  <STRONG></STRONG> is being output.
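
For anyone who wants to reproduce the effect without JTidy or NekoHTML in
the picture, here is a minimal sketch that uses only the JDK's own JAXP
classes.  It is not code from this thread: the class and method names
(HtmlOutputNamespaceDemo, buildParagraph, serializeAsHtml) are made up for
illustration.  It builds the same small paragraph twice, once with elements
in the XHTML namespace and once with elements in no namespace, and runs both
through the html output method.  I would expect the namespaced variant to
come out with an XML-style empty tag on Xalan-derived serializers, but treat
that as an assumption about your particular implementation.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class HtmlOutputNamespaceDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Builds <P>This is some <STRONG/> paragraph text.</P>; pass null for "no namespace". */
    static Document buildParagraph(String namespaceUri) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().newDocument();
            Element p = doc.createElementNS(namespaceUri, "P");
            p.appendChild(doc.createTextNode("This is some "));
            p.appendChild(doc.createElementNS(namespaceUri, "STRONG")); // empty element
            p.appendChild(doc.createTextNode(" paragraph text."));
            doc.appendChild(p);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Serializes the DOM with the html output method, as in the original post. */
    static String serializeAsHtml(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "html");
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            transformer.setOutputProperty(OutputKeys.INDENT, "no");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Elements in no namespace: the html output method applies its HTML rules.
        System.out.println(serializeAsHtml(buildParagraph(null)));
        // Elements in the XHTML namespace: the serializer may fall back to XML rules.
        System.out.println(serializeAsHtml(buildParagraph(XHTML_NS)));
    }
}
```

Comparing the two printed lines on your own setup is a quick way to check
whether a stray namespace is what is producing the self-closed tags.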

Do I need to be concerned about this line showing up in my html source?
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">

Or is that appropriate for a regular html file?

Thanks so much!  Code is working now.

Jenny Brown


On Thu, Apr 17, 2008 at 10:57 AM, Brian Minchau <[EMAIL PROTECTED]> wrote:
>
>  Hi Jenny.
>
>  Yes, Henry is right.
>
>
>  I don't know how I missed what you wrote:
>  > which results in browser bombs, and starts with:
>  > <HTML xmlns="http://www.w3.org/1999/xhtml" lang="en">
>
>  That default namespace forces this HTML element to be treated as XML.
>  Likewise for any other element that is in a non-null namespace.
>
>  - Brian
>
>  ----- Forwarded by Brian Minchau/Toronto/IBM on 04/17/2008 11:54 AM -----
>
>  From:    Henry Zongaro/Toronto/I[EMAIL PROTECTED]
>  Date:    04/17/2008 10:50 AM
>  To:      "Jenny Brown" <[EMAIL PROTECTED]>
>  cc:      [email protected]
>  Subject: Re: Trouble exporting HTML from a DOM in memory
>
>  Hi, Jenny.
>
>  "Jenny Brown" <[EMAIL PROTECTED]> wrote on 2008-04-16 09:27:44 PM:
>  > The main situation I'm having trouble with is empty tags.  For
>  > instance... my input file contains:
>  > <P>This is some <STRONG></STRONG> paragraph text.</P>
>  > <P>This is a textarea.  <TEXTAREA name="foo"></TEXTAREA>  It has text
>  > after it.</P>
>  >
>  > It gets into my in-memory dom tree okay.  But then when I try to use a
>  > transformer to output the html, instead I get this which Firefox
>  > chokes on:
>  > <P>This is some <STRONG/> paragraph text.</P>
>  > <P>This is a textarea.  <TEXTAREA name="foo"/> It has text after it.</P>
>  >
>  > [Snip]
>  >
>  > Transformer transformer =
>  >     TransformerFactory.newInstance().newTransformer();
>  > transformer.setOutputProperty(OutputKeys.METHOD, "html");
>  > transformer.setOutputProperty(OutputKeys.MEDIA_TYPE, "text/html");
>  > transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
>  > transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
>  >
>  > [Snip]
>  >
>  > So, I'm trying to tell it to give me html, but what I get is a
>  > document that contains xml-like empty tags wherever the tag was empty,
>  > which results in browser bombs, and starts with:
>  > <HTML xmlns="http://www.w3.org/1999/xhtml" lang="en">
>
>  I think this is the key.  You have specified that you want to use the html
>  output method, but your output is really xhtml.  Because your output is in
>  an XML namespace, the serializer is required to serialize the output as
>  XML, despite the fact that you've used the html output method.  However,
>  to be displayed correctly in a browser, XHTML has to adhere to certain
>  lexical conventions that ordinary XML does not.
>
>  XSLT 1.0 does not define an xhtml output method, but Xalan-J does allow you
>  to give it a clue that what you're serializing is really XHTML.  If you add
>  the following output property, the serializer will emit empty tags using a
>  space before the trailing /> - thus, <STRONG />
>
>  transformer.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC,
>      "-//W3C//DTD XHTML 1.0 Transitional//EN");
>
>  That will probably help with a tag like <br> which is always supposed to be
>  empty - it will be serialized as <br /> - but probably not with STRONG and
>  TEXTAREA which happen to have no content in your DOM tree, but ordinarily
>  would have content.  They really should be serialized as <STRONG></STRONG>
>  rather than <STRONG />.  This issue has previously been reported as Jira
>  issue XALANJ-1906.[1]
>
>  In the meantime, you probably have a couple of options for working around
>  this issue:  one would be to recreate the DOM tree using elements that are
>  in no namespace rather than in the XHTML namespace - then the html output
>  method would work properly; another would be to search the DOM tree for
>  elements that ordinarily have content but are actually empty, and either
>  give them a single whitespace node child or remove them from the tree
>  entirely.  You could also write XSLT stylesheets to implement any of those
>  work-arounds; let us know if you'd like an example.
>
>  Thanks,
>
>  Henry
>  [1] http://issues.apache.org/jira/browse/XALANJ-1906
>  ------------------------------------------------------------------
>  Henry Zongaro
>  XML Transformation & Query Development
>  IBM Toronto Lab   T/L 313-6044;  Phone +1 905 413-6044
>  mailto:[EMAIL PROTECTED]
>
>
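
The first workaround Henry describes above (recreating the DOM tree with
elements in no namespace) could be sketched roughly as follows.  This is an
illustration under stated assumptions, not code from Henry or from Xalan:
the class and method names (NamespaceStripper, copyWithoutNamespaces,
copyNode, parse) are hypothetical, prefixed attributes such as xml:lang are
reduced to their local names, and doctypes and processing instructions are
simply dropped.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class NamespaceStripper {

    static DocumentBuilder newBuilder() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Convenience helper for the demo: parses a well-formed XML string. */
    static Document parse(String xml) {
        try {
            return newBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns a new document whose elements and attributes are in no namespace. */
    static Document copyWithoutNamespaces(Document source) {
        Document target = newBuilder().newDocument();
        for (Node c = source.getFirstChild(); c != null; c = c.getNextSibling()) {
            Node copy = copyNode(c, target);
            if (copy != null) {
                target.appendChild(copy);
            }
        }
        return target;
    }

    private static String localName(Node node) {
        // DOM Level 1 nodes have no local name; fall back to the node name.
        return node.getLocalName() != null ? node.getLocalName() : node.getNodeName();
    }

    static Node copyNode(Node node, Document target) {
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE: {
                Element copy = target.createElementNS(null, localName(node));
                NamedNodeMap attrs = node.getAttributes();
                for (int i = 0; i < attrs.getLength(); i++) {
                    Node attr = attrs.item(i);
                    String qName = attr.getNodeName();
                    if (qName.equals("xmlns") || qName.startsWith("xmlns:")) {
                        continue; // drop namespace declarations
                    }
                    copy.setAttributeNS(null, localName(attr), attr.getNodeValue());
                }
                for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                    Node child = copyNode(c, target);
                    if (child != null) {
                        copy.appendChild(child);
                    }
                }
                return copy;
            }
            case Node.TEXT_NODE:
            case Node.CDATA_SECTION_NODE:
                return target.createTextNode(node.getNodeValue());
            case Node.COMMENT_NODE:
                return target.createComment(node.getNodeValue());
            default:
                return null; // doctype, processing instructions, etc. are skipped
        }
    }

    public static void main(String[] args) {
        Document in = parse("<HTML xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"en\">"
                + "<P>This is some <STRONG></STRONG> paragraph text.</P></HTML>");
        Document out = copyWithoutNamespaces(in);
        System.out.println("root namespace: " + out.getDocumentElement().getNamespaceURI());
    }
}
```

The copied document can then be handed to the transformer with the html
output method in place of the original tree.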
