Aha. The final solution to this was reconfiguring JTidy (the first
step in my processing pipeline) to say:
tidy.setXHTML(false);
tidy.setXmlOut(false);
instead of saying:
tidy.setXHTML(true);
Fixing that means JTidy no longer "pretty-prints" with an inserted
namespace, which means NekoHTML doesn't get a wrong namespace, which
means I avoid the eventual output problems. And now my tags look
right: <STRONG></STRONG> is being output.
Do I need to be concerned about this line showing up in my html source?
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">
Or is that appropriate for a regular html file?
Thanks so much! Code is working now.
Jenny Brown
On Thu, Apr 17, 2008 at 10:57 AM, Brian Minchau <[EMAIL PROTECTED]> wrote:
>
> Hi Jenny.
>
> Yes, Henry is right.
>
>
> I don't know how I missed what you wrote:
> > which results in browser bombs, and starts with:
> > <HTML xmlns="http://www.w3.org/1999/xhtml" lang="en">
>
> That default namespace forces this HTML element to be treated as XML.
> Likewise for any other element that is in a non-null namespace.
>
> - Brian
>
> ----- Forwarded by Brian Minchau/Toronto/IBM on 04/17/2008 11:54 AM -----
>
> From: Henry Zongaro/Toronto/I [EMAIL PROTECTED]
> Date: 04/17/2008 10:50 AM
> To: "Jenny Brown" <[EMAIL PROTECTED]>
> cc: [email protected]
> Subject: Re: Trouble exporting HTML from a DOM in memory
>
> Hi, Jenny.
>
> "Jenny Brown" <[EMAIL PROTECTED]> wrote on 2008-04-16 09:27:44 PM:
> > The main situation I'm having trouble with is empty tags. For
> > instance... my input file contains:
> > <P>This is some <STRONG></STRONG> paragraph text.</P>
> > <P>This is a textarea. <TEXTAREA name="foo"></TEXTAREA> It has text
> > after it.</P>
> >
> > It gets into my in-memory dom tree okay. But then when I try to use a
> > transformer to output the html, instead I get this which Firefox
> > chokes on:
> > <P>This is some <STRONG/> paragraph text.</P>
> > <P>This is a textarea. <TEXTAREA name="foo"/> It has text after it.</P>
> >
> > [Snip]
> >
> > Transformer transformer =
> >     TransformerFactory.newInstance().newTransformer();
> > transformer.setOutputProperty(OutputKeys.METHOD, "html");
> > transformer.setOutputProperty(OutputKeys.MEDIA_TYPE, "text/html");
> > transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
> > transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
> >
> > [Snip]
> >
> > So, I'm trying to tell it to give me html, but what I get is a
> > document that contains xml-like empty tags wherever the tag was empty,
> > which results in browser bombs, and starts with:
> > <HTML xmlns="http://www.w3.org/1999/xhtml" lang="en">
>
> I think this is the key. You have specified that you want to use the html
> output method, but your output is really xhtml. Because your output is in
> an XML namespace, the serializer is required to serialize the output as
> XML, despite the fact that you've used the html output method. However,
> to be displayed correctly in a browser, XHTML has to adhere to certain
> lexical conventions that ordinary XML does not.
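The diagnosis above depends on whether the elements in the DOM carry a non-null namespace URI. A minimal sketch of how that could be checked before serializing, using only the JDK's DOM API; the class name NamespaceCheck and its methods are hypothetical helpers for illustration, not part of Xalan-J or of the code in this thread:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, assumed for illustration only.
class NamespaceCheck {

    // Parse with a namespace-aware parser so getNamespaceURI() is populated.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Return the first element at or below node that is in a namespace, or null.
    static Element firstNamespacedElement(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE && node.getNamespaceURI() != null) {
            return (Element) node;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            Element hit = firstNamespacedElement(child);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        Document xhtml = parse("<HTML xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"en\">"
                + "<P>text <STRONG></STRONG></P></HTML>");
        Document plain = parse("<HTML lang=\"en\"><P>text <STRONG></STRONG></P></HTML>");
        System.out.println("xhtml input is namespaced: " + (firstNamespacedElement(xhtml) != null));
        System.out.println("plain input is namespaced: " + (firstNamespacedElement(plain) != null));
    }
}
```

If the check finds a namespaced element, the html output method is expected to fall back to XML-style serialization for it, which matches the symptom reported above.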
>
> XSLT 1.0 does not define an xhtml output method, but Xalan-J does allow you
> to give it a clue that what you're serializing is really XHTML. If you add
> the following output property, the serializer will emit empty tags using a
> space before the trailing /> - thus, <STRONG />
>
> transformer.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC,
>     "-//W3C//DTD XHTML 1.0 Transitional//EN");
>
> That will probably help with a tag like <br> which is always supposed to be
> empty - it will be serialized as <br /> - but probably not with STRONG and
> TEXTAREA which happen to have no content in your DOM tree, but ordinarily
> would have content. They really should be serialized as <STRONG></STRONG>
> rather than <STRONG />. This issue has previously been reported as Jira
> issue XALANJ-1906.[1]
>
> In the meantime, you probably have a couple of options for working around
> this issue: one would be to recreate the DOM tree using elements that are
> in no namespace rather than in the XHTML namespace - then the html
> output method would work properly; another would be to search the DOM tree
> looking for elements that ordinarily have content that are actually empty,
> and give them a single whitespace node child or remove them from the tree
> entirely. You could also write XSLT stylesheets to implement any of those
> work-arounds; let us know if you'd like an example.
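The two DOM-level workarounds described above could be sketched roughly as follows, using only JDK classes. Everything here is an assumption made for illustration: the class NamespaceWorkarounds, its method names, and the ALWAYS_EMPTY list (taken from the HTML 4.01 elements that never have content) do not come from Xalan-J or from the original code. The second workaround is simplified to pad every childless element that is not in that list, rather than deciding which elements "ordinarily have content". Note also that a space inside TEXTAREA becomes the field's initial value.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class; none of these names come from Xalan-J or the thread.
class NamespaceWorkarounds {

    // HTML 4.01 elements that are always empty and must stay childless.
    private static final Set<String> ALWAYS_EMPTY = new HashSet<>(Arrays.asList(
            "area", "base", "basefont", "br", "col", "frame", "hr",
            "img", "input", "isindex", "link", "meta", "param"));

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    private static String plainName(Node node) {
        return node.getLocalName() != null ? node.getLocalName() : node.getNodeName();
    }

    // Workaround 1: copy the tree into a new document using no-namespace elements.
    static Document withoutNamespaces(Document source) throws Exception {
        Document target = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        target.appendChild(copy(source.getDocumentElement(), target));
        return target;
    }

    private static Node copy(Node node, Document target) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return target.importNode(node, false); // text, comments and the like
        }
        Element element = target.createElement(plainName(node));
        NamedNodeMap attributes = node.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            String name = attributes.item(i).getNodeName();
            // Drop xmlns declarations so the namespace is not reintroduced.
            if (!name.equals("xmlns") && !name.startsWith("xmlns:")) {
                element.setAttribute(name, attributes.item(i).getNodeValue());
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            element.appendChild(copy(child, target));
        }
        return element;
    }

    // Workaround 2: give each childless element that is not always-empty one space.
    static void padEmptyElements(Node node) {
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            padEmptyElements(child);
        }
        if (node.getNodeType() == Node.ELEMENT_NODE && !node.hasChildNodes()
                && !ALWAYS_EMPTY.contains(plainName(node).toLowerCase(Locale.ROOT))) {
            node.appendChild(node.getOwnerDocument().createTextNode(" "));
        }
    }

    // Serialize with the html output method, as in the original transformer setup.
    static String toHtml(Document document) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(document), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document xhtml = parse("<HTML xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"en\">"
                + "<P>This is some <STRONG></STRONG> paragraph text.<BR/></P></HTML>");
        System.out.println(toHtml(withoutNamespaces(xhtml))); // workaround 1
        padEmptyElements(xhtml);                              // workaround 2, in place
        System.out.println(toHtml(xhtml));
    }
}
```

The first workaround leaves the original document untouched and returns a copy; the second modifies the tree in place, so it is worth applying only to a DOM that is about to be serialized.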
>
> Thanks,
>
> Henry
> [1] http://issues.apache.org/jira/browse/XALANJ-1906
> ------------------------------------------------------------------
> Henry Zongaro
> XML Transformation & Query Development
> IBM Toronto Lab T/L 313-6044; Phone +1 905 413-6044
> mailto:[EMAIL PROTECTED]
>
>