> From: [EMAIL PROTECTED]
> Date: Fri, 31 Aug 2007 14:06:59 +0000
>
> Tobia Conforto writes:
>
>> I have a data source from which I get SAX text nodes into my pipeline
>> that contain escaped HTML entities and tags. In Java syntax:
>>
>> "Lorem ipsum &mdash; dolor sit amet. <br> Consectetuer"
>>
>> or, in XML syntax:
>>
>> Lorem ipsum &amp;mdash; dolor sit amet. &lt;br&gt; Consectetuer
>>
>> As you can see, the entities and tags are escaped and part of the
>> text node.
>>
>> I cannot change this data source component, therefore I need a
>> transformer to examine every text node in the stream, split it at the
>> fake "<br>" tags, substitute them with elements, and
>> replace every escaped entity with the relevant Unicode character.
>
> That's one of the rare cases where I consider
> disable-output-escaping="yes" a valid approach [1]. I don't know if
> there is something comparable directly on the Java side.

Unless I'm mistaken, doing that on his example would result in a document that isn't well-formed, as there's no matching </br> for the <br>...? It would be okay if it can be guaranteed that the included text is well-formed XHTML, but if it's plain old HTML then it sounds to me more like a job for the JTidy or Neko-based HTML transformers.

We have something similar in our application. I arrange the early part of the pipeline so that the escaped HTML appears within a unique element, e.g. <some_escaped_html>Lorem ipsum &lt;br&gt; dolor</some_escaped_html>, pass it through the HTML transformer, and follow that with a small XSL transformation that strips out the some_escaped_html elements (and the html and body elements that JTidy inserts), plus the usual "passthrough" templates for all other nodes.

Net result: the same SAX stream, but with the HTML unescaped and cleaned up so it's well-formed again.

Andrew.
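
For illustration only, the stripping step described above might look roughly like the sketch below. It is not the stylesheet from our application. It assumes the HTML transformer leaves html and body directly inside some_escaped_html and that none of these elements are in a namespace (if your setup emits XHTML-namespaced elements, the match patterns need a prefix). The class name StripWrappers and the method strip are made up for the example, and the stylesheet is run through the JDK's own javax.xml.transform rather than a Cocoon pipeline so it can be tried standalone.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, not part of Cocoon or JTidy.
final class StripWrappers {

    // Identity ("passthrough") template for every node, plus one
    // higher-priority template that drops the wrapper elements
    // themselves while still processing their children.
    private static final String XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      // Only html/body nested in the wrapper are removed, so a genuine
      // html root elsewhere in the stream would be left alone.
      + "<xsl:template match='some_escaped_html"
      + "|some_escaped_html/html"
      + "|some_escaped_html/html/body'>"
      + "<xsl:apply-templates/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to an XML string and returns the result.
    static String strip(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)),
                        new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Assumed shape of the stream after the HTML transformer has run.
        String cleaned = "<doc><some_escaped_html><html><body>"
            + "Lorem ipsum <br/> dolor"
            + "</body></html></some_escaped_html></doc>";
        System.out.println(strip(cleaned));
    }
}
```

In a real pipeline the same two templates would simply live in a stylesheet placed after the HTML transformer. If your transformer also emits a head element, it would need its own template.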