Mike,
When I tried this, the error I got seemed to indicate that the entity
declarations in the result of the transformation should have had a
space between the % sign and the name that follows it (for example,
"% HTMLlat1" rather than "%HTMLlat1").
I too am curious about this one.
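
For reference, here is a minimal sketch of that check using only the standard JAXP parser. The class name EntityDeclCheck, the root element r, and the entity name demo are made up for illustration; the DTD fragments are cut-down stand-ins, not the XHTML declarations from the output below. It only shows that a parameter entity declaration is rejected when the space after % is missing, and says nothing about why Xalan writes the declarations that way.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: checks whether a document parses as well-formed XML.
class EntityDeclCheck {

    /** Returns true if the JAXP parser accepts the document, false if it rejects it. */
    static boolean isWellFormed(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors and ignores warnings,
        // so well-formedness problems surface as SAXException.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            System.out.println("Rejected: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Parameter entity declared with a space after '%' (the form the XML spec requires).
        String spaced = "<!DOCTYPE r [<!ENTITY % demo \"x\">]><r/>";
        // Same declaration without the space, like "%HTMLlat1" in the transformed output.
        String unspaced = "<!DOCTYPE r [<!ENTITY %demo \"x\">]><r/>";

        System.out.println("spaced accepted:   " + isWellFormed(spaced));
        System.out.println("unspaced accepted: " + isWellFormed(unspaced));
    }
}
```

The exact wording of the parser's error message depends on which parser implementation is on the classpath, so it may not match the message quoted below.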
Bradley
On Apr 16, 2008, at 3:43 PM, Mike Strauch wrote:
Hello,
I've recently upgraded Xalan from 2.6 to 2.7.1 and have run into the
following issues:
I'm using a TransformerIdentityImpl to transform the HTML below, and
the result includes a lot of information that I believe is coming
from the DTD associated with the doctype. I'm not sure why it is
being included. That is not my only concern, though. When I attempt
to validate the result as XML I receive the following error:
"White space is required after "<!ENTITY" in the entity declaration."
I am setting the following output properties on the transformer
itself:
omit-xml-declaration: no
standalone: no
method: xml
I have fiddled around with these output properties and have not been
able to acquire a result that does not include all of the doctype
information. Is there something I am missing?
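
For anyone trying to reproduce this, the setup described above can be sketched roughly as follows with the standard JAXP API. This is an assumption about how the properties are being set, not the actual application code: the class name IdentityTransformDemo and the helper identityTransform are made up, and the sample input deliberately leaves out the DOCTYPE so that running it does not try to fetch the DTD. It does not reproduce or fix the problem; it only shows the three output properties applied to an identity transformer.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class: identity transform with the output properties listed above.
class IdentityTransformDemo {

    /** Copies the input XML to a string using an identity transformer. */
    static String identityTransform(String xml) throws Exception {
        // newTransformer() with no stylesheet returns an identity transformer.
        // Which implementation is used depends on the classpath (for example
        // Xalan 2.7.1, or the transformer bundled with the JDK).
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
        transformer.setOutputProperty(OutputKeys.STANDALONE, "no");
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");

        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Sample input without a DOCTYPE (assumption: avoids any DTD lookup).
        String input = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<body><p>Some text here is ok</p></body></html>";
        System.out.println(identityTransform(input));
    }
}
```

To reproduce the behaviour reported here, the input would need to carry the XHTML DOCTYPE shown under "Original:", and Xalan 2.7.1 would need to be the transformer implementation actually picked up at runtime.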
Original:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Test of lazy DOM builder</title>
</head>
<body>
<p>Some text here is ok</p>
</body>
</html>
Transformed:
<?xml version="1.0" encoding="UTF-8" standalone="no"?><!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
" [\r\n<!ENTITY %HTMLlat1 PUBLIC "-//W3C//ENTITIES Latin 1 for
XHTML//EN" >\r\n<!ENTITY %HTMLsymbol PUBLIC "-//W3C//ENTITIES
Symbols for XHTML//EN" >\r\n<!ENTITY %HTMLspecial PUBLIC "-//W3C//
ENTITIES Special for XHTML//EN" >\r\n<!--================== Imported
Names ====================================--><!-- media type, as per
[RFC2045] --><!-- comma-separated list of media types, as per
[RFC2045] --><!-- a character encoding, as per [RFC2045] --><!-- a
space separated list of character encodings, as per [RFC2045] --
><!-- a language code, as per [RFC3066] --><!-- a single character,
as per section 2.2 of [XML] --><!-- one or more digits --><!-- space-
separated list of link types --><!-- single or comma-separated list
of media descriptors --><!-- a Uniform Resource Identifier, see
[RFC2396] --><!-- a space separated list of Uniform Resource
Identifiers --><!-- date and time information. ISO date format --
><!-- script expression --><!-- style sheet data --><!-- used for
titles etc. --><!-- render in this frame --><!-- nn for pixels or nn
% for percentage length --><!-- pixel, percentage, or relative --
><!-- integer representing length in pixels --><!-- these are used
for image maps --><!-- comma separated list of lengths --><!-- used
for object, applet, img, input and iframe --><!-- a color using
sRGB: #RRGGBB as Hex values --><!-- There are also 16 widely known
color names with their sRGB values:\n\n Black = #000000
Green = #008000\n Silver = #C0C0C0 Lime = #00FF00\n
Gray = #808080 Olive = #808000\n White = #FFFFFF Yellow
= #FFFF00\n Maroon = #800000 Navy = #000080\n Red =
#FF0000 Blue = #0000FF\n Purple = #800080 Teal =
#008080\n Fuchsia= #FF00FF Aqua = #00FFFF\n--><!--
=================== Generic Attributes
===============================--><!-- core attributes common to
most elements\n id document-wide unique id\n class space
separated list of classes\n style associated style info\n
title advisory title/amplification\n--><!-- internationalization
attributes\n lang language code (backwards compatible)\n
xml:lang language code (as per XML 1.0 spec)\n dir
direction for weak/neutral text\n--><!-- attributes for common UI
events\n onclick a pointer button was clicked\n ondblclick a
pointer button was double clicked\n onmousedown a pointer button
was pressed down\n onmouseup a pointer button was released\n
onmousemove a pointer was moved onto the element\n onmouseout a
pointer was moved away from the element\n onkeypress a key was
pressed and released\n onkeydown a key was pressed down\n
onkeyup a key was released\n--><!-- attributes for elements that
can get the focus\n accesskey accessibility key character\n
tabindex position in tabbing order\n onfocus the element got
the focus\n onblur the element lost the focus\n--><!-- text
alignment for p, div, h1-h6. The default is\n align="left" for
ltr headings, "right" for rtl --><!--=================== Text
Elements ====================================--><!-- these can occur
at block or inline level --><!-- these can only occur at block level
--><!-- %Inline; covers inline or "text-level" elements --><!--
================== Block level elements
==============================--><!-- %Flow; mixes block and inline
and is used for list items etc. --><!--================== Content
models for exclusions =====================--><!-- a elements use
%Inline; excluding a --><!-- pre uses %Inline excluding img, object,
applet, big, small,\n font, or basefont --><!-- form uses %Flow;
excluding form --><!-- button uses %Flow; but excludes a, form, form
controls, iframe --><!--================ Document Structure
==================================--><!-- the namespace URI
designates the document profile --><!--================ Document
Head =======================================--><!-- content model is
%head.misc; combined with a single\n title and an optional base
element in any order --><!-- The title element is not considered
part of the flow of text.\n It should be displayed, for
example as the page header or\n window title. Exactly one
title is required per document.\n --><!-- document base URI --
><!-- generic metainformation --><!--\n Relationship values can be
used in principle:\n\n a) for document specific toolbars/menus
when used\n with the link element in document head e.g.
\n start, contents, previous, next, index, end, help\n b)
to link to a separate style sheet (rel="stylesheet")\n c) to make
a link to a script (rel="script")\n d) by stylesheets to control
how collections of\n html nodes are rendered into printed
documents\n e) to make a link to a printable version of this
document\n e.g. a PostScript or PDF version (rel="alternate"
media="print")\n--><!-- style info, which may include CDATA sections
--><!-- script statements, which may include CDATA sections --><!--
alternate content container for non script-based rendering --><!--
======================= Frames
=======================================--><!-- inline subwindow --
><!-- alternate content container for non frame-based rendering --
><!--=================== Document Body
====================================--><!-- generic language/style
container --><!--=================== Paragraphs
=======================================--><!--===================
Headings =========================================--><!--\n There
are six levels of headings from h1 (the most important)\n to h6
(the least important).\n--><!--=================== Lists
============================================--><!-- Unordered list
bullet styles --><!-- Unordered list --><!-- Ordered list numbering
style\n\n 1 arabic numbers 1, 2, 3, ...\n a lower
alpha a, b, c, ...\n A upper alpha A, B, C, ...
\n i lower roman i, ii, iii, ...\n I upper
roman I, II, III, ...\n\n The style is applied to the
sequence number which by default\n is reset to 1 for the first
list item in an ordered list.\n--><!-- Ordered (numbered) list --
><!-- single column list (DEPRECATED) --><!-- multiple column list
(DEPRECATED) --><!-- LIStyle is constrained to: "(%ULStyle;|
%OLStyle;)" --><!-- list item --><!-- definition lists - dt for
term, dd for its definition --><!--=================== Address
==========================================--><!-- information on
author --><!--=================== Horizontal Rule
==================================--><!--===================
Preformatted Text ================================--><!-- content is
%Inline; excluding \n "img|object|applet|big|small|sub|sup|
font|basefont" --><!--=================== Block-like Quotes
================================--><!--=================== Text
alignment ===================================--><!-- center content
--><!--=================== Inserted/Deleted Text
============================--><!--\n ins/del are allowed in block
and inline content, but its\n inappropriate to include block
content within an ins element\n occurring in inline content.\n--
><!--================== The Anchor Element
================================--><!-- content is %Inline; except
that anchors shouldn't be nested --><!--===================== Inline
Elements ================================--><!-- generic language/
style container --><!-- I18N BiDi over-ride --><!-- forced line
break --><!-- emphasis --><!-- strong emphasis --><!-- definitional
--><!-- program code --><!-- sample --><!-- something user would
type --><!-- variable --><!-- citation --><!-- abbreviation --><!--
acronym --><!-- inlined quote --><!-- subscript --><!-- superscript
--><!-- fixed pitch font --><!-- italic font --><!-- bold font --
><!-- bigger font --><!-- smaller font --><!-- underline --><!--
strike-through --><!-- strike-through --><!-- base font size --><!--
local change to font --><!--==================== Object
======================================--><!--\n object is used to
embed objects as part of HTML pages.\n param elements should
precede other content. Parameters\n can also be expressed as
attribute/value pairs on the\n object element itself when brevity
is desired.\n--><!--\n param is used to supply a named property
value.\n In XML it would seem natural to follow RDF and support an
\n abbreviated syntax where the param elements are replaced\n by
attribute value pairs on the object start tag.\n--><!--
=================== Java applet ==================================--
><!--\n One of code or object attributes must be present.\n Place
param elements before other content.\n--><!--===================
Images ===========================================--><!--\n To
avoid accessibility problems for people who aren't\n able to see
the image, you should provide a text\n description using the alt
and longdesc attributes.\n In addition, avoid the use of server-
side image maps.\n--><!-- usemap points to a map element which may
be in this document\n or an external document, although the latter
is not widely supported --><!--================== Client-side image
maps ============================--><!-- These can be placed in the
same document or grouped in a\n separate document although this
isn't yet widely supported --><!--================ Forms
===============================================--><!-- forms
shouldn't be nested --><!--\n Each label must not contain more than
ONE field\n Label elements shouldn't be nested.\n--><!-- the name
attribute is required for all but submit & reset --><!-- form
control --><!-- option selector --><!-- option group --><!--
selectable choice --><!-- multi-line text field --><!--\n The
fieldset element is used to group form fields.\n Only one legend
element should occur in the content\n and if present should only be
preceded by whitespace.\n--><!-- fieldset label --><!--\n Content is
%Flow; excluding a, form, form controls, iframe\n--><!-- push button
--><!-- single-line text input control (DEPRECATED) --><!--
======================= Tables
=======================================--><!-- Derived from IETF
HTML table standard, see [RFC1942] --><!--\n The border attribute
sets the thickness of the frame around the\n table. The default
units are screen pixels.\n\n The frame attribute specifies which
parts of the frame around\n the table should be rendered. The values
are not the same as\n CALS to avoid a name clash with the valign
attribute.\n--><!--\n The rules attribute defines which rules to
draw between cells:\n\n If rules is absent then assume:\n "none"
if border is absent or border="0" otherwise "all"\n--><!--
horizontal placement of table relative to document --><!--
horizontal alignment attributes for cell contents\n\n char
alignment char, e.g. char=':'\n charoff offset for alignment
char\n--><!-- vertical alignment attributes for cell contents --><!--
\ncolgroup groups a set of col elements. It allows you to group
\nseveral semantically related columns together.\n--><!--\n col
elements define the alignment properties for cells in\n one or more
columns.\n\n The width attribute specifies the width of the columns,
e.g.\n\n width=64 width in screen pixels\n
width=0.5* relative width of 0.5\n\n The span attribute causes
the attributes of one\n col element to apply to more than one column.
\n--><!--\n Use thead to duplicate headers when breaking table
\n across page boundaries, or for static headers when\n tbody
sections are rendered in scrolling panel.\n\n Use tfoot to
duplicate footers when breaking table\n across page boundaries,
or for static footers when\n tbody sections are rendered in
scrolling panel.\n\n Use multiple tbody sections when rules are
needed\n between groups of table rows.\n--><!-- Scope is simpler
than headers attribute for common tables --><!-- th is for headers,
td for data and for cells acting as both -->]>\r\n<html xmlns="http://www.w3.org/1999/xhtml
" xml:lang="en">\r\n\t<head>\r\n\t\t<title>Test of lazy DOM builder</
title>\r\n\t</head>\r\n\t<body>\r\n\t\t<p>Some text here is ok</p>\r
\n\t</body>\r\n</html>
Cheers!
-Mike