Page in question:
http://docs.php.net/manual/en/tidy.examples.php
Suggestions (summary):
1. Change the example to use a strict doctype, and include the output
from the script.
2. Explain what has happened
3. Give another example where a different type of snippet is corrected
4. Explain what has happened
5. Give a third example where valid, but unsemantic, code will not be
made semantic, to show that one still needs to consider proper usage
even when Tidy is being used. (The chapter on Tidy in PHP Cookbook by
O'Reilly gives a really faulty explanation in this regard.)
-------------------
Suggested text:
-------------------
The script above will output:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
a html document
</body>
</html>
Notice how the missing doctype and the missing XHTML tags (in this
particular example the head, title and body tags) have been added. The
result is well-formed XHTML. Indentation has also been added.
However, in this example Tidy still fails to produce entirely
<dfn>valid</dfn> XHTML 1.0 Strict, since any text in the body must be
inside a block-level element, and Tidy will not guess which block-level
element it should use. Tidy can, however, warn about issues like these.
Adding the line echo $tidy->errorBuffer; would produce the following
warnings:
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 1 column 7 - Warning: plain text isn't allowed in <head> elements
line 1 column 7 - Warning: inserting missing 'title' element
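A minimal sketch of how these warnings could be collected in a script
rather than echoed directly. The helper name tidyWarnings() is
hypothetical and not part of the tidy extension; it only assumes the
parseString(), cleanRepair() and errorBuffer members already used
in this text, and it returns an empty array when the extension is not
loaded.

```php
<?php
// Hypothetical helper (not part of ext/tidy): returns Tidy's warnings
// for a piece of markup as an array with one message per entry.
function tidyWarnings(string $html, array $config = []): array
{
    if (!extension_loaded('tidy')) {
        return []; // assumption: silently skip when tidy is unavailable
    }

    $tidy = new tidy;
    $tidy->parseString($html, $config, 'utf8');
    $tidy->cleanRepair();

    // errorBuffer may be empty or null when Tidy has nothing to report.
    $buffer = trim((string) $tidy->errorBuffer);

    return $buffer === '' ? [] : explode("\n", $buffer);
}

// Usage: print each warning for the snippet from the first example.
$warnings = tidyWarnings(
    '<html>a html document</html>',
    ['output-xhtml' => true]
);
foreach ($warnings as $line) {
    echo $line, "\n";
}
```

The exact wording and number of messages may vary between versions of
the underlying Tidy library.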
Example 2
[Code to be cleaned follows, same script in PHP]
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>A second example</title>
</head>
<body>
<p>
Badly <span>nested <em class=foo>and</span> missing tags.</p>
</body>
</html>
This script will output:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>
A second example
</title>
</head>
<body>
<p>
Badly <span>nested <em class="foo">and</em> missing tags.</span>
</p>
</body>
</html>
Notice that the span and em tags are now correctly nested, and that
quotation marks have been added to the attribute value. The missing
closing tag for the em element has been added as well. This code will
now validate.
Example 3
<?php
ob_start();
?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>A third example</title>
</head>
<body>
<p><abbr>Valid</abbr>, but <cite>misused</cite> use of HTML</p>
<center>Well formed, but <font face="Verdana">unsemantic</font>
HTML</center>
</body>
</html>
<?php
$html = ob_get_clean();
$tidy = new tidy;
$config = array(
'indent' => true,
'output-xhtml' => false,
'doctype' => 'strict',
'drop-font-tags' => true,
'wrap' => 200);
$tidy->parseString($html, $config, 'utf8');
$tidy->cleanRepair();
echo $tidy;
This code will output:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>
A third example
</title>
</head>
<body>
<p>
<abbr>Valid</abbr>, but <cite>misused</cite> use of HTML
</p>Well formed, but unsemantic HTML<br>
</body>
</html>
In this example we converted XHTML to HTML 4.01. Tidy can be used for
conversions in either direction. The namespace attribute was removed.
The elements center and font, now obsolete and better replaced with CSS,
have been removed from the markup, thanks to the setting
"drop-font-tags". Notice that Tidy will not drop these elements unless
this is explicitly requested in the settings, even though they have
been completely removed from HTML 4.01 Strict, XHTML 1.0 Strict and
later. Tidy will not deduce such behavior from the doctype chosen.
Tidy can remove some unsemantic markup, but it cannot check that an
author has used (X)HTML elements properly from a semantic point of view.
Even so, this extension is a powerful tool for detecting most markup
errors and correcting some of them.
----------
English question: Is it "A HTML doc" or "AN HTML doc" (one speaks HTML
with a vowel sound first...)