Page in question:

http://docs.php.net/manual/en/tidy.examples.php

Suggestions (summary):

1. Change the example to use a strict doctype, and include the output from the script.

2. Explain what has happened

3. Give another example where a different type of snippet is corrected

4. Explain what has happened

5. Give a third example where valid, but unsemantic, code will not be made semantic, to show that one still needs to consider proper usage even when Tidy is being used. (The chapter on Tidy in PHP Cookbook by O'Reilly gives a really faulty explanation in this regard.)

-------------------
Suggested text:
-------------------

The script above will output:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title></title>
  </head>
  <body>
    a html document
  </body>
</html>

Notice how the missing doctype and the missing XHTML tags (in this particular example head, title and body) have been added. The result is well-formed XHTML. Indentation has also been added.

However, in this example Tidy still fails to produce entirely <dfn>valid</dfn> XHTML 1.0 strict, since any text in the body must be inside a block-level element, and Tidy will not guess which block-level element it should use. Tidy can, however, give warnings about issues like these. Adding the line echo $tidy->errorBuffer; to the script would produce the following warnings:

line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 1 column 7 - Warning: plain text isn't allowed in <head> elements
line 1 column 7 - Warning: inserting missing 'title' element
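
A minimal sketch of how the warnings above could be retrieved. It assumes the tidy extension is loaded and that the input resembles the one in the first example; the input string and the configuration values are assumptions and are not copied from the manual page.

```php
<?php
// Assumption: an input similar to the first example (no doctype, no head, no body).
$html = '<html>a html document</html>';

// Assumption: settings similar to those used elsewhere in this text.
$config = array(
    'indent'       => true,
    'output-xhtml' => true,
    'doctype'      => 'strict',
    'wrap'         => 200,
);

$tidy = new tidy;
$tidy->parseString($html, $config, 'utf8');
$tidy->cleanRepair();

// errorBuffer holds the warnings and errors Tidy collected while parsing.
echo $tidy->errorBuffer, "\n";
```

The exact wording and number of the messages may differ between versions of the underlying Tidy library.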

Example 2

[Code to be cleaned follows, same script in PHP]

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>A second example</title>
  </head>
  <body>
    <p>
      Badly <span>nested <em class=foo>and</span> missing tags.</p>
  </body>
</html>

This script will output:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>
      A second example
    </title>
  </head>
  <body>
    <p>
      Badly <span>nested <em class="foo">and</em> missing tags.</span>
    </p>
  </body>
</html>

Notice that the span and em tags are now correctly nested, and that quotation marks have been added to the attribute value. The missing closing tag for the em element has been added as well. This code will now validate.

Example 3

<?php
ob_start();
?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>A third example</title>
  </head>
  <body>
    <p><abbr>Valid</abbr>, but <cite>misused</cite> use of HTML</p>
    <center>Well formed, but <font face="Verdana">unsemantic</font>
       HTML</center>
  </body>
</html>
<?php
$html = ob_get_clean();

$tidy = new tidy;
$config = array(
           'indent'         => true,
           'output-xhtml'   => false,
           'doctype'        => 'strict',
           'drop-font-tags' => true,
           'wrap'           => 200);
$tidy->parseString($html, $config, 'utf8');
$tidy->cleanRepair();
echo $tidy;

This code will output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
  <head>
    <title>
      A third example
    </title>
  </head>
  <body>
    <p>
      <abbr>Valid</abbr>, but <cite>misused</cite> use of HTML
    </p>Well formed, but unsemantic HTML<br>
  </body>
</html>

In this example we converted XHTML to HTML 4.01. Tidy can be used for conversions in either direction. The namespace attribute was removed. The elements center and font, now obsolete and better replaced with CSS, have been removed from the markup thanks to the setting "drop-font-tags". Notice that Tidy will not drop these elements unless this has been explicitly requested in the settings, even though they have been completely removed from HTML 4.01 strict and from XHTML 1.0 strict and later. Tidy will not deduce such behavior from the doctype chosen. Tidy can remove some unsemantic markup, but it cannot check that an author has used (X)HTML elements properly from a semantic point of view.

Even so, this extension is a powerful tool for detecting most markup errors and correcting some of them.
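
For the checking side alone, a rough sketch along these lines might do. The function name count_markup_problems is hypothetical and not part of the extension; the snippet passed to it is only an illustration, and the tidy extension is assumed to be loaded.

```php
<?php
// Hypothetical helper: reports how many problems Tidy finds without echoing repaired markup.
function count_markup_problems(string $html): array
{
    $tidy = new tidy;
    // Assumption: check against a strict doctype, as in the examples in this text.
    $tidy->parseString($html, array('doctype' => 'strict'), 'utf8');
    // diagnose() runs Tidy's diagnostics on the parsed document.
    $tidy->diagnose();

    return array(
        'errors'   => tidy_error_count($tidy),
        'warnings' => tidy_warning_count($tidy),
    );
}

// Illustrative input with badly nested tags and an unquoted attribute value.
print_r(count_markup_problems('<p>Badly <span>nested <em class=foo>and</span> tags.</p>'));
```

A count of zero only means that Tidy found nothing to complain about; as described above, it says nothing about whether the elements were used in a semantically sensible way.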

----------

English question: Is it "a HTML doc" or "an HTML doc"? (One pronounces HTML with a vowel sound first...)
