Re: A strange bit of HTML

Hrvoje Niksic Wed, 16 Jan 2002 12:12:24 -0800

"Ian Abbott" <[EMAIL PROTECTED]> writes:

> I came across this extract from a table on a website:
> 
> <td ALIGN=CENTER VALIGN=CENTER WIDTH="120" HEIGHT="120"><a
> href="66B27885.htm" "msover1('Pic1','thumbnails/MO66B27885.jpg');"
> onMouseOut="msout1('Pic1','thumbnails/66B27885.jpg');"><img
> SRC="thumbnails/66B27885.jpg" NAME="Pic1" BORDER=0 ></a></td>
> 
> Note the string beginning "msover1(", which seems to be an
> attribute value without a name, so that makes it illegal HTML.


I think it's even worse than that.  My limited knowledge of SGML
taught me that <foo bar> is equivalent to <foo bar=bar>, which means
that given <foo bar>, "bar" is an attribute *name*, not a value.

If I understand SGML correctly, attribute names cannot be quoted.
This makes <foo "bar"> illegal even if <foo bar=10> or <foo bar> are
perfectly valid.

> I haven't traced what Wget is actually doing when it encounters
> this, but it doesn't treat "66B27885.htm" as a URL to be
> downloaded.

According to Wget's notion of HTML, the A tag in question is simply
not a well-formed tag.  This means that Wget's parser will "back out"
to the character "a" (the second char of <a href="...") and continue
parsing from there.  Generally, when faced with a syntax error, it is
extremely hard to just ignore it and extract a useful result from
garbage.  In some cases it's possible; in most, it's just too much
work.
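
To make the "back out" behavior concrete, here is a minimal Python
sketch.  It is not Wget's code (html-parse.c is written in C); the
names NAME, VALUE, TAG and find_tags, and the exact regular
expressions, are assumptions made up for this illustration.

```python
import re

# Hypothetical, simplified notion of a well-formed tag: a name followed
# by bare-name or name=value attributes.  Attribute names are never quoted.
NAME = r"[A-Za-z0-9_-]+"
VALUE = r'''"[^"]*"|[^\s"'>]+'''
TAG = re.compile(
    r"<(/?" + NAME + r")"
    + r"(?:\s+" + NAME + r"(?:\s*=\s*(?:" + VALUE + r"))?)*"
    + r"\s*>"
)

def find_tags(text):
    """Yield the names of well-formed tags, backing out of malformed ones."""
    pos = 0
    while True:
        lt = text.find("<", pos)
        if lt == -1:
            return
        m = TAG.match(text, lt)
        if m:
            yield m.group(1)
            pos = m.end()
        else:
            # Not a well-formed tag: back out and resume scanning at the
            # character right after "<".
            pos = lt + 1
```

Fed the table cell quoted above, a scanner like this skips the A tag
entirely (so its href is never seen) but still finds the IMG tag that
follows it.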

> I can't call this a bug, but is Wget doing the right thing by
> ignoring the href altogether?

Loosely, html-parse.c will recognize the following things as tags.  (S
stands for a "strict" string, where only letters, numbers, hyphen and
underscore are allowed; L stands for a loosely matched string,
i.e. everything except whitespace and separators, such as quote, ">",
etc.)

<S S1=L1 S2=L2 ...>     -- normal tag with attributes
<S S1="L1" S2="L2" ...> -- like the above, but quotation allows more
                           leeway on values.
<S S1>                  -- the same as <S S1=S1>
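
The three forms above can be sketched roughly as follows.  This is an
illustrative Python approximation, not the actual html-parse.c logic;
parse_tag and the character classes chosen for S and L are my
assumptions for the sake of the example.

```python
import re

STRICT = r"[A-Za-z0-9_-]+"      # S: letters, numbers, hyphen, underscore
LOOSE = r"""[^\s"'>]+"""        # L: assumed set of excluded separators
VALUE = r'"([^"]*)"|(' + LOOSE + ")"

TAG_START = re.compile(r"<(" + STRICT + ")")
TAG_END = re.compile(r"\s*>")
# Attribute name is matched strictly; the "=value" part is optional.
ATTR = re.compile(r"\s+(" + STRICT + r")(?:\s*=\s*(?:" + VALUE + "))?")

def parse_tag(text, pos=0):
    """Return (name, attrs, end) or None if no well-formed tag starts at pos."""
    m = TAG_START.match(text, pos)
    if not m:
        return None
    name, pos, attrs = m.group(1), m.end(), {}
    while True:
        end = TAG_END.match(text, pos)
        if end:
            return name, attrs, end.end()
        a = ATTR.match(text, pos)
        if not a:
            return None                   # e.g. a quoted attribute name
        if a.group(2) is not None:        # <S S1="L1">
            value = a.group(2)
        elif a.group(3) is not None:      # <S S1=L1>
            value = a.group(3)
        else:                             # <S S1> is the same as <S S1=S1>
            value = a.group(1)
        attrs[a.group(1)] = value
        pos = a.end()
```

With these assumptions, a tag such as <foo bar> yields the attribute
bar with value "bar", while <foo "bar"> is rejected outright.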

Given the amount of broken HTML on the web, it's easy to imagine this
parser getting confused about what's what.  That is why the attribute
names are matched "strictly".

Now, it would be fairly easy to change the parser to match the
attribute names loosely like it does for values, but to parse the
above piece of broken HTML, it would have to be extended to handle:

    <S "L1">

(and, I assume)

    <S "L1"="L2">
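
For what it's worth, such an extension might look something like the
Python sketch below.  It is purely hypothetical: lenient_attrs is an
invented name, and treating a quoted name like a bare one (so that
<S "L1"> means <S L1=L1>) is my assumption, not anything Wget does.

```python
import re

STRICT = r"[A-Za-z0-9_-]+"      # strict string S
LOOSE = r"""[^\s"'>]+"""        # loose string L (assumed separators)
QUOTED = r'"[^"]*"'

TAG_START = re.compile(r"<(" + STRICT + ")")
TAG_END = re.compile(r"\s*>")
# The extension: an attribute name may be strict OR a quoted string.
ATTR = re.compile(
    r"\s+(" + STRICT + "|" + QUOTED + ")"
    + r"(?:\s*=\s*(" + QUOTED + "|" + LOOSE + "))?"
)

def lenient_attrs(tag):
    """Return the attributes of one tag as a dict, or None if malformed."""
    m = TAG_START.match(tag)
    if not m:
        return None
    pos, attrs = m.end(), {}
    while True:
        if TAG_END.match(tag, pos):
            return attrs
        a = ATTR.match(tag, pos)
        if not a:
            return None
        name = a.group(1).strip('"')
        # A missing value defaults to the name, as with <S S1>.
        value = a.group(2).strip('"') if a.group(2) is not None else name
        attrs[name] = value
        pos = a.end()
```

Under that assumption the nameless quoted string is simply carried
along as an odd attribute, and the href in the A tag becomes
reachable again.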

I wonder if that's worth it.  On the one hand, it might be helpful to
someone (e.g. you).  On the other hand, there will always be one more
piece of illegal HTML that Wget *could* handle if tweaked hard enough.
