"Ian Abbott" <[EMAIL PROTECTED]> writes:

> I came across this extract from a table on a website:
>
> <td ALIGN=CENTER VALIGN=CENTER WIDTH="120" HEIGHT="120"><a
> href="66B27885.htm" "msover1('Pic1','thumbnails/MO66B27885.jpg');"
> onMouseOut="msout1('Pic1','thumbnails/66B27885.jpg');"><img
> SRC="thumbnails/66B27885.jpg" NAME="Pic1" BORDER=0 ></a></td>
>
> Note the string beginning "msover1(", which seems to be an
> attribute value without a name, so that makes it illegal HTML.

I think it's even worse than that. My limited knowledge of SGML
taught me that <foo bar> is equivalent to <foo bar=bar>, which means
that given <foo bar>, "bar" is the attribute *name*, not the value.
If I understand SGML correctly, attribute names cannot be quoted.
This makes <foo "bar"> illegal even if <foo bar=10> or <foo bar> are
perfectly valid.

> I haven't traced what Wget is actually doing when it encounters
> this, but it doesn't treat "66B27885.htm" as a URL to be
> downloaded.

According to Wget's notion of HTML, the A tag in question is simply
not a well-formed tag. This means that Wget's parser will "back out"
to the character "a" (the second char of <a href="...") and continue
parsing from there.

> I can't call this a bug, but is Wget doing the right thing by
> ignoring the href altogether?

Generally, when faced with a syntax error, it is extremely hard to
just ignore it and extract a useful result from garbage. In some
cases it's possible; in most, it's just too much work.

Loosely, html-parse.c will recognize the following things as tags.
(S stands for a "strict" string: only letters, numbers, hyphen and
underscore allowed. L stands for a loosely matched string, i.e.
everything except whitespace and separators, such as quote, ">",
etc.)

  <S S1=L1 S2=L2 ...>      -- normal tag with attributes
  <S S1="L1" S2="L2" ...>  -- like the above, but quotation allows
                              more leeway on values
  <S S1>                   -- the same as <S S1=S1>

Given the amount of broken HTML on the web, it's easy to imagine
this parser getting confused about what's what. That is why the
attribute names are matched "strictly".

Now, it would be fairly easy to change the parser to match the
attribute names loosely like it does for values, but to parse the
above piece of broken HTML, it would have to be extended to handle:

  <S "L1">
  (and, I assume) <S "L1"="L2">

I wonder if that's worth it. On the one hand, it might be helpful to
someone (e.g. you).
On the other hand, there will always be one more piece of illegal HTML that Wget *could* handle if tweaked hard enough.
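
To make the strict/loose distinction concrete, here is a rough sketch
of the three tag forms listed above. It is written in Python for
brevity and is not the code from html-parse.c; the function name
parse_tag, the regular expressions, and the exact set of characters
treated as separators are all assumptions made for illustration.

```python
import re

# "S": strict string -- letters, digits, hyphen, underscore only.
STRICT = r"[A-Za-z0-9_-]+"
# "L": loose string -- assumed here to mean anything except
# whitespace, quotes and ">".
LOOSE = r"""[^\s"'>]+"""

DQ = r'"(?P<dq>[^"]*)"'        # double-quoted value
SQ = r"'(?P<sq>[^']*)'"        # single-quoted value
BARE = rf"(?P<bare>{LOOSE})"   # unquoted value

TAG_OPEN = re.compile(rf"<(?P<tag>{STRICT})")
ATTR = re.compile(rf"\s+(?P<name>{STRICT})(?:\s*=\s*(?:{DQ}|{SQ}|{BARE}))?")
TAG_CLOSE = re.compile(r"\s*>")


def parse_tag(text, pos=0):
    """Return (tag, attrs, end) if a well-formed tag starts at pos, else None."""
    m = TAG_OPEN.match(text, pos)
    if m is None:
        return None
    tag, pos = m.group("tag"), m.end()
    attrs = {}
    while True:
        a = ATTR.match(text, pos)
        if a is None:
            break
        candidates = (a.group("dq"), a.group("sq"), a.group("bare"))
        # <S S1> is treated the same as <S S1=S1>.
        value = next((v for v in candidates if v is not None), a.group("name"))
        attrs[a.group("name")] = value
        pos = a.end()
    end = TAG_CLOSE.match(text, pos)
    if end is None:
        # Not a well-formed tag; a real parser would back out and
        # resume scanning one character past the "<".
        return None
    return tag, attrs, end.end()


# The A tag from the quoted extract, and a repaired variant in which
# the nameless value is given the (assumed) name onMouseOver.
broken = ("<a href=\"66B27885.htm\" "
          "\"msover1('Pic1','thumbnails/MO66B27885.jpg');\" "
          "onMouseOut=\"msout1('Pic1','thumbnails/66B27885.jpg');\">")
fixed = broken.replace(' "msover1', ' onMouseOver="msover1', 1)

print(parse_tag(broken))   # None
print(parse_tag(fixed))
```

With these rules the quoted string that follows href has no strict
name in front of it, so the whole A tag is rejected and the href goes
with it. Accepting <S "L1"> would mean relaxing the name pattern in
ATTR, which is exactly the trade-off discussed above.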