George Prekas wrote:

> I have found a bug in Wget version 1.8.2 concerning comment handling
> ( <!-- comment --> ). Take a look at the following illegal HTML code:
> <HTML>
> <BODY>
> <a href="test1.html">test1.html</a>
> <!-->
> <a href="test2.html">test2.html</a>
> <!-->
> </BODY>
> </HTML>
>
> Now, save the above snippet as test.html and try wget -Fi test.html. You
> will notice that it doesn't recognise the second link. I have found a
> solution to the above situation and have properly patched html-parse.c and
> I would like some info on how can I give you the patch.

The HTML code is legitimate, but it only contains one link. The following
three lines constitute a single comment:

<!-->
<a href="test2.html">test2.html</a>
<!-->

A comment begins at "<!--" and ends at "-->". The trailing ">" on the first
of these lines and the leading "<!" on the third of these lines are part of
the comment. That is, the comment text is:

>
<a href="test2.html">test2.html</a>
<!
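
To make the rule concrete, here is a minimal Python sketch of a scanner that applies it literally. It is only an illustration, not the code in html-parse.c; the names split_comments, links and SNIPPET are made up for this example, and the href regex is a deliberate simplification.

```python
import re

def split_comments(html):
    """Split html into (visible_text, comment_texts).

    A comment runs from "<!--" to the next "-->". The search for the
    closing "-->" starts after the four characters of the opener, so the
    "--" in "<!-->" cannot serve as both opener and closer.
    """
    visible, comments = [], []
    pos = 0
    while True:
        start = html.find("<!--", pos)
        if start == -1:
            visible.append(html[pos:])
            break
        visible.append(html[pos:start])
        end = html.find("-->", start + 4)
        if end == -1:
            # Unterminated comment: treat the rest of the input as comment text.
            comments.append(html[start + 4:])
            break
        comments.append(html[start + 4:end])
        pos = end + 3
    return "".join(visible), comments

def links(html):
    """Return href values found outside comments (simplified matching)."""
    visible, _ = split_comments(html)
    return re.findall(r'href="([^"]*)"', visible)

# The snippet from the original report.
SNIPPET = """<HTML>
<BODY>
<a href="test1.html">test1.html</a>
<!-->
<a href="test2.html">test2.html</a>
<!-->
</BODY>
</HTML>
"""

if __name__ == "__main__":
    _, comments = split_comments(SNIPPET)
    print(repr(comments[0]))
    print(links(SNIPPET))
```

Under this reading the only comment is the three-line text shown above, and the only link left outside it is test1.html.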

At any rate, one should not expect predictable behavior for broken HTML.
What should wget do with the following?

<a href="test1.html">test1.html
<!-->
</a>
<!-->

In one version, it might choose to follow the link to test1.html and in
another version it might not.
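
As a rough illustration of why the outcome can differ, the Python sketch below runs two hypothetical link extractors over that input. Neither is taken from any wget release; strip_comments, lenient_links, strict_links and BROKEN are names invented for this example, and both regexes are simplifications.

```python
import re

def strip_comments(html):
    # Remove each span from "<!--" to the nearest following "-->".
    return re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)

def lenient_links(html):
    # Hypothetical strategy 1: accept an href as soon as the <a ...> tag is seen.
    return re.findall(r'<a\s+href="([^"]*)"', strip_comments(html))

def strict_links(html):
    # Hypothetical strategy 2: accept an href only if a closing </a> follows.
    return re.findall(r'<a\s+href="([^"]*)"[^>]*>.*?</a>',
                      strip_comments(html), flags=re.DOTALL)

# The broken input from the message: the only </a> sits inside the comment.
BROKEN = '<a href="test1.html">test1.html\n<!-->\n</a>\n<!-->\n'

if __name__ == "__main__":
    print(lenient_links(BROKEN))
    print(strict_links(BROKEN))
```

Because the closing </a> falls inside the comment, the first extractor reports test1.html and the second reports nothing. Both are defensible ways to handle input that is broken to begin with.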

Tony
