----- Original Message -----
From: "Tony Lewis" <[EMAIL PROTECTED]>
To: "George Prekas" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Saturday, May 31, 2003 8:47 AM
Subject: Re: Comment handling

> George Prekas wrote:
>
> > I have found a bug in Wget version 1.8.2 concerning comment handling
> > ( <!-- comment --> ). Take a look at the following illegal HTML code:
> > <HTML>
> > <BODY>
> > <a href="test1.html">test1.html</a>
> > <!-->
> > <a href="test2.html">test2.html</a>
> > <!-->
> > </BODY>
> > </HTML>
> >
> > Now, save the above snippet as test.html and try wget -Fi test.html. You
> > will notice that it doesn't recognise the second link. I have found a
> > solution to the above situation and have properly patched html-parse.c
> > and I would like some info on how can I give you the patch.
>
> The HTML code is legitimate, but it only contains one link. The following
> three lines constitute a single comment:
>
> <!-->
> <a href="test2.html">test2.html</a>
> <!-->
>
> A comment begins at "<!--" and ends at "-->". The trailing ">" on the first
> of these lines and the leading "<!" on the third of these lines are part of
> the comment. That is, the comment text is:
>
> >
> <a href="test2.html">test2.html</a>
> <!
>
> At any rate, one should not expect predictable behavior for broken HTML.
> What should wget do with the following?

You are probably right. I pointed this out because I have seen pages that
use <!--------------> (with lots of dashes) as a separator, and although
Internet Explorer shows the page, wget cannot download it correctly. What do
you think about finishing the comment at the ">"?

> <a href="test1.html">test1.html
> <!-->
> </a>
> <!-->
>
> In one version, it might choose to follow the link to test1.html and in
> another version it might not.
>
> Tony
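
To make the two readings concrete, here is a minimal Python sketch. It is not the code in html-parse.c and not my patch; the function names are made up for illustration. The strict version follows Tony's rule (a comment runs from "<!--" to the next "-->"). The lenient version is only one possible interpretation of "finishing the comment at the >": it assumes that "<!--" followed by nothing but dashes and then ">" is a complete comment, and otherwise falls back to the strict rule.

```python
import re

# Very rough link matcher, good enough for the test snippet only.
LINK_RE = re.compile(r'<a\s+href="([^"]*)"', re.IGNORECASE)

SNIPPET = """<HTML>
<BODY>
<a href="test1.html">test1.html</a>
<!-->
<a href="test2.html">test2.html</a>
<!-->
</BODY>
</HTML>
"""


def strip_comments_strict(html):
    """Drop everything from each '<!--' up to the next '-->'."""
    out = []
    pos = 0
    while True:
        start = html.find("<!--", pos)
        if start == -1:
            out.append(html[pos:])
            break
        out.append(html[pos:start])
        end = html.find("-->", start + 4)
        if end == -1:
            break  # unterminated comment: the rest is dropped
        pos = end + 3
    return "".join(out)


def strip_comments_lenient(html):
    """Assumed variant: '<!--' + only dashes + '>' closes the comment."""
    out = []
    pos = 0
    while True:
        start = html.find("<!--", pos)
        if start == -1:
            out.append(html[pos:])
            break
        out.append(html[pos:start])
        i = start + 4
        while i < len(html) and html[i] == "-":
            i += 1
        if i < len(html) and html[i] == ">":
            pos = i + 1  # comment such as <!--> or <!------> ends here
            continue
        end = html.find("-->", start + 4)
        if end == -1:
            break  # unterminated comment: the rest is dropped
        pos = end + 3
    return "".join(out)


def links(html, strip):
    """Return the href values found after comments are removed."""
    return LINK_RE.findall(strip(html))


if __name__ == "__main__":
    print(links(SNIPPET, strip_comments_strict))   # ['test1.html']
    print(links(SNIPPET, strip_comments_lenient))  # ['test1.html', 'test2.html']
```

With the strict rule the second link sits inside the comment and is lost, as Tony describes. With the lenient rule each <!--> is an empty comment and both links are found.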