Re: A strange bit of HTML
Ian Abbott [EMAIL PROTECTED] writes:

> I came across this extract from a table on a website:
>
> <td ALIGN=CENTER VALIGN=CENTER WIDTH=120 HEIGHT=120><a href=66B27885.htm "msover1('Pic1','thumbnails/MO66B27885.jpg');" onMouseOut="msout1('Pic1','thumbnails/66B27885.jpg');"><img SRC=thumbnails/66B27885.jpg NAME=Pic1 BORDER=0></a></td>
>
> Note the string beginning "msover1(", which seems to be an attribute value without a name, so that makes it illegal HTML.

I think it's even worse than that. My limited knowledge of SGML taught me that <foo bar> is equivalent to <foo bar=bar>, which means that given <foo bar>, bar is the attribute *name*, not the value. If I understand SGML correctly, attribute names cannot be quoted. This makes <foo "bar"> illegal even if <foo bar="10"> and <foo bar> are perfectly valid.

> I haven't traced what Wget is actually doing when it encounters this, but it doesn't treat 66B27885.htm as a URL to be downloaded.

According to Wget's notion of HTML, the A tag in question is simply not a well-formed tag. This means that Wget's parser will back out to the character "a" (the second char of "<a href=...") and continue parsing from there.

Generally, when faced with a syntax error, it is extremely hard to just ignore it and extract a useful result from garbage. In some cases it's possible; in most, it's just too much work.

> I can't call this a bug, but is Wget doing the right thing by ignoring the href altogether?

Loosely, html-parse.c will recognize the following things as tags. (S stands for a strict string, where only letters, numbers, hyphen and underscore are allowed; L stands for a loosely matched string, i.e. everything except whitespace and separators such as quote, >, etc.)

<S S1=L1 S2=L2 ...>      -- normal tag with attributes
<S S1="L1" S2="L2" ...>  -- like the above, but quotation allows more leeway on values
<S S1>                   -- the same as <S S1=S1>

Given the amount of broken HTML on the web, it's easy to imagine this parser getting confused about what's what. That is why the attribute names are matched strictly.

Now, it would be fairly easy to change the parser to match the attribute names loosely like it does for values, but to parse the above piece of broken HTML, it would have to be extended to handle:

<S "L1">
(and, I assume) <S "L1"=L2>

I wonder if that's worth it. On the one hand, it might be helpful to someone (e.g. you). On the other hand, there will always be one more piece of illegal HTML that Wget *could* handle if tweaked hard enough.
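For readers who want to see the rules above in action, here is a minimal sketch in Python. It is not Wget's html-parse.c, only an illustration of "strict names, loose values" under my own assumptions: the names parse_tag, STRICT, LOOSE, BROKEN and FIXED are hypothetical, and the two sample tags are shortened stand-ins for the extract under discussion.

```python
import re

# "S": strict string -- letters, digits, hyphen and underscore only.
STRICT = r"[A-Za-z0-9_-]+"
# "L": loose string -- anything except whitespace, quotes and '>'.
LOOSE = r"""[^\s"'>]+"""

ATTR = re.compile(
    r"""\s+(?P<name>%s)                # attribute name, matched strictly
        (?:\s*=\s*
           (?: "(?P<dq>[^"]*)"         # double-quoted value
             | '(?P<sq>[^']*)'         # single-quoted value
             | (?P<bare>%s)            # unquoted value, matched loosely
           )
        )?""" % (STRICT, LOOSE),
    re.VERBOSE,
)
TAG_START = re.compile(r"<(?P<tag>%s)" % STRICT)
TAG_END = re.compile(r"\s*/?>")


def parse_tag(text):
    """Return (tag, attrs) for a well-formed opening tag, else None."""
    start = TAG_START.match(text)
    if start is None:
        return None
    pos = start.end()
    attrs = {}
    while True:
        attr = ATTR.match(text, pos)
        if attr is None:
            break
        candidates = (attr["dq"], attr["sq"], attr["bare"])
        # A bare name such as <S S1> is treated like <S S1=S1>.
        value = next((v for v in candidates if v is not None), attr["name"])
        attrs[attr["name"].lower()] = value
        pos = attr.end()
    # Anything left over before '>' makes the whole tag malformed.
    if TAG_END.match(text, pos) is None:
        return None
    return start["tag"].lower(), attrs


# Shortened, hypothetical versions of the tag from the thread.
BROKEN = """<a href="66B27885.htm" "msover1('Pic1')" onMouseOut="msout1('Pic1')">"""
FIXED = """<a href="66B27885.htm" onMouseOver="msover1('Pic1')" onMouseOut="msout1('Pic1')">"""

print(parse_tag(BROKEN))
print(parse_tag(FIXED))
```

In this sketch the quoted string with no name in front of it cannot match as an attribute, so the whole tag is rejected and the href is lost with it, which is the behaviour described above. Accepting a loosely matched, quoted name would mean changing the `name` group, and that is exactly the extension being weighed.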
Re: A strange bit of HTML
[EMAIL PROTECTED] writes:

> That sounds like they wanted onMouseOver="msover1(...)"

Which Wget would, by the way, have handled perfectly.
Re: A strange bit of HTML
Hi there!

> <td ALIGN=CENTER VALIGN=CENTER WIDTH=120 HEIGHT=120><a href=66B27885.htm "msover1('Pic1','thumbnails/MO66B27885.jpg');" onMouseOut="msout1('Pic1','thumbnails/66B27885.jpg');"><img SRC=thumbnails/66B27885.jpg NAME=Pic1 BORDER=0></a></td>

BTW: it is valign=middle :P (I detest AllCaps and property=value instead of property="value".)

> That sounds like they wanted onMouseOver="msover1(...)"
> It's also likely that msover1 is a Javascript function :-(

Definitely, I would say.

> > I can't call this a bug, but is Wget doing the right thing by ignoring the href altogether?
> Until there's an ESP package that can guess what the author intended, I doubt wget has any choice but to ignore the defective tag.

*g* Seriously, I think you guys are too strict. Similar discussions have spawned numerous times. If the HTML code says

a href=URL yaddayada my-Mother=Shopping%5 goingsupermarket/a

Why can't wget just ignore everything after ...URL? Is there any instance where this would create unwanted behaviour for the user? It does not matter if a javascript is called, a CSS is broken, or the webmaster has bad breath. Now, if a mouseover picture is loaded, wget cannot retrieve it anyway, no matter whether the javascript is correct or malformed, right?

> In addition, wget should send an email to webmaster@offending domain, complaining about the invalid HTML :-)

/me signs this petition! In addition, mails should be written for bad (=unreadable) combos of font colour and background colour, animated gifs and blink tags.

Kind regards
Jens

--
GMX - The communication platform on the Internet. http://www.gmx.net
Re: A strange bit of HTML
[EMAIL PROTECTED] writes:

> > Until there's an ESP package that can guess what the author intended, I doubt wget has any choice but to ignore the defective tag.
>
> Seriously, I think you guys are too strict. Similar discussions have spawned numerous times. If the HTML code says
>
> a href=URL yaddayada my-Mother=Shopping%5 goingsupermarket/a
>
> Why can't wget just ignore everything after ...URL?

Because, as he said, Wget can parse text, not read minds.

For example, you must know where a tag ends to be able to look for the next one, or to find comments. It is not enough to look for '>' to determine the tag's ending -- something like <img alt="my > dog" src="foo"> is a perfectly legal tag. In other words, you have to destructure the tag, not only to retrieve the URLs, but to be able to continue parsing. If the tag is not syntactically valid, the parsing fails and Wget moves on to other tags. Wget has never been able to pick apart every piece of broken HTML.

As for us being strict, I can only respond with a mini-rant...

Wget doesn't create web standards, but it tries to support them. Spanning the chasm between the standards as written and the actual crap generated by HTML generators feels a lot like shoveling shit. Some amount of shoveling is necessary and is performed by all small programs to protect their users, but there has to be a point where you draw the line. There is only so much shit Wget can shovel.

I'm not saying Ian's example is where the line has to be drawn. (Your example is equivalent to Ian's -- Wget would only choke on the last "going" part.) But I'm sure that the line exists and that it is not far from those two examples.
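To make the point about tag endings concrete, here is a small sketch in Python (not Wget's C code). It assumes the example tag above had a '>' inside the quoted alt value; the function names naive_tag_end and quoted_tag_end are hypothetical. It compares a plain search for '>' with a scan that keeps track of quoting.

```python
def naive_tag_end(text, start=0):
    """Index just past the first '>' at or after start, ignoring quotes."""
    i = text.find(">", start)
    return -1 if i == -1 else i + 1


def quoted_tag_end(text, start=0):
    """Index just past the first '>' that is not inside a quoted value."""
    quote = None  # the quote character we are currently inside, if any
    for i in range(start, len(text)):
        ch = text[i]
        if quote is not None:
            if ch == quote:
                quote = None      # closing quote: back to plain tag text
        elif ch in "\"'":
            quote = ch            # opening quote: '>' is now just data
        elif ch == ">":
            return i + 1
    return -1                     # ran off the end: the tag never closes


tag = '<img alt="my > dog" src="foo">'
print(repr(tag[:naive_tag_end(tag)]))
print(repr(tag[:quoted_tag_end(tag)]))
```

The naive search stops inside the alt text, so everything after that point would be parsed from the wrong position. The quote-aware scan only works if the quotes balance, which is why a stray quote in a tag can derail the parse of that tag entirely.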
Re: A strange bit of HTML
Hi Hrvoje!

First, I did/do not mean to offend/attack you, just in case my suspicion that you are pi55ed because of my post is not totally unjustified.

> > If the HTML code says
> > a href=URL yaddayada my-Mother=Shopping%5 goingsupermarket/a
> > Why can't wget just ignore everything after ...URL?
> Because, as he said, Wget can parse text, not read minds.

Ah *slapsforehead* /me stupid.

> For example, you must know where a tag ends to be able to look for the next one, or to find comments. It is not enough to look for '>' to determine the tag's ending -- something like <img alt="my > dog" src="foo"> is a perfectly legal tag.

okok, granted, to dissolve

a href=foo.html target=_topimg src=pic.htm.jpg name=index.html alt=oopsbr-fool.htm-/a

for example, you'd really have a hard time, I suppose. I honestly did not think of people messing with < and >.

> As for us being strict, I can only respond with a mini-rant... Wget doesn't create web standards, but it tries to support them. Spanning the chasm between the standards as written and the actual crap generated by HTML generators feels a lot like shoveling shit.

[rant name=my rant]
Ah, tell me about it. Although I come from the other side (trying to write my sites -with a text editor- so that they look ok on different browsers and remain HTML compliant), I surely know how much 'fun' it can be to work with standards. Especially if they were set by a committee as intelligent and just (as in justice) as the W3C... BTW, as an engineering student I am fully aware how much help good standards can be.
[/rant]

> Some amount of shoveling is necessary and is performed by all small programs to protect their users, but there has to be a point where you draw the line. There is only so much shit Wget can shovel.

Unfortunately, the amount of shit on the web will not decrease. I fear that the opposite may be true. No, wait, I am pretty sure...

> I'm not saying Ian's example is where the line has to be drawn. (Your example is equivalent to Ian's -- Wget would only choke on the last "going" part.) But I'm sure that the line exists and that it is not far from those two examples.

Ok, but do I understand you correctly that these two examples (mine was intended to be equivalent, but without JS) should be on the "parse and retrieve" side of this line, not the "ignore and blame Frontpage" side?

CU
Jens