Re: A strange bit of HTML

2002-01-16 Thread Hrvoje Niksic

Ian Abbott [EMAIL PROTECTED] writes:

 I came across this extract from a table on a website:
 
 <td ALIGN=CENTER VALIGN=CENTER WIDTH=120 HEIGHT=120><a
 href=66B27885.htm "msover1('Pic1','thumbnails/MO66B27885.jpg');"
 onMouseOut="msout1('Pic1','thumbnails/66B27885.jpg');"><img
 SRC=thumbnails/66B27885.jpg NAME=Pic1 BORDER=0 ></a></td>
 
 Note the string beginning "msover1(", which seems to be an
 attribute value without a name, so that makes it illegal HTML.

I think it's even worse than that.  My limited knowledge of SGML
taught me that <foo bar> is equivalent to <foo bar=bar>, which means
that given <foo "bar">, "bar" is the attribute *name*, not the value.

If I understand SGML correctly, attribute names cannot be quoted.
This makes <foo "bar"> illegal even if <foo bar="10"> or <foo bar> are
perfectly valid.

 I haven't traced what Wget is actually doing when it encounters
 this, but it doesn't treat 66B27885.htm as a URL to be
 downloaded.

According to Wget's notion of HTML, the A tag in question is simply
not a well-formed tag.  This means that Wget's parser will back out
to the character "a" (the second char of "<a href=...") and continue
parsing from there.  Generally, when faced with a syntax error, it is
extremely hard to just ignore it and extract a useful result from
garbage.  In some cases it's possible; in most, it's just too much
work.

Loosely, html-parse.c will recognize the following things as tags.  (S
stands for a strict string, with only letters, numbers, hyphen and
underscore allowed; L stands for a loosely matched string,
i.e. everything except whitespace and separators, such as a quote, ">",
etc.)

 I can't call this a bug, but is Wget doing the right thing by
 ignoring the href altogether?

<S S1=L1 S2=L2 ...>     -- normal tag with attributes
<S S1="L1" S2="L2" ...> -- like the above, but quotation allows more
                           leeway on values.
<S S1>                  -- the same as <S S1=S1>

Given the amount of broken HTML on the web, it's easy to imagine this
parser getting confused about what's what.  That is why the attribute
names are matched strictly.
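
As an illustration only (this is not html-parse.c), the three forms
listed above can be sketched in Python.  The names STRICT, LOOSE, ATTR
and parse_tag are hypothetical, and the exact character sets are an
assumption based on the description of S and L given here.

```python
import re

# S: strict string (letters, digits, hyphen, underscore).
STRICT = r"[A-Za-z0-9_-]+"
# L: loose string (assumed: anything except whitespace, quotes, '=', '<', '>').
LOOSE = r"[^\s\"'=<>]+"

ATTR = re.compile(
    rf"""\s+(?P<name>{STRICT})        # attribute name, matched strictly
         (?:\s*=\s*
            (?: "(?P<dq>[^"]*)"       # double-quoted value
              | '(?P<sq>[^']*)'       # single-quoted value
              | (?P<bare>{LOOSE})     # unquoted value, matched loosely
            )
         )?""",
    re.VERBOSE,
)
TAG_OPEN = re.compile(rf"<(?P<tag>{STRICT})")
TAG_CLOSE = re.compile(r"\s*/?>")


def parse_tag(text, pos=0):
    """Return (tag, attrs, end) if a well-formed tag starts at pos, else None."""
    m = TAG_OPEN.match(text, pos)
    if m is None:
        return None
    tag = m.group("tag").lower()
    pos = m.end()
    attrs = {}
    while True:
        close = TAG_CLOSE.match(text, pos)
        if close:
            return tag, attrs, close.end()
        a = ATTR.match(text, pos)
        if a is None:
            # Syntax error: give up on this tag; a caller would resume
            # scanning just after the '<'.
            return None
        key = a.group("name").lower()
        values = [a.group(g) for g in ("dq", "sq", "bare")]
        # A bare name such as <S S1> is treated as <S S1=S1>.
        attrs[key] = next((v for v in values if v is not None), key)
        pos = a.end()
```

With this sketch, a tag containing a quoted string where an attribute
name is expected is rejected as a whole, so its href is never seen,
while the same tag with a proper onMouseOver= name parses normally.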

Now, it would be fairly easy to change the parser to match the
attribute names loosely like it does for values, but to parse the
above piece of broken HTML, it would have to be extended to handle:

<S "L1">

(and, I assume)

<S "L1"=L2>

I wonder if that's worth it.  On the one hand, it might be helpful to
someone (e.g. you).  On the other hand, there will always be one more
piece of illegal HTML that Wget *could* handle if tweaked hard enough.



Re: A strange bit of HTML

2002-01-16 Thread Hrvoje Niksic

[EMAIL PROTECTED] writes:

 That sounds like they wanted onMouseOver="msover1(...)"

Which Wget would, by the way, have handled perfectly.



Re: A strange bit of HTML

2002-01-16 Thread jens . roesner

Hi there!

 <td ALIGN=CENTER VALIGN=CENTER WIDTH=120 HEIGHT=120><a
 href=66B27885.htm "msover1('Pic1','thumbnails/MO66B27885.jpg');"
 onMouseOut="msout1('Pic1','thumbnails/66B27885.jpg');"><img
 SRC=thumbnails/66B27885.jpg NAME=Pic1 BORDER=0 ></a></td>

BTW: it is valign=middle :P
(I detest AllCaps and property=value instead of property="value".)

 That sounds like they wanted onMouseOver="msover1(...)"
 It's also likely that msover1 is a Javascript function :-(
Definitively, I would say.


 I can't call this a bug, but is Wget doing the right thing by
 ignoring the href altogether?
 Until there's an ESP package that can guess what the author intended,
 I doubt wget has any choice but to ignore the defective tag. 
*g*
Seriously, I think you guys are too strict.
Similar discussions have come up numerous times.
If the HTML code says 
<a href=URL yaddayada my-Mother=Shopping%5 "going">supermarket</a>
Why can't wget just ignore everything after ...URL?
Is there any instance where this would create unwanted behaviour 
for the user? It does not matter if there is a javascript called, a CSS
broken, or the webmaster has bad breath.
Now, if a mouseover picture is loaded, 
wget cannot retrieve it anyway, no matter if the javascript 
is correct or malformed, right?

 In addition,
 wget should send an email to webmaster@offending domain,
 complaining about the invalid HTML :-)
/me signs this petition!
In addition, mails should be written for bad (=unreadable) 
combos of font colour and background colour, 
animated gifs and blink tags.

Kind regards
Jens

-- 
GMX - Die Kommunikationsplattform im Internet.
http://www.gmx.net




Re: A strange bit of HTML

2002-01-16 Thread Hrvoje Niksic

[EMAIL PROTECTED] writes:

 Until there's an ESP package that can guess what the author
 intended, I doubt wget has any choice but to ignore the defective
 tag.
 
 Seriously, I think you guys are too strict.
 Similar discussions have come up numerous times.
 If the HTML code says 
 <a href=URL yaddayada my-Mother=Shopping%5 "going">supermarket</a>
 Why can't wget just ignore everything after ...URL?

Because, as he said, Wget can parse text, not read minds.  For
example, you must know where a tag ends to be able to look for the
next one, or to find comments.  It is not enough to look for '>' to
determine the tag's ending -- something like <img alt="my > dog"
src="foo"> is a perfectly legal tag.
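
To make the point concrete, here is a minimal Python sketch (not
Wget's code; find_tag_end is a hypothetical helper) that looks for the
'>' closing a tag while skipping any '>' inside a quoted attribute
value.  Even this much requires tracking quotes, which is already a
small piece of parsing.

```python
def find_tag_end(text, start=0):
    """Return the index just past the '>' that closes the tag at start, or -1."""
    quote = None  # the quote character we are currently inside, if any
    for i in range(start, len(text)):
        ch = text[i]
        if quote:
            if ch == quote:
                quote = None  # closing quote found
        elif ch in "\"'":
            quote = ch        # entering a quoted value
        elif ch == ">":
            return i + 1      # an unquoted '>' ends the tag
    return -1                 # unterminated tag


tag = '<img alt="my > dog" src="foo">'
naive_end = tag.find(">") + 1      # stops inside the alt value
quoted_end = find_tag_end(tag)     # reaches the real end of the tag
```

Here naive_end cuts the tag off in the middle of the alt text, while
quoted_end covers the whole tag.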

In other words, you have to destructure the tag, not only to retrieve
the URLs, but to be able to continue parsing.  If the tag is not
syntactically valid, the parsing fails and Wget moves on to other
tags.  Wget has never been able to pick apart every piece of broken
HTML.

As for us being strict, I can only respond with a mini-rant...

Wget doesn't create web standards, but it tries to support them.
Spanning the chasm between the standards as written and the actual
crap generated by HTML generators feels a lot like shoveling shit.
Some amount of shoveling is necessary and is performed by all small
programs to protect their users, but there has to be a point where you
draw the line.  There is only so much shit Wget can shovel.

I'm not saying Ian's example is where the line has to be drawn.  (Your
example is equivalent to Ian's -- Wget would only choke on the last
"going" part).  But I'm sure that the line exists and that it is not
far from those two examples.



Re: A strange bit of HTML

2002-01-16 Thread jens . roesner

Hi Hrvoje!

First, I did/do not mean to offend/attack you, 
just in case that my suspicion about you being 
pi55ed because of my post is not totally unjustified.

  If the HTML code says 
  <a href=URL yaddayada my-Mother=Shopping%5 "going">supermarket</a>
  Why can't wget just ignore everything after ...URL?
 
 Because, as he said, Wget can parse text, not read minds.  
Ah *slapsforehead* /me stupid.

 For
 example, you must know where a tag ends to be able to look for the
 next one, or to find comments.  It is not enough to look for '>' to
 determine the tag's ending -- something like <img alt="my > dog"
 src="foo"> is a perfectly legal tag.
okok, granted, to dissolve
a href=foo.html target=_topimg src=pic.htm.jpg name=index.html
alt=oopsbr-fool.htm-/a
for example, you'd really have a hard time, I suppose.
I honestly did not think of people messing with < and >.

 As for us being strict, I can only respond with a mini-rant...
 Wget doesn't create web standards, but it tries to support them.
 Spanning the chasm between the standards as written and the actual
 crap generated by HTML generators feels a lot like shoveling shit.
[rant name=my rant]
Ah, tell me about it. Although I come from the other side 
(trying to write my sites -with a text editor- so that they look ok on
different browsers and remain HTML compliant) I surely know how much 'fun' it can
be to work with standards.
Especially if they were set by a committee as intelligent and just (as in
justice) as W3C...
BTW, as an engineering student I am fully aware how much 
help good standards can be.
[/rant]


 Some amount of shoveling is necessary and is performed by all small
 programs to protect their users, but there has to be a point where you
 draw the line.  There is only so much shit Wget can shovel.
Unfortunately, the amount of shit on the web will not decrease.
I fear that the opposite may be true.
no, wait, I am pretty sure...

 I'm not saying Ian's example is where the line has to be drawn.  (Your
 example is equivalent to Ian's -- Wget would only choke on the last
 "going" part).  But I'm sure that the line exists and that it is not
 far from those two examples.
Ok, but do I understand you correctly that these two examples (mine was
intended to be equivalent, but without JS) should be on the "parse and
retrieve" side of this line, not the "ignore and blame Frontpage" side?

CU
Jens
