date:20010407

HTML::Parser: report implicit events

2001-04-07 Thread Bjoern Hoehrmann


Hi,

I wonder why HTML::Parser does not report implicit events. A conforming
parser should report them to ensure that a correct parse tree
can be built. An example:

  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
  <title></title>
  <p>some text<img alt='' src=''> more text<h1>heading</h1>

should report (omitting text and possibly default events)

  declaration
  start (html)
    start (head)
      start (title)
      end (title)
    end (head)
    start (body)
      start (p)
        start (img)
        end (img)
      end (p)
      start (h1)
      end (h1)
    end (body)
  end (html)

I request an option to get those events.
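
A rough sketch of what such an option might do, written in Python with the
standard-library html.parser rather than against HTML::Parser itself. The
rule tables HEAD_ONLY, EMPTY and CLOSES_P, the class ImplicitEvents and the
function implicit_events are made up for this illustration and cover only a
sliver of HTML 4.01. Text events are omitted, as in the list above.

```python
from html.parser import HTMLParser

# Hypothetical, deliberately tiny rule tables (not generated from the DTD).
HEAD_ONLY = {"title", "meta", "link", "base", "style"}
EMPTY = {"img", "br", "hr", "meta", "link", "base", "input"}
CLOSES_P = {"p", "h1", "h2", "h3", "h4", "h5", "h6",
            "div", "table", "ul", "ol", "pre"}


class ImplicitEvents(HTMLParser):
    """Records start/end events, including ones implied by omitted tags."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open elements, outermost first
        self.seen = set()   # elements that have been opened at least once
        self.events = []

    def _start(self, tag):
        self.events.append(f"start ({tag})")
        self.seen.add(tag)
        if tag in EMPTY:
            self.events.append(f"end ({tag})")   # empty elements close at once
        else:
            self.stack.append(tag)

    def _end_through(self, tag):
        # Close open elements, innermost first, up to and including `tag`.
        while self.stack:
            top = self.stack.pop()
            self.events.append(f"end ({top})")
            if top == tag:
                break

    def _enter(self, section):
        # Make sure html and the requested section (head or body) are open.
        if "html" not in self.seen:
            self._start("html")
        if section == "head" and "body" in self.seen:
            return                       # too late for an implicit head
        if section == "body" and "head" in self.stack:
            self._end_through("head")    # implicit </head>
        if section not in self.seen:
            self._start(section)

    def handle_decl(self, decl):
        self.events.append("declaration")

    def handle_starttag(self, tag, attrs):
        if tag == "html":
            if "html" not in self.seen:
                self._start("html")
            return
        if tag in ("head", "body"):
            self._enter(tag)
            return
        self._enter("head" if tag in HEAD_ONLY else "body")
        if tag in CLOSES_P and self.stack and self.stack[-1] == "p":
            self._end_through("p")       # implicit </p>
        self._start(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:            # stray end tags are ignored here
            self._end_through(tag)

    def close(self):
        super().close()
        self._end_through(None)          # end of input closes everything


def implicit_events(markup):
    parser = ImplicitEvents()
    parser.feed(markup)
    parser.close()
    return parser.events
```

Feeding the three-line example to implicit_events() gives the event list in
the order shown above. A real implementation would have to derive the tables
from the DTD instead of hard-coding them.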
-- 
Björn Höhrmann ^ mailto:[EMAIL PROTECTED] ^ http://www.bjoernsworld.de
am Badedeich 7  Telefon: +49(0)4667/981028  http://bjoern.hoehrmann.de
25899 Dagebüll # PGP Pub. KeyID: 0xA4357E78 # http://learn.to/quote [!]e
-- listen, learn, contribute -- David J. Marcus

HTML::Tagset: p_closure_barriers

2001-04-07 Thread Bjoern Hoehrmann


Hi,

the HTML::Tagset manual defines the array
@HTML::Tagset::p_closure_barriers. I don't understand the rationale
behind this. The given example:

  <html>
    <head>
      <title>foo</title>
    </head>
    <body>
      <p>foo
        <table>
          <tr>
            <td>
               foo
               <p>bar
            </td>
          </tr>
        </table>
      </p>
    </body>
  </html>

_isn't_ legal. In SGML, elements that have optional end tags are
implicitly closed when an element occurs that cannot be contained inside
them (i.e. that is not part of the content model). Try to
validate the example at [1] and you'll get 

Line 17, character 8: 
</p>
   ^Error: end tag for element P which is not open; try removing the
end tag or check for improper nesting of elements

The parse tree of the document is something like

  <html>
    <head>
      <title>
        foo
      </title>
    </head>
    <body>
      <p>
        foo
      </p>
      <table>
        <tbody>
          <tr>
            <td>
              foo
              <p>
                bar
              </p>
            </td>
          </tr>
        </tbody>
      </table>
      </p>

         error, the element was already closed

    </body>
  </html>

[1] http://www.htmlhelp.com/tools/validator/direct.html
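
The implicit-closing rule can be sketched in a few lines of Python (again
with the standard-library html.parser, not the validator at [1]). The set
NOT_ALLOWED_IN_P is an assumed, incomplete subset of the elements that the
content model of P excludes, and stray_end_tags is a made-up helper name.

```python
from html.parser import HTMLParser

# Assumed subset of elements that cannot appear inside P (illustration only).
NOT_ALLOWED_IN_P = {"p", "table", "div", "ul", "ol", "pre",
                    "h1", "h2", "h3", "h4", "h5", "h6"}


class StrayEndTagFinder(HTMLParser):
    """Collects end tags that have no matching open element."""

    def __init__(self):
        super().__init__()
        self.open = []      # currently open elements, outermost first
        self.stray = []     # names of end tags that matched nothing

    def handle_starttag(self, tag, attrs):
        # An element outside P's content model implicitly closes an open P.
        if tag in NOT_ALLOWED_IN_P and self.open and self.open[-1] == "p":
            self.open.pop()
        self.open.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open:
            self.stray.append(tag)
            return
        # Closing an element also closes anything still open inside it.
        while self.open.pop() != tag:
            pass


def stray_end_tags(markup):
    finder = StrayEndTagFinder()
    finder.feed(markup)
    finder.close()
    return finder.stray
```

Run on the example, the TABLE start tag pops the outer P, the inner P is
closed together with its TD, and the final P end tag is the only one reported
as stray.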
-- 
Björn Höhrmann ^ mailto:[EMAIL PROTECTED] ^ http://www.bjoernsworld.de
am Badedeich 7  Telefon: +49(0)4667/981028  http://bjoern.hoehrmann.de
25899 Dagebüll # PGP Pub. KeyID: 0xA4357E78 # http://learn.to/quote [!]e
-- listen, learn, contribute -- David J. Marcus

HTML::TreeBuilder method/madness. Was: HTML::Tagset: p_closure_barriers

2001-04-07 Thread Sean M. Burke


At 05:25 PM 2001-04-07 +0200, you wrote:
> the HTML::Tagset manual defines the array
> @HTML::Tagset::p_closure_barriers. I don't understand the rationale
> behind this. The given example: [...]
> _isn't_ legal. In SGML, elements that have optional end tags are
> implicitly closed when an element occurs that cannot be contained inside
> them (i.e. that is not part of the content model). Try to
> validate the example at [1] and you'll get 
> ^Error: end tag for element P which is not open; try removing the
> end tag or check for improper nesting of elements

As far as I understand SGML,

A. SGML is a family of markup languages where, when (start|end)-tag
omission is enabled, nothing can be parsed without a DTD.

B. For basically every SGML document, a complete (!) DTD exists, and must
exist (whether externally or internally).

C. Part of SGML parsing involves assuming that the input conforms to the
declared DTD, and trying to find a way (implicitly opening here, implicitly
closing there) to parse in a way that conforms to content models.  But if
this is not possible, then the parser REJECTS the document.

D. Moving here to the user model: people who write SGML do so with a solid
mental model of the DTD, and/or use an editor that restricts them to things
that are DTD-valid.

E. People who write SGML validate their documents, since they know that an
invalid document is to be rejected.

But, observation being the key to scientific discovery, we note that these
points A-E do /not/ describe the current (or past, or any likely future)
situation with HTML.


Moreover, to quote HTML::TreeBuilder:
"Now, this would all work flawlessly and unproblematically if: 1) all the
rules that both prescribe and describe HTML were (and had been) clearly set
out, and 2) everyone was aware of these rules and wrote their code in
compliance to them. 

However, it didn't happen that way, and so most HTML pages are difficult if
not impossible to correctly parse with nearly any set of straightforward
SGML rules. That's why the internals of HTML::TreeBuilder consist of lots
and lots of special cases -- instead of being just a generic SGML parser
with HTML DTD rules plugged in."



So, quod est demonstratum:  HTML is not SGML.
A design goal of HTML was, as Tim Berners-Lee said, to /look/ like SGML.
Well, it worked.  And within a few years, he was telling people "oh, it's
SGML, it's SGML!" but by then it was clearly too late.  HTML is /defined/
as SGML.  But "define" is a complicated concept -- sometimes defining
things is descriptive, sometimes it's prescriptive, sometimes it's just a
rule of thumb, and usually it's underspecified anyway.  Yes, it's not nice
when data formats get this messy, but here again I find the phrase "too
late" applicable.


My design goals in rewriting HTML::TreeBuilder were:
1) it should make good sense of good code.
2) it should make the best of code that's not uniformly good.
3) it does not have the option of rejecting a document.

(So, yes, this is like a natural language problem.  When you're reading
English, and you hit a sentence that your grammar doesn't accept, you don't
yell "ERROR! ERROR! ERROR!" and stop reading.  Well, at least /I/ don't.)


Now, exactly what the HTML parse tree should look like, is another matter.
When (re)developing HTML::TreeBuilder, I looked at a /lot/ of randomly chosen
code off the Web, fed it thru test versions of HTML::TreeBuilder (which I'd
put in debug mode), and when it went wrong, I'd try to make sense of where
TreeBuilder assumptions didn't match what was clearly in the head of the
person designing the document.

Sometimes there was nothing to do but say "that part of the document is
just wrong" and move on, and sometimes it was a matter that the tokenizer
(HTML::Parser) decided on, and so there was little to be done for it in
TreeBuilder.

But now and then, an oddity in parsing would make me think that one of
HTML::TreeBuilder's assumptions was wrong, and I'd have to change it (and
often try several possibilities before finding something that was right on
this point, and didn't mess other things up).

And somewhere in iterating thru that process, I decided that the W3C HTML
DTD's ideas of things were clearly at odds with the mental models of users
who wrote the basically decent code that I was trying to parse; and
moreover, where the DTDs would not reject such documents, they would wildly
misconstrue them -- mostly good code with a bit of bad code in it would make
for a completely crazy parse tree.  And no-one wants that.

So I decided that the DTDs and what they said, were useless for what I was
doing, and I charged on without them.  At points, that meant having to code
up little sets of elements that behaved a certain way, like
@HTML::Tagset::p_closure_barriers.  Often, putting an element in a set made
some parses good, but broke others; while taking it out broke yet more
parses.  In those cases, I tried to make the design goals apply for the
maximal number of input documents.
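
To make the "barrier" idea concrete, here is a small Python sketch. It is
only one reading of the description above, not HTML::TreeBuilder's actual
logic: the barrier set is an assumed subset rather than the real contents of
@HTML::Tagset::p_closure_barriers, the helper name open_p_to_close is made
up, and it assumes for simplicity that only a new P start tag triggers the
search.

```python
# Assumed barrier subset, for illustration only.
P_CLOSURE_BARRIERS = {"td", "th", "li", "blockquote", "table", "tr",
                      "div", "dd", "dt"}


def open_p_to_close(open_elements):
    """Return the index of the open P that a new P should close, or None.

    `open_elements` lists the currently open elements, outermost first.
    The search runs from the innermost element outward and gives up at the
    first barrier, so a P outside a table cell is left alone.
    """
    for index in range(len(open_elements) - 1, -1, -1):
        tag = open_elements[index]
        if tag == "p":
            return index
        if tag in P_CLOSURE_BARRIERS:
            return None
    return None
```

With ["html", "body", "p", "table", "tr", "td"] open, the function returns
None: a new P inside the cell does not close the outer P, so a later P end
tag after the table still finds something to close. With
["html", "body", "p", "b"] open it returns 2, and the new P closes the old
one as usual.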

Re: Problems submitting form with form-click (repost)

2001-04-07 Thread Steve Borruso


> What does $form->click by itself return?  (The docs are not clear).
>
> print STDERR $form->click;

It prints -  HTTP::Request=HASH(0x34cd40)

I don't understand what this is supposed to be telling me. Seems a bit
cryptic.
