In theory, because XHTML is an application of XML (HTML reformulated as XML), any XML parser *should* be able to parse well-formed XHTML with no problem. HTML is more complicated because it does not have to be well formed. I am not sure whether there is a single module that handles all types. The easy way, I would think, is to fetch the source and check the top of the document for an XML declaration (<?xml ...?>) or an XHTML DOCTYPE. If you find one, parse with your XML module; otherwise assume it is HTML and parse with one of the HTML:: modules. Keep in mind that the XML declaration is optional in XML 1.0, and that plenty of pages claiming to be XHTML are not actually well formed, so it is worth falling back to the HTML parser if the XML parse fails.
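
The check described above could be sketched roughly like this. It is only a minimal illustration in core Perl: guess_markup_type is a hypothetical helper name of my own, not something provided by any module, and the 1024-character window for the DOCTYPE search is an arbitrary assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): guess whether fetched markup
# should be handed to an XML parser or an HTML parser by looking at its prolog.
sub guess_markup_type {
    my ($source) = @_;

    # Skip an optional byte order mark and any leading whitespace.
    $source =~ s/\A(?:\xEF\xBB\xBF|\x{FEFF})?\s*//;

    # An XML declaration, when present, comes before everything else.
    return 'xml' if $source =~ /\A<\?xml\b/;

    # Otherwise look for an XHTML DOCTYPE near the top of the document.
    # The 1024-character window is an assumption, not a rule.
    my $head = substr($source, 0, 1024);
    return 'xml' if $head =~ /<!DOCTYPE\s+html\s+PUBLIC\s+"-\/\/W3C\/\/DTD XHTML/i;

    # Anything else is treated as plain (possibly malformed) HTML.
    return 'html';
}

print guess_markup_type(qq{<?xml version="1.0"?>\n<html/>}), "\n";   # prints "xml"
print guess_markup_type("<html><body><p>unclosed</body>"), "\n";     # prints "html"
```

In a real spider, the 'xml' branch would go to something like XML::Parser and the 'html' branch to HTML::Parser or HTML::TreeBuilder. Since XML::Parser dies on input that is not well formed, wrapping that call in eval and falling back to the HTML parser seems like the safer design.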

http://danconia.org
(On a side note, you can use the first page at this location as a test if you wish. It is well-formed XHTML; I am still working on the rest of the site ;-))

Octavian Rasnita wrote:
Hi all,

I've seen several modules like HTML::... and others that understand the
structure of an HTML document.

If I want to create a web spider that parses many web pages, how can I parse
them if they are in diverse formats?
Some of them might use the old HTML format, others XHTML, ... and
so on.

Do these modules (HTML::...) understand the structure of all those file
formats?

If not, is there any module that does this?
If not, should I use several modules and call methods from each of them?

I am a little bit confused.

Thank you for any hints.

Teddy,
Teddy's Center: http://teddy.fcc.ro/
Email: [EMAIL PROTECTED]



