Hi Alexander
I'm a wee bit lazy, so I just run all my HTML text through Tidy (added
as a PHP extension) and get a consistent base to start from. I
realise this isn't going to be possible in all environments, but it
may be a good idea to check whether it exists and 'sanitise' the HTML
input with it first. A fallback would be to perform some sort of
regex check on the HTML to see whether it can be parsed, but not to the
extent of DOCTYPE checking for HTML/XHTML. I imagine that true HTML
parsing would be a nightmare in itself.
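For reference, the Tidy repair step is only a couple of lines with the
tidy extension; a minimal sketch (falling back to the untouched input
when the extension is missing is my own choice, not a Tidy default):

```php
<?php
// Normalise markup with the tidy extension when available; otherwise
// return the input untouched so the caller can fall back to regexes.
function sanitiseHtml($html)
{
    if (!extension_loaded('tidy')) {
        return $html;
    }
    $config = array(
        'output-xhtml' => true,  // emit well-formed XHTML
        'wrap'         => 0,     // don't re-wrap long lines
    );
    $clean = tidy_repair_string($html, $config, 'utf8');
    return ($clean !== false) ? $clean : $html;
}
```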
Having said that, I'd say that if a document is so broken as to be
unreadable to regex checks then it doesn't deserve indexing. Regex
checking, IMO, is going to be somewhat more flexible than using DOM (as
not all HTML docs are going to be XHTML compliant) and is likely to
have better support. It's quick enough, too.
I use this to grab all links from within a doc:-
function _extractLinks($body)
{
    // href="" on anchor-type tags
    preg_match_all('/<(?:a|link|area)[^>]+href\s*=\s*"([^"]+)"/is',
        $body, $hrefMatches);
    // src="" on embedded-resource tags
    preg_match_all('/<(?:img|script|input|i?frame)[^>]+src\s*=\s*"([^"]+)"/is',
        $body, $srcMatches);
    // background="" on body/table tags
    preg_match_all('/<(?:body|table|td|th)[^>]+background\s*=\s*"([^"]+)"/is',
        $body, $bgMatches);

    $links   = array();
    $matches = array_map('trim', array_merge($hrefMatches[1],
        $srcMatches[1], $bgMatches[1]));
    foreach ($matches as $link) {
        // internal check for URI well-formedness
        if (($link = $this->_parseLink($link)) !== false) {
            array_push($links, $link);
        }
    }
    return $links;
}
and then use internal MIME-type checking to work out what is/isn't
parsable. $body is the entire text of a doc, not simply the
... content.
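The MIME-type gate can be as simple as a whitelist keyed on the
Content-Type header; the function name and the list below are
illustrative, not my actual internal check:

```php
<?php
// Decide whether a fetched document is worth handing to the HTML/PDF
// parsers, based on its Content-Type header (parameters stripped).
function isParsableMimeType($contentType)
{
    static $parsable = array(
        'text/html',
        'application/xhtml+xml',
        'text/plain',
        'application/pdf',
    );
    // "text/html; charset=utf-8" -> "text/html"
    list($type) = explode(';', $contentType);
    return in_array(strtolower(trim($type)), $parsable);
}
```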
Hmmm... actually, looking at this code it deserves some refinement,
but that's for another day... It could be a start, though.
I'd imagine that instead of returning an array of URIs it may be
better to return an iterator containing Zend_Uri objects?
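Something along these lines, perhaps (just a sketch: plain strings
stand in for the Zend_Uri objects so it runs without ZF loaded; the
real thing would build each entry via Zend_Uri::factory()):

```php
<?php
// Return links as an iterator instead of a bare array. Plain strings
// stand in for Zend_Uri objects here so the sketch runs without ZF;
// swapping in Zend_Uri::factory($link) would give the typed version.
function linksToIterator(array $links)
{
    $uris = array();
    foreach ($links as $link) {
        $link = trim($link);
        if ($link !== '') {
            $uris[] = $link; // Zend_Uri::factory($link) in ZF proper
        }
    }
    return new ArrayIterator($uris);
}
```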
I don't think it's really the responsibility of Zend_Search to pre-
check the validity of an HTML document, anyway. If it's created and
fed text then it should assume that the HTML is OK and do its duty.
Hope these random thoughts help - look forward to your next step
Hi Simon,
There was no HTML document parsing/indexing capability in
Zend_Search until now - yet it's the most common format on the
Internet :)
It's experimental for now, so it's not documented and I haven't made
any announcement :)
I'm considering what should be used for this.
1) A pure PHP parser gives us the possibility to implement exactly
what we want. The question is performance, if we plan to index a lot
of documents.
2) DOM HTML parsing functionality. It's good and fast. It also
allows using XPath expressions to retrieve any part of a document.
But its parsing behaviour is not under our control, e.g. it doesn't
recognise document encoding in some cases.
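To illustrate option 2, the ext/dom route is only a few lines (a
sketch: loadHTML() copes with tag soup, though as noted its encoding
detection is unreliable unless the document declares one):

```php
<?php
// Extract hrefs with DOM + XPath instead of regexes. loadHTML() copes
// with broken markup; the @ silences its (often numerous) warnings.
function extractHrefsViaDom($html)
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $links = array();
    $query = '//a[@href] | //link[@href] | //area[@href]';
    foreach ($xpath->query($query) as $node) {
        $links[] = $node->getAttribute('href');
    }
    return $links;
}
```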
3) Regexes.
I'm somewhat sceptical about these.
There are some non-trivial cases, like encoding recognition at parse
time, non-matched tags, scripts, the '<' sign within script strings,
escaped quotes, double vs. single quote usage and so on.
I've never seen a bug-free regex which handles all these things
correctly. :(
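A concrete example of the script-content problem: a naive href regex
happily extracts a "link" that only exists inside a JavaScript string
(both the regex and the sample markup below are just a demonstration):

```php
<?php
// A naive href regex cannot tell real markup from markup-shaped text
// inside a <script> block, so it extracts a link that a browser would
// never render.
$html = '<script>document.write(\'<a href="http://bogus.example/">x</a>\');</script>'
      . '<a href="http://real.example/">ok</a>';

preg_match_all('/<a[^>]+href\s*=\s*"([^"]+)"/is', $html, $m);
// $m[1] === array('http://bogus.example/', 'http://real.example/')
```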
Any ideas?
With best regards,
Alexander Veremyev.
Simon Mundy wrote:
Hi Alexander
Just noticed a new HTML document component in Zend_Search. Is this
the start of the killer ZF-powered spider? :)
Would be very keen to know how you intend to use it, as I've
implemented a spider of sorts that can parse HTML and PDF files,
but it's probably a little limited in scope. And I use regexes
instead of the cleaner DOM library approach that you have...
Cheers
--
Simon Mundy | Director | PEPTOLAB
""" " "" "" "" "" """ " "" " " " " "" "" "
202/258 Flinders Lane | Melbourne | Victoria | Australia | 3000
Voice +61 (0) 3 9654 4324 | Mobile 0438 046 061 | Fax +61 (0) 3
9654 4124
http://www.peptolab.com