Hi Alexander
I'm a wee bit lazy so I just run all my HTML text through Tidy (added
as a PHP extension) and it's a consistent base to start from. I
realise this isn't going to be possible in all environments but it
may be a good idea to check if it exists and 'sanitise' the HTML
input with
Hi Simon,
There has been no HTML document parsing/indexing capability in Zend_Search
up to now. But it's the most common format on the Internet :)
It's experimental now, so it's not documented and I didn't make any
announcement :)
I'm considering what should be used for this.
1) Pure PHP parser gives possi
Hi Alexander
Just noticed a new HTML document component in Zend_Search. Is this
the start of the killer ZF-powered spider? :)
Would be very keen to know how you intended to use it as I've
implemented a spider of sorts that can parse HTML and PDF files but
is probably a little limited in i