Sami Siren-2 wrote:
> 
> 
> Do you have urls of such bad content available to look at?
> 
> 

Thousands. Here is one:

http://www.valtravieso.com/ver_finca.phtml?idioma=1

The hrefs that contain "&sub" as a query parameter get interpreted by
tagsoup as the subset character entity (&sub;), and thus become broken
links. On a few sites (and I think this is one) the number of URLs will
grow ad infinitum if the site handles the "broken link" by returning a
page that works and uses the input link as a base.
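
To make the failure mode concrete, here is a rough sketch in Python. It is
not tagsoup's actual code; it only mimics the behaviour described above under
the assumption that the parser decodes named entities even when the closing
";" is missing. The href value and both function names are made up for
illustration.

```python
import re
from html.entities import name2codepoint

# Hypothetical href in the style of the links on the page above.
HREF = "ver_finca.phtml?idioma=1&sub=2"


def _decode(text, pattern):
    """Replace every known named entity matched by `pattern`."""
    def repl(match):
        name = match.group(1)
        if name in name2codepoint:
            return chr(name2codepoint[name])
        # Unknown name: leave the text untouched.
        return match.group(0)
    return re.sub(pattern, repl, text)


def lenient_decode(text):
    """Assumed lenient behaviour: the trailing ';' is optional."""
    return _decode(text, r"&([A-Za-z][A-Za-z0-9]*);?")


def strict_decode(text):
    """Only decode entities that are terminated by ';'."""
    return _decode(text, r"&([A-Za-z][A-Za-z0-9]*);")


if __name__ == "__main__":
    print(lenient_decode(HREF))  # "&sub" is replaced by U+2282
    print(strict_decode(HREF))   # href is left as written
```

With the lenient variant the query parameter "sub" disappears from the URL
and a subset sign takes its place, which is the kind of broken link that a
forgiving server can then turn into an ever-growing set of new URLs.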

I believe I have some examples of Neko problems around as well; I've been
gathering test cases...

 -Doug
-- 
View this message in context: 
http://www.nabble.com/Anyone-looked-for-a-better-HTML-parser--tf4630266.html#a13235164
Sent from the Nutch - Dev mailing list archive at Nabble.com.