Re: [Nutch-general] Nutch Crawler (.81) picking up strange links

Dennis Kubes Fri, 12 Jan 2007 13:45:13 -0800

It may be that JavaScript links are getting picked up.  If you take out 
just the js part of parse-(text|html|js) in your nutch-site.xml file, so 
that it reads parse-(text|html), then the JavaScript parse plugin won't 
be loaded and Nutch will not try to extract links from JavaScript.
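
For reference, the setting lives in the plugin.includes property.  Here is 
a rough sketch of the override in nutch-site.xml.  Everything in the value 
other than the parse-(text|html) part is only what I believe the stock 
0.8 value looks like, so treat it as an assumption: keep whatever your own 
file (or nutch-default.xml) already lists and change only the parse- entry.

```xml
<!-- conf/nutch-site.xml: override plugin.includes without parse-js.
     The surrounding plugin names are assumed from the stock config;
     copy your existing value and edit only the parse-(...) part. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
</property>
```

You will need to re-run the crawl for the change to take effect, since 
the links that were already extracted stay in the existing crawl db.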


You may want to also limit file types with prefix, suffix, or regex 
filters.  Let me know if you need to know more about how to do that.
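
As a starting point, here is a sketch of what regex filter rules could 
look like.  I am assuming the tutorial crawl reads conf/crawl-urlfilter.txt 
(the step-by-step tools use conf/regex-urlfilter.txt instead), and the 
suffix list and character class below are only illustrations, not the 
stock rules.  Each line is a + (accept) or - (reject) followed by a regex, 
and the first matching rule wins, so these need to come before your 
accept rule.

```
# skip URLs that end in file types you don't want to fetch (example list)
-\.(js|css|gif|jpg|png|zip|gz|exe)$

# skip URLs containing characters typical of JavaScript source
-[{}()+,]
```

Be aware that the second rule also rejects legitimate URLs that contain 
parentheses or commas, which MediaWiki page titles sometimes do, so you 
may want to narrow the character class for your site.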

Dennis Kubes

Steve Kallestad wrote:
> I've implemented nutch as a site search to try it out.
> 
> When I crawl my own site with nutch, I end up with a strange set of links:
> 
> downloads/}).21()}),cr:(g(t){t.8.22().1f({2Y:(t.1i[0]-t.8.4u)+
> downloads/+aa[6u].ib().ia()+
> downloads/).30(/\\s+$/,
> 
> The list is huge, but it's a lot of the same.
> 
> I suspect that the links are coming from MediaWiki, but up until now I
> haven't seen any such links in my error logs.  It also makes the crawl take
> much longer than is really necessary.
> 
> I'm running the tutorial crawl.
> 
