On Fri, 2006-12-08 at 12:41 +0100, Andrzej Bialecki wrote:
> Yes, most likely. Running complex regexes on hostile data, such as 
> unknown URLs, quite often ends up like this - that's why many 
> Internet-wide installations don't use regexes but combinations of 
> prefix/suffix/custom filters. If you were running the fetcher in 
> non-parsing mode, this wouldn't happen during fetching but during 
> parsing - and you could've changed your config and restarted just the 
> parsing, without refetching ... ah well.
> 
> Anyway - it's most likely not hung, but runs very, very slowly. You 
> could give it a chance and let it run a few hours more, perhaps it will 
> go past these troublesome urls, and keep watching the size of temporary 
> data - if the files are not growing at all, then I'm afraid you will 
> have to kill the job, and avoid your boss for a couple of days ... :/
> 
> (By the way, one can encounter most weird things in the wild ... I've 
> seen URLs that are several kilobytes long, containing all sorts of 
> illegal characters, containing nested unescaped URLs with invalid 
> protocols and so on ... so, when crawling the Internet at large you 
> should be prepared for getting really nasty stuff. Complex regexes don't 
> cut it).


I see, thanks. Ah well. The scope of my regex is simply
glob("http://*.uk/*"). What filters would you recommend for doing this?
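
For illustration, a scope like this can be approximated without any regex
by checking the scheme and the host suffix directly. The following is only
a minimal Python sketch of the suffix-filter idea, not Nutch's filter API;
the function name in_uk_scope is hypothetical, and it is slightly stricter
than the glob because it looks at the host only, not the whole URL string.

```python
from urllib.parse import urlsplit


def in_uk_scope(url: str) -> bool:
    """Return True if the URL uses http and its host ends in .uk.

    Hypothetical helper for illustration: plain string checks take time
    linear in the URL length, so a hostile URL cannot trigger the
    backtracking blow-up a complex regex can.
    """
    try:
        parts = urlsplit(url)
        host = parts.hostname  # lowercased by urllib, None if absent
    except ValueError:
        # Malformed input (e.g. an unterminated IPv6 bracket): reject it.
        return False
    if parts.scheme != "http" or host is None:
        return False
    # Tolerate a trailing dot on fully qualified host names.
    return host.rstrip(".").endswith(".uk")
```

A URL such as http://www.example.com/foo.uk/ is rejected here because the
suffix test is applied to the host, not to the path.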

I'm guessing my use-case is pretty much the same as everyone else's -
people who want everything from a domain. Is it wise to ship with the
regex urlfilter as the default filter?

Anyway, any help would be great. I'll keep an eye on the temp data. If
it rises I'll probably leave it going.

Thanks

-Rob


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general