Re: [Nutch-general] Limiting crawl to specific list of URLS

Nitin Borwankar Wed, 29 Nov 2006 15:40:32 -0800

Kevvin Sevvvin wrote:

>
> Hi Everybody,
>
> I'm real new to Nutch. I've read through the documentation and many
> months of mailing list archives and I don't think this question has
> been answered.
>
> I have two tasks I would like Nutch to handle. I would like it to
> crawl and index ONLY a specific set of urls. This is a stronger
> limitation than confining to specific sites (so
> db.ignore.external.links is insufficient): it should not follow ANY
> links on pages in the list of urls.
>
> Secondly, after creating the crawl and index of specific sites, I
> would like to occasionally add SINGLE urls to the index.
>
> Is this possible? If so, is it trivially possible with something like
> '--topN 0' (or should that be '--topN 1' ??) ? Or could I create a
> single local web page with all the links on it and run the crawler
> with '-depth 1' ?
>
> Apologies if this is an overasked or misguided question; if so I'd
> appreciate pointers to appropriate documentation or code so I can
> figure it out on my own.
>
> Thanks!
> -k7


Hi Kevin,

I am a relative newbie to Nutch as well.
I believe you are looking for -depth 1, which should fetch only the URLs
in your list and not follow the links on those pages.
I have used -depth (though not with the value 1) on version 0.7.2.
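
As a rough illustration of the depth-1 idea, here is a minimal Python
sketch that writes the URL list to a seed file and assembles the crawl
command. It assumes the one-shot "bin/nutch crawl" command with the
-dir and -depth options as written in the Nutch tutorial of that era;
the helper names, the seeds.txt file name, and the "urls" and "crawl"
paths are made up for the example. Depending on the Nutch version, the
first argument may need to be the seed file itself rather than the
directory that holds it. The sketch only prints the command; it does
not run Nutch.

```python
from pathlib import Path


def write_seed_list(urls, seed_dir):
    """Write one URL per line to seed_dir/seeds.txt and return the file path.

    The file name seeds.txt is an arbitrary choice for this sketch.
    """
    seed_dir = Path(seed_dir)
    seed_dir.mkdir(parents=True, exist_ok=True)
    seed_file = seed_dir / "seeds.txt"
    # Drop blank entries and duplicates while keeping the original order.
    unique = list(dict.fromkeys(u.strip() for u in urls if u.strip()))
    seed_file.write_text("\n".join(unique) + "\n")
    return seed_file


def build_crawl_command(seed_path, crawl_dir, depth=1):
    """Return the argument list for a one-shot crawl limited to `depth` rounds.

    With depth=1 only the seed URLs themselves should be fetched.
    Assumes the command is launched from the Nutch install directory.
    """
    return ["bin/nutch", "crawl", str(seed_path),
            "-dir", str(crawl_dir), "-depth", str(depth)]


if __name__ == "__main__":
    # Placeholder URLs; replace with the real list.
    write_seed_list(["http://example.com/page1.html",
                     "http://example.com/page2.html"], "urls")
    print(" ".join(build_crawl_command("urls", "crawl")))
```

The printed command can then be run by hand from the Nutch directory.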


Nitin Borwankar
http://tagschema.com



