Hi,

What I actually want is to crawl a web page, say 'page A', and all of its
outlinks. I want to index the content gathered by crawling those outlinks,
but not 'page A' itself.
Is there any way to do this in a single run?

With regards,

Beats
be...@yahoo.com



SunGod wrote:
> 
> 1. Create the work dir 'test' first
> 
> 
> 2. Inject the seed URLs
> ../bin/nutch inject test -urlfile urls
> 
> 3. Create a fetchlist
> ../bin/nutch generate test test/segments
> 
> 4. Fetch the URLs
> s1=`ls -d test/segments/2* | tail -1`
> echo $s1
> ../bin/nutch fetch $s1
> 
> 5. Update the crawldb
> ../bin/nutch updatedb test $s1
> 
> Loop over steps 3-5; writing a bash script to run them is best (a minimal sketch follows).
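> 
> A minimal sketch of such a script, assuming the work dir is 'test', the seed
> URLs have already been injected as in steps 1-2, and nutch lives at
> ../bin/nutch; the DEPTH count is just an illustrative choice:
> 
> #!/bin/bash
> # Repeat generate -> fetch -> updatedb for a fixed number of rounds.
> DEPTH=3                                      # illustrative number of rounds
> for i in $(seq 1 $DEPTH); do
>   ../bin/nutch generate test test/segments   # step 3: create a fetchlist
>   s1=`ls -d test/segments/2* | tail -1`      # pick the newest segment
>   echo "fetching $s1"
>   ../bin/nutch fetch $s1                     # step 4: fetch the URLs
>   ../bin/nutch updatedb test $s1             # step 5: update the crawldb
> done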
> 
> Next time, please try a Google search first.
> 
> 2009/7/13 Beats <tarun_agrawal...@yahoo.com>
> 
>>
>> Can anyone help me with this?
>>
>> I am using Solr to index the Nutch documents,
>> so I think the prune tool will not work.
>>
>> I do not want to index documents taken from a particular set of sites.
>>
>> With regards, Beats
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/how-to-crawl-a-page-but-not-index-it-tp24437901p24478530.html
Sent from the Nutch - User mailing list archive at Nabble.com.
