Thanks for the response. I've already played around with different depths, generally from 3 to 10, and have seen no distinguishable difference in the results. Furthermore, I've tried running the crawl both with the topN flag set and with it omitted, again with little difference. Any more ideas?
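For what it's worth, here's how I've been sanity-checking how many URLs actually end up in the crawldb after each run (a minimal sketch; I'm assuming the output directory is crawl-dir, matching the command quoted below):

  # print crawldb statistics: TOTAL urls, fetched/unfetched counts, scores
  bin/nutch readdb crawl-dir/crawldb -stats

The TOTAL urls figure should show whether the missing links are being discovered at all or dropped before they reach the db.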
Ratnesh, V2Solutions India wrote:
>
> Hi,
> It may be that the depth you specify is not able to reach the
> desired page link, so try adjusting the settings related to depth and
> threads at crawl time, for example:
>
> bin/nutch crawl urldir -dir crawl-dir -depth 20 -threads 10 -topN 50
>
> Try increasing these values; you might get a better result.
> If I hear of any updates regarding this, I will let you know.
>
> Thanks
>
>
> Mike Howarth wrote:
>>
>> I was wondering if anyone could help me.
>>
>> I'm currently trying to get Nutch to crawl a site I have. At the moment
>> I'm pointing Nutch at the root URL, e.g. http://www.example.com
>>
>> I know that I have over 130 links on the index page; however, Nutch is
>> only finding 87 links. It appears that Nutch stops crawling, and
>> hadoop.log doesn't give any indication of why this may occur.
>>
>> I've amended my crawl-urlfilter.txt to look like this:
>>
>> # The url filter file used by the crawl command.
>>
>> # Better for intranet crawling.
>> # Be sure to change MY.DOMAIN.NAME to your domain name.
>>
>> # Each non-comment, non-blank line contains a regular expression
>> # prefixed by '+' or '-'. The first matching pattern in the file
>> # determines whether a URL is included or ignored. If no pattern
>> # matches, the URL is ignored.
>>
>> # skip file:, ftp:, & mailto: urls
>> -^(file|ftp|mailto):
>>
>> # skip image and other suffixes we can't yet parse
>> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|js)$
>>
>> # skip URLs containing certain characters as probable queries, etc.
>> -[?*!@=]
>>
>> # skip URLs with slash-delimited segment that repeats 3+ times, to break
>> # loops
>> #-.*(/.+?)/.*?\1/.*?\1/
>>
>> # accept hosts in MY.DOMAIN.NAME
>> -^https:\/\/.*
>> +.
>>
>> # skip everything else
>> #-^https://.*
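P.S. One more thing I've been doing to rule the URL filter in or out: running individual URLs through the regex filter by hand. This is a sketch, assuming a 0.8/0.9-era Nutch where bin/nutch has a "plugin" command and RegexURLFilter's main() reads URLs from stdin, echoing each back prefixed with '+' (accepted) or '-' (rejected). Run this way, the filter loads the file named by the urlfilter.regex.file property (regex-urlfilter.txt by default), so that property needs to point at crawl-urlfilter.txt first. The page path below is made up:

  # feed a URL to the regex filter; output is +URL (kept) or -URL (dropped)
  echo "http://www.example.com/some/page.html" | \
    bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter

If any of the missing links come back with a '-' prefix, the filter rather than the depth is dropping them.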
