Hi,

This may be because the depth you specified is too small to reach the
page links you want, so try adjusting the depth and threads settings when
you run the crawl.

For example:

crawl -d urldir -dir crawl-dir -depth 20 -threads 10 -topN 50

Try increasing these values; you may get better results.
If I find out anything more about this, I will let you know.
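
To illustrate why these values can matter, here is a rough Python sketch (not
Nutch code). It assumes that -depth is the number of fetch rounds and that
-topN caps how many pages are fetched in each round; that is my understanding
of the crawl command, not something I have checked in the source. The function
simulate_crawl and the toy site are made up for the example, and pages are
taken in discovery order rather than by score.

```python
def simulate_crawl(link_graph, seed, depth, top_n):
    """Return the set of pages fetched in `depth` rounds of at most `top_n` pages each."""
    fetched = set()
    frontier = [seed]  # known but not yet fetched URLs, in discovery order
    for _ in range(depth):
        batch = frontier[:top_n]
        if not batch:
            break  # nothing left to fetch
        fetched.update(batch)
        remaining = frontier[top_n:]
        known = fetched | set(remaining)
        # Add links discovered on the pages fetched in this round.
        for page in batch:
            for link in link_graph.get(page, []):
                if link not in known:
                    known.add(link)
                    remaining.append(link)
        frontier = remaining
    return fetched


# Hypothetical site: an index page linking to 130 leaf pages.
site = {"index": ["page%d" % i for i in range(130)]}

for depth, top_n in [(1, 1000), (2, 50), (2, 1000), (4, 50)]:
    pages = simulate_crawl(site, "index", depth, top_n)
    print("depth=%d topN=%d -> %d pages fetched" % (depth, top_n, len(pages)))
```

Under these assumptions, a small depth or a small topN leaves some of the
index page's links unfetched, which is why raising both may help.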

Thanks


Mike Howarth wrote:
> 
> I was wondering if anyone could help me.
> 
> I'm currently trying to get nutch to crawl a site I have. At the moment
> I'm pointing nutch at the root url, e.g. http://www.example.com
> 
> I know that I have over 130 links on the index page, however nutch is only
> finding 87 links. It appears that nutch stops crawling, and the hadoop.log
> doesn't give any indication why this may occur.
> 
> I've amended my nutch-crawl to look like this:
> 
> # The url filter file used by the crawl command.
> 
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
> 
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.  The first matching pattern in the file
> # determines whether a URL is included or ignored.  If no pattern
> # matches, the URL is ignored.
> 
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
> 
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|js)$
> 
> # skip URLs containing certain characters as probable queries, etc.
> [EMAIL PROTECTED]
> 
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> #-.*(/.+?)/.*?\1/.*?\1/
> 
> # accept hosts in MY.DOMAIN.NAME
> -^https:\/\/.*
> +.
> 
> # skip everything else
> #-^https://.*
> 
> 
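
It may also be worth checking which of the index page's links the filter
quoted above would reject. The rough Python sketch below (not Nutch code)
follows the rule stated in the file's own comments: the first matching pattern
decides, and a URL that matches nothing is ignored. It assumes a pattern
"matches" when it is found anywhere in the URL, and it leaves out the one rule
whose pattern the archive has obscured. The function name accepts and the
example URLs are made up for illustration.

```python
import re

# (sign, pattern) pairs in file order; the rule with the obscured pattern is omitted.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|js)$"),
    ("-", r"^https:\/\/.*"),
    ("+", r"."),
]


def accepts(url, rules=RULES):
    """Return True if the first rule that matches `url` is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched, so the URL is ignored


# Hypothetical links such as an index page might contain.
for url in [
    "http://www.example.com/about.html",
    "https://www.example.com/login",
    "http://www.example.com/logo.png",
    "mailto:someone@example.com",
]:
    print(url, "accepted" if accepts(url) else "rejected")
```

Running the real list of links through something like this would show whether
the missing ones are being filtered out rather than missed by the crawl.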

-- 
View this message in context: 
http://www.nabble.com/Crawl-not-crawling-entire-page-tf3446522.html#a9611824
Sent from the Nutch - User mailing list archive at Nabble.com.

