Thanks for the response. I've already played around with different depths, generally from 3 to 10, and have seen no distinguishable difference in the results. Furthermore, I've tried running the crawl both with the topN flag set and with it omitted, again with little difference. Any more ideas?
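For what it's worth, here's how I've been sanity-checking how many URLs actually end up in the crawldb after each run (a minimal sketch; I'm assuming the output directory is crawl-dir, matching the command quoted below):

  # print crawldb statistics: TOTAL urls, fetched/unfetched counts, scores
  bin/nutch readdb crawl-dir/crawldb -stats

The TOTAL urls figure should show whether the missing links are being discovered at all or dropped before they reach the db.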
Ratnesh, V2Solutions India wrote:
>
> Hi,
> It may be that the depth you specify is not able to reach the
> desired page link, so try adjusting the settings related to depth and
> threads at crawl time, for example:
>
> bin/nutch crawl urldir -dir crawl-dir -depth 20 -threads 10 -topN 50
>
> Try increasing these values; you might get a better result.
> If I hear of any updates regarding this, I will let you know.
>
> Thanks
>
>
> Mike Howarth wrote:
>>
>> I was wondering if anyone could help me.
>>
>> I'm currently trying to get Nutch to crawl a site I have. At the moment
>> I'm pointing Nutch at the root URL, e.g. http://www.example.com
>>
>> I know that I have over 130 links on the index page; however, Nutch is
>> only finding 87 links. It appears that Nutch stops crawling, and
>> hadoop.log doesn't give any indication of why this may occur.
>>
>> I've amended my crawl-urlfilter.txt to look like this:
>>
>> # The url filter file used by the crawl command.
>>
>> # Better for intranet crawling.
>> # Be sure to change MY.DOMAIN.NAME to your domain name.
>>
>> # Each non-comment, non-blank line contains a regular expression
>> # prefixed by '+' or '-'. The first matching pattern in the file
>> # determines whether a URL is included or ignored. If no pattern
>> # matches, the URL is ignored.
>>
>> # skip file:, ftp:, & mailto: urls
>> -^(file|ftp|mailto):
>>
>> # skip image and other suffixes we can't yet parse
>> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|js)$
>>
>> # skip URLs containing certain characters as probable queries, etc.
>> -[?*!@=]
>>
>> # skip URLs with slash-delimited segment that repeats 3+ times, to break
>> # loops
>> #-.*(/.+?)/.*?\1/.*?\1/
>>
>> # accept hosts in MY.DOMAIN.NAME
>> -^https:\/\/.*
>> +.
>>
>> # skip everything else
>> #-^https://.*
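P.S. One more thing I've been doing to rule the URL filter in or out: running individual URLs through the regex filter by hand. This is a sketch, assuming a 0.8/0.9-era Nutch where bin/nutch has a "plugin" command and RegexURLFilter's main() reads URLs from stdin, echoing each back prefixed with '+' (accepted) or '-' (rejected). Run this way, the filter loads the file named by the urlfilter.regex.file property (regex-urlfilter.txt by default), so that property needs to point at crawl-urlfilter.txt first. The page path below is made up:

  # feed a URL to the regex filter; output is +URL (kept) or -URL (dropped)
  echo "http://www.example.com/some/page.html" | \
    bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter

If any of the missing links come back with a '-' prefix, the filter rather than the depth is dropping them.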
