Hi,

> I am only interested in the internal links.

Then db.ignore.external.links = true is correct.
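For reference, a sketch of how the two link-scoping properties would look in nutch-site.xml for an internal-links-only crawl (the property names are the standard Nutch ones; the values reflect the setup discussed here):

```xml
<!-- Drop outlinks that point to a different host. -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>

<!-- Keep outlinks within the same host. -->
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>
```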
From a distance it is impossible to decide what is going wrong. At first glance all seems ok, except one thing: plugin.includes contains "scoring-optic", which should be "scoring-opic". I doubt that is the reason, though. For a finer analysis, more details are required:
- URL filters and normalizers: are the desired URLs accepted?
- CustomFetchSchedule.java: shouldFetch() may play a role.

You can try to find the reason yourself:

% bin/nutch parsechecker "http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1"

Are all desired outlinks extracted by the parser?

(after a fetch of the start URL)
% bin/nutch readdb .../crawldb -dump crawldb_dump
% less crawldb_dump/part-*

Are they in the CrawlDb?

Cheers,
Sebastian

On 10/13/2013 04:18 AM, S.L wrote:
> Hello All,
>
> I am facing a problem with the URL
> http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1 . This URL has
> many internal links on the page and also many external links to other
> domains; I am only interested in the internal links.
>
> However, when this page is crawled, its internal links are not added
> for fetching in the next round (I have given a depth of 100). I have
> already set db.ignore.internal.links to false, but for some reason the
> internal links are not getting added to the next fetch list.
>
> On the other hand, if I set db.ignore.external.links to false, it
> correctly picks up all the external links from the page.
>
> This problem is not present with any other domain. Can someone tell me
> what it is with this particular page?
>
> I have also attached the nutch-site.xml that I am using, for your
> review. Please advise.
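One classic reason why outlinks from exactly this kind of URL disappear is the URL filter: the default conf/regex-urlfilter.txt ships with the rule `-[?*!@=]`, which rejects any URL containing probable query characters, and ".../?_rdc=1" contains both "?" and "=". Below is a minimal sketch of the filter's first-match-wins semantics; the two rules shown are a simplified stand-in for the real file, not a copy of your configuration:

```shell
# Sketch: emulate Nutch's regex-urlfilter first-match semantics for one URL.
url='http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1'

# Simplified stand-in for conf/regex-urlfilter.txt (your real file may differ):
rules='-[?*!@=]
+.'

check_url() {
  local u="$1"
  while IFS= read -r rule; do
    sign=${rule:0:1}
    pattern=${rule:1}
    if printf '%s' "$u" | grep -Eq -- "$pattern"; then
      echo "$sign"    # first matching rule decides: + accept, - reject
      return
    fi
  done <<< "$2"
  echo '-'            # no rule matched: rejected
}

check_url "$url" "$rules"    # prints '-': the query string gets the URL rejected
```

If this is the cause, the fix is to adjust (or remove) that rule in regex-urlfilter.txt so that the query-string URLs you actually want are accepted.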

