I suspect you may be better off asking this on the Nutch user list. The decisions you are describing are made within the Nutch codebase, not Solr. Hopefully someone here knows, but you are likely to get more support on the Nutch list.
One suggestion: start with a clean, empty index. Run a crawl. Then compare maxDocs with numDocs (both visible in the admin UI for your core/collection). If maxDocs > numDocs, some docs have been overwritten, i.e. the ID field that Nutch is using is not unique.

Upayavira

On Mon, Sep 28, 2015, at 10:19 AM, Daniel Holmes wrote:
> Hi,
> I am using Apache Nutch 1.7 to crawl and Apache Solr 4.7.2 for indexing.
> In my tests there is a gap between the number of results fetched by Nutch
> and the number of documents indexed in Solr. For example, one of the crawls
> fetched 23343 pages and 1146 images successfully, while in Solr 19250
> docs are indexed and 500 of them are image URLs.
>
> My question is: what kind of pages are indexed in Solr, and why?
> Does Solr index pages with other statuses or not?
> What kind of images does Solr index?
>
> Thanks.
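
In case it helps, here is a rough sketch of the maxDocs vs numDocs comparison suggested above, done from a script rather than the admin UI. It assumes the core answers the Luke request handler at /admin/luke, which reports the counts as numDocs and maxDoc. The core URL and the function names are placeholders of mine, not anything from Nutch or Solr. Note that the difference only counts overwritten or deleted docs that have not yet been merged away, which is why starting from an empty index matters.

```python
import json
from urllib.request import urlopen


def fetch_index_info(core_url):
    """Return the "index" section of the Luke handler response for a core.

    core_url is a placeholder such as "http://localhost:8983/solr/collection1".
    """
    # numTerms=0 skips per-field term statistics, which keeps the call cheap.
    with urlopen(core_url + "/admin/luke?numTerms=0&wt=json") as resp:
        return json.load(resp)["index"]


def overwritten_docs(index_info):
    """Docs that were overwritten or deleted and are still held in segments."""
    return index_info["maxDoc"] - index_info["numDocs"]


def fetch_gap(fetched, indexed):
    """Difference between what the crawler fetched and what Solr holds."""
    return fetched - indexed
```

With the figures from the original message, fetch_gap(23343 + 1146, 19250) gives 5239 fetched items that did not end up as separate docs. After a crawl into an empty index, overwritten_docs(fetch_index_info(core_url)) shows how much of that gap could be explained by duplicate IDs; whatever remains would have to come from the Nutch side.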