Re: What kind of nutch documents does Solr index?

Upayavira Mon, 28 Sep 2015 04:33:34 -0700

I suspect you may be better off asking this on the Nutch user list. The
decisions you are describing will be within the Nutch codebase, not
Solr. Someone here may know (hopefully) but you may get more support
over on the Nutch list.

One suggestion: start with a clean, empty index. Run a crawl. Look at
maxDoc vs numDocs (shown as "Max Doc" and "Num Docs" in the admin UI
for your core/collection). If maxDoc > numDocs, some docs have been
overwritten or deleted, i.e. Nutch is most likely sending several
documents with the same value in the ID field.
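
The same check can be scripted. Below is a minimal sketch, assuming Solr
is reachable at http://localhost:8983/solr and the core is named
collection1 (both are placeholders, not taken from your setup). It reads
the "index" section of the Luke request handler's response, which reports
numDocs and maxDoc. Segment merges can purge deleted docs, so the
difference is only a lower bound on the number of overwrites.

```python
import json
from urllib.request import urlopen


def overwritten_or_deleted(index_info):
    """Return maxDoc - numDocs from the "index" section of a Luke response."""
    return index_info["maxDoc"] - index_info["numDocs"]


def fetch_index_info(base_url="http://localhost:8983/solr", core="collection1"):
    """Fetch index statistics from the Luke request handler.

    base_url and core are assumed placeholders; adjust them to your install.
    """
    url = "{}/{}/admin/luke?numTerms=0&wt=json".format(base_url, core)
    with urlopen(url) as resp:
        return json.load(resp)["index"]


if __name__ == "__main__":
    info = fetch_index_info()
    gap = overwritten_or_deleted(info)
    print("numDocs={} maxDoc={} overwritten/deleted>={}".format(
        info["numDocs"], info["maxDoc"], gap))
```

A non-zero result right after a crawl into an empty index points at
duplicate IDs rather than at Solr dropping documents.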

Upayavira

On Mon, Sep 28, 2015, at 10:19 AM, Daniel Holmes wrote:
> Hi,
> I am using Apache Nutch 1.7 to crawl and Apache Solr 4.7.2 for indexing.
> In my tests there is a gap between the number of results fetched by
> Nutch and the number of documents indexed in Solr. For example, one of
> the crawls fetched 23343 pages and 1146 images successfully, while in
> Solr 19250 docs are indexed and 500 of them are image URLs.
> 
> My question is: what kind of pages are indexed in Solr, and why?
> Does Solr index pages with other statuses or not?
> What kind of images does Solr index?
> 
> Thanks.
