How to avoid indexing directory listings with nutch/solr

Paul Rogers Thu, 24 Jul 2014 11:48:34 -0700

Hi all

I'm trying to set up Nutch (1.8) and Solr (4.7.0) to index the content of
the documents on a web site (running on the same PC).  The documents are
PDFs.


The seed URL is http://localhost/documents/

which is a directory containing other (nested) directories, all containing a
varying number of PDFs.

Everything works fine but Solr also contains a document for each of the
directory listings.

I understand that I have to have Nutch crawl the directories to be able to
follow the "links" in them to the other directories and the documents
themselves.  I assume (tho' I haven't tried it) that if I filter the
directories out in regex-urlfilter.txt then Nutch will ignore them
completely and not follow the "links" to the other directories and the
files.

So my question is:  Is it possible to filter the "documents" Nutch parses
and only send a subset of them to Solr for indexing?
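
To make the distinction concrete, here is a rough sketch in Python of the
behaviour I'm after.  This is not Nutch code: the function names
should_follow and should_index and the example URLs under the seed are made
up purely for illustration.  The idea is that one rule decides what gets
crawled and a separate, stricter rule decides what gets indexed.

```python
from urllib.parse import urlparse

# Seed URL from the setup described above.
SEED_PREFIX = "http://localhost/documents/"


def should_follow(url: str) -> bool:
    """Crawl-time rule (hypothetical): fetch anything under the seed,
    including directory listings, so their links can be followed."""
    return url.startswith(SEED_PREFIX)


def should_index(url: str) -> bool:
    """Index-time rule (hypothetical): keep only PDF documents and
    skip directory listings."""
    path = urlparse(url).path
    return path.lower().endswith(".pdf")


# Example URLs are invented for illustration only.
urls = [
    "http://localhost/documents/",
    "http://localhost/documents/reports/",
    "http://localhost/documents/reports/q1.pdf",
]

# Everything is crawled, but only the PDF would be sent to the index.
to_index = [u for u in urls if should_follow(u) and should_index(u)]
```

In other words, the directory listings pass the first rule (so they are
fetched and their links followed) but fail the second (so they never reach
Solr).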

Many thanks

P
