Hi David,

Thanks for the response. As you suggest, I'm already deleting after the index, but it doesn't feel "right". There must be a better way.
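(For anyone following along: the "deleting after the index" step mentioned above can be scripted against Solr's delete-by-query update handler. This is only a sketch; the `collection1` core name, the `url` field, and the `*\/` wildcard pattern for directory-listing URLs are assumptions about the local Nutch/Solr setup, not something confirmed in this thread.)

```python
import urllib.request

# Assumed local Solr core; adjust to your install.
SOLR_UPDATE = "http://localhost:8983/solr/collection1/update?commit=true"


def build_delete_payload(field="url", pattern="*\\/"):
    """Return a Solr delete-by-query XML body.

    The default pattern targets documents whose URL ends with "/"
    (the auto-generated directory listings). The "url" field name
    is an assumption about the Nutch-generated schema.
    """
    return "<delete><query>%s:%s</query></delete>" % (field, pattern)


def delete_directory_listings():
    """POST the delete-by-query to Solr and return the HTTP status."""
    req = urllib.request.Request(
        SOLR_UPDATE,
        data=build_delete_payload().encode("utf-8"),
        headers={"Content-Type": "text/xml"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```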
P

On 24 July 2014 14:27, David Lachut <[email protected]> wrote:
> I use Elasticsearch instead of Solr, and there's likely a better way for
> both of us, but the solution I've settled on for now is a script to remove
> these from the index. The next step for me would be to define a percolator
> to recognize these pages and remove them more automatically.
>
> David
>
> -----Original Message-----
> From: Paul Rogers [mailto:[email protected]]
> Sent: Thursday, July 24, 2014 2:48 PM
> To: [email protected]
> Subject: How to avoid indexing directory listings with nutch/solr
>
> Hi all
>
> I'm trying to set up Nutch (1.8) and Solr (4.7.0) to index the content of
> the documents on a web site (running on the same PC). The documents are
> PDFs.
>
> The seed URL is http://localhost/documents/
>
> which is a directory containing other (nested) directories, all containing
> a varying number of PDFs.
>
> Everything works fine, but Solr also contains a document for each of the
> directory listings.
>
> I understand that I have to have Nutch crawl the directories to be able to
> follow the "links" in them to the other directories and the documents
> themselves. I assume (though I haven't tried it) that if I filter the
> directories out in the regex-urlfilter.xml then Nutch will ignore them
> completely and not follow the "links" to the other directories and the
> files.
>
> So my question is: Is it possible to filter the "documents" Nutch parses
> and only send a subset of them to Solr for indexing?
>
> Many thanks
>
> P
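(The "only send a subset to Solr" idea in the quoted question amounts to a predicate applied between parsing and indexing. Nutch's own way to do this is an IndexingFilter plugin in Java; the snippet below is just a language-neutral sketch of the decision logic, and the "directory listings end in /" heuristic is an assumption about this particular crawl.)

```python
def should_index(url):
    """Keep only the PDFs; skip auto-generated directory listings.

    Sketch of an index-time filter: directory-listing URLs are
    assumed to end in "/", while the wanted documents end in ".pdf".
    """
    return url.lower().endswith(".pdf") and not url.endswith("/")


# Example: the kind of URL mix described in the question.
docs = [
    "http://localhost/documents/",
    "http://localhost/documents/reports/",
    "http://localhost/documents/reports/a.pdf",
]
to_index = [u for u in docs if should_index(u)]
```

With this split, the crawler can still follow the directory "links" to reach the PDFs, while only the PDFs themselves are handed to Solr.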

