Hi all, I'm trying to set up Nutch (1.8) and Solr (4.7.0) to index the content of the documents on a website (running on the same PC). The documents are PDFs.
The seed URL is http://localhost/documents/, a directory containing other (nested) directories, each holding a varying number of PDFs. Everything works fine, but Solr also ends up with a document for each of the directory listings. I understand that Nutch has to crawl the directories so it can follow the "links" in them to the other directories and to the documents themselves. I assume (though I haven't tried it) that if I filter the directories out in regex-urlfilter.txt, Nutch will ignore them completely and not follow the "links" to the other directories and files.

So my question is: is it possible to filter the "documents" Nutch parses and send only a subset of them to Solr for indexing?

Many thanks
P
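For reference, the kind of regex-urlfilter.txt rule I mean would look something like this (a sketch only; it assumes the directory listings are the only URLs ending in a slash, and, as noted, it would also stop Nutch following the links inside them):

```
# Sketch: skip any URL ending in a slash (the directory listings).
# This also prevents Nutch from following the links inside those pages.
-/$

# Accept everything else (the PDFs).
+.
```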

