Hi David

Thanks for the response.  As you suggest, I'm already deleting after the
index, but it doesn't feel "right".  There must be a better way.
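
For what it's worth, my current cleanup step is roughly the sketch below. It
posts a delete-by-query to Solr's update handler to drop the directory-listing
documents after each crawl. The core name ("collection1"), the port, and
matching on an "Index of" title are all assumptions about my local setup, not
anything definitive:

```python
# Sketch of a post-index cleanup: remove Apache-style directory listings
# (pages whose title starts with "Index of") from the Solr index.
# Core name, port, and the title-based query are assumptions.
import urllib.request

SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update?commit=true"

def delete_listings_xml(query: str = 'title:"Index of"') -> str:
    """Build the XML body Solr's update handler expects for a delete-by-query."""
    return f"<delete><query>{query}</query></delete>"

def purge_directory_listings() -> None:
    """POST the delete request to the local Solr core (needs a running server)."""
    body = delete_listings_xml().encode("utf-8")
    req = urllib.request.Request(
        SOLR_UPDATE_URL,
        data=body,
        headers={"Content-Type": "text/xml"},
    )
    urllib.request.urlopen(req)

if __name__ == "__main__":
    purge_directory_listings()
```

It works, but it's exactly the kind of after-the-fact patching I'd like to avoid.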

P


On 24 July 2014 14:27, David Lachut <[email protected]> wrote:

> I use Elasticsearch instead of Solr, and there's likely a better way for
> both of us, but the solution I've settled on for now is a script to remove
> these from the index. Next step for me would be to define a percolator to
> recognize these pages and remove them more automatically.
>
> David
>
> -----Original Message-----
> From: Paul Rogers [mailto:[email protected]]
> Sent: Thursday, July 24, 2014 2:48 PM
> To: [email protected]
> Subject: How to avoid indexing directory listings with nutch/solr
>
> Hi all
>
> I'm trying to set up Nutch (1.8) and Solr (4.7.0) to index the content of
> the documents on a web site (running on the same PC).  The documents are
> PDFs.
>
> The seed url is http://localhost/documents/
>
> which is a directory containing other (nested) directories, each containing
> a varying number of PDFs.
>
> Everything works fine, but Solr also contains a document for each of the
> directory listings.
>
> I understand that I have to have Nutch crawl the directories to be able to
> follow the "links" in them to the other directories and the documents
> themselves.  I assume (though I haven't tried it) that if I filter the
> directories out in regex-urlfilter.txt then Nutch will ignore them
> completely and not follow the "links" to the other directories and the
> files.
>
> So my question is: is it possible to filter the "documents" Nutch parses
> and send only a subset of them to Solr for indexing?
>
> Many thanks
>
> P
>
