Hi, I know that there is an issue with Nutch when crawling the local filesystem, where it also crawls the parent directory.
So I ran a test today in which I put the directory that I actually wanted to crawl under another directory, e.g., I had:

/testfiles
/testfiles/foo ==> contained the content to be crawled

My thinking was that, even with the Nutch issue, it would just crawl the /testfiles directory, which was otherwise empty, and we'd be OK.

However, when I reviewed the Nutch log, I saw that it was also fetching directories like /opt/, /tmp/, etc. Mind you, it didn't fetch any of the CONTENTS of those directories, but it did fetch the directories themselves.

Has anyone else noticed this behavior? Also, will the suggested change to "org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse" at:

http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch

fix this problem?

Thanks,
Jim
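
To make the expected behavior concrete, here is a minimal Java sketch of the containment check I would expect a fix to enforce: a file: URL is followed only if it resolves to the crawl root or something beneath it. This is not Nutch code. The class name CrawlRootCheck and the method isUnderRoot are hypothetical, and whether the change to getDirAsHttpResponse behaves this way is exactly what is being asked above.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Nutch: illustrates the containment
// check only, using the /testfiles/foo layout from the test above.
public class CrawlRootCheck {

    // Returns true only if the file: URL resolves to the crawl root
    // or to a path beneath it. ".." segments are resolved first, so a
    // parent-directory link cannot escape the root.
    static boolean isUnderRoot(String fileUrl, String crawlRoot) {
        Path root = Paths.get(crawlRoot).normalize();
        Path target = Paths.get(URI.create(fileUrl).getPath()).normalize();
        // Path.startsWith compares whole path components, so
        // /testfiles/foobar is not treated as being under /testfiles/foo.
        return target.startsWith(root);
    }

    public static void main(String[] args) {
        String root = "/testfiles/foo";
        String[] urls = {
            "file:/testfiles/foo/doc.txt", // inside the crawl root
            "file:/testfiles/foo/../",     // parent-directory link
            "file:/opt/",                  // sibling of the root's ancestors
            "file:/tmp/"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + isUnderRoot(url, root));
        }
    }
}
```

With a check like this, the parent link from /testfiles/foo and directories such as /opt/ and /tmp/ would all be rejected before being fetched.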
