Hi,

I know that there is an issue with Nutch when crawling the local filesystem, 
where it also crawls the parent directory.

So I ran a test today in which I put the directory that I actually wanted to 
crawl inside an otherwise empty parent directory, e.g., I had:

/testfiles
/testfiles/foo ==> contained content to be crawled.

My thinking was that, even with the Nutch issue, it would only crawl the 
/testfiles directory, which was otherwise empty, and we'd be OK.

However, when I reviewed the Nutch log, I saw that it was also fetching 
directories like /opt/, /tmp/, etc.  Mind you, it didn't fetch any of the 
CONTENTS of those directories, but it did fetch the directories themselves.
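
To make the check against the log concrete, here is a minimal sketch of a 
script that flags fetched URLs falling outside the intended crawl root. It is 
not part of Nutch: the function name is_under_root, the sample URLs, and the 
assumption that the log shows file: URLs in this form are all hypothetical.

```python
from urllib.parse import urlparse
import posixpath


def is_under_root(url: str, root: str) -> bool:
    """Return True if a file: URL points at `root` or at something beneath it."""
    parsed = urlparse(url)
    if parsed.scheme != "file":
        return False
    # Normalize both paths so trailing slashes and "." / ".." segments
    # do not affect the comparison.
    path = posixpath.normpath(parsed.path)
    root = posixpath.normpath(root)
    # Compare against the root plus a slash so "/testfiles/foobar"
    # is not mistaken for a child of "/testfiles/foo".
    prefix = root.rstrip("/") + "/"
    return path == root or path.startswith(prefix)


# Hypothetical URLs of the kind described above (not copied from a real log).
fetched = [
    "file:/testfiles/foo/a.txt",
    "file:/testfiles/",
    "file:/opt/",
    "file:/tmp/",
]

# Everything the crawler touched that lies outside the directory I wanted.
outside = [u for u in fetched if not is_under_root(u, "/testfiles/foo")]
```

With the sample list, outside would hold the /testfiles/, /opt/ and /tmp/ 
entries, which is the pattern I am seeing.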

Has anyone else noticed this behavior? 

Also, will the suggested change to 
"org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse" at:

http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch

fix this problem?

Thanks,
Jim

