Question about crawling local filesystem and directories

ohaya Thu, 16 Jul 2009 13:58:19 -0700

Hi,

I know that there is an issue with Nutch when crawling the local filesystem, 
where it also crawls the parent directory.


So, I did a test today, where I put the directory that I actually wanted to 
crawl under a directory, e.g., I had:

/testfiles
/testfiles/foo ==> contained content to be crawled.

My thinking was that even with the Nutch issue it would just crawl /testfiles 
directory, which was empty, and we'd be ok.

However, when I reviewed the Nutch log, I saw that it was also fetching 
directories like /opt/, /tmp/, etc.  Mind you, it didn't fetch any of the 
CONTENTS of those directories, but it did fetch the directories themselves.
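
To make the behavior I was expecting concrete, here is a rough sketch of a 
scope check. It is plain Python, not Nutch code. The function name in_scope, 
the root /testfiles/foo, and the sample outlinks are all made up for 
illustration, and it assumes the directory listing hands the crawler a link 
to the parent directory.

```python
from urllib.parse import urlparse
import posixpath


def in_scope(url, root="/testfiles/foo"):
    """Return True if a file: URL points at root or at something beneath it."""
    parsed = urlparse(url)
    if parsed.scheme != "file":
        return False
    # Normalise the path so "../" segments cannot climb out of the root unnoticed.
    path = posixpath.normpath(parsed.path)
    root = posixpath.normpath(root)
    return path == root or path.startswith(root + "/")


# Hypothetical outlinks, including a parent-directory link and sibling directories.
outlinks = [
    "file:///testfiles/foo/a.txt",
    "file:///testfiles/foo/../",
    "file:///testfiles/",
    "file:///opt/",
]

# Only URLs under the root survive the filter.
kept = [u for u in outlinks if in_scope(u)]
print(kept)
```

With a check like this applied to outlinks, I would expect neither /testfiles 
nor /opt/ to be fetched at all.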

Has anyone else noticed this behavior? 

Also, will the suggested change to 
"org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse" at:

http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch

fix this problem?

Thanks,
Jim
