Hello, 

I am crawling my local file system using Nutch 1.0.

In my URL file, I put:
file://localhost/data/

During the crawl, Nutch parses the directories in order to find outlinks, 
which is fine:

2009-04-29 15:17:56,468 INFO  fetcher.OldFetcher - fetching 
file://localhost/data/
2009-04-29 15:17:56,793 DEBUG file.File - fetching file://localhost/data/
2009-04-29 15:18:18,501 DEBUG parse.ParseUtil - Parsing 
[file://localhost/data/] with [org.apache.nutch.parse.html.htmlpar...@e53220]
2009-04-29 15:18:18,511 DEBUG util.EncodingDetector - file://localhost/data/: 
charset windows-1252 (detect,33% confidence)
2009-04-29 15:18:18,511 DEBUG util.EncodingDetector - file://localhost/data/: 
charset windows-1252 (detect,32% confidence)
2009-04-29 15:18:18,512 DEBUG util.EncodingDetector - file://localhost/data/: 
charset windows-1252 (detect,32% confidence)
2009-04-29 15:18:18,512 DEBUG util.EncodingDetector - file://localhost/data/: 
charset windows-1252 (detect,29% confidence)
2009-04-29 15:18:18,512 DEBUG util.EncodingDetector - file://localhost/data/: 
charset windows-1252 (detect,23% confidence)
2009-04-29 15:18:18,512 DEBUG util.EncodingDetector - file://localhost/data/: 
charset windows-1252 (detect,20% confidence)
2009-04-29 15:18:18,513 DEBUG util.EncodingDetector - file://localhost/data/: 
charset windows-1252 (detect,19% confidence)
2009-04-29 15:18:18,513 DEBUG util.EncodingDetector - file://localhost/data/: 
charset iso-8859-2 (detect, 16% confidence)
2009-04-29 15:18:18,513 DEBUG util.EncodingDetector - file://localhost/data/: 
charset windows-1252 (detect,16% confidence)
2009-04-29 15:18:18,514 DEBUG util.EncodingDetector - file://localhost/data/: 
charset iso-8859-9 (detect, 14% confidence)
2009-04-29 15:18:18,514 DEBUG util.EncodingDetector - file://localhost/data/: 
charset windows-1252 (detect,13% confidence)
2009-04-29 15:18:18,514 DEBUG util.EncodingDetector - file://localhost/data/: 
charset windows-1252 (detect,13% confidence)
2009-04-29 15:18:18,515 DEBUG util.EncodingDetector - file://localhost/data/: 
charset iso-8859-2 (detect, 11% confidence)
2009-04-29 15:18:18,515 DEBUG util.EncodingDetector - file://localhost/data/: 
charset big5 (detect, 10% confidence)
2009-04-29 15:18:18,515 DEBUG util.EncodingDetector - file://localhost/data/: 
charset x-windows-949 (detect, 10% confidence)
2009-04-29 15:18:18,516 DEBUG util.EncodingDetector - file://localhost/data/: 
charset euc-jp (detect, 10% confidence)
2009-04-29 15:18:18,516 DEBUG util.EncodingDetector - file://localhost/data/: 
charset gb18030 (detect, 10% confidence)
2009-04-29 15:18:18,516 DEBUG util.EncodingDetector - file://localhost/data/: 
charset shift_jis (detect, 10% confidence)
2009-04-29 15:18:18,517 DEBUG util.EncodingDetector - file://localhost/data/: 
charset utf-8 (detect, 10% confidence)
2009-04-29 15:18:18,517 DEBUG util.EncodingDetector - file://localhost/data/: 
charset iso-8859-2 (detect, 8% confidence)
2009-04-29 15:18:18,517 DEBUG util.EncodingDetector - file://localhost/data/: 
charset iso-8859-2 (detect, 4% confidence)
2009-04-29 15:18:18,517 DEBUG util.EncodingDetector - file://localhost/data/: 
Choosing encoding: utf-8 (default)
2009-04-29 15:18:18,553 DEBUG parse.html - Meta tags for 
file://localhost/data/: base=null, noCache=false, noFollow=false, 
noIndex=false, refresh=false, refreshHref=null
2009-04-29 15:18:18,560 DEBUG parse.html - found 6 outlinks in 
file://localhost/data/

However, Nutch is also trying to index the directory itself:

2009-04-29 15:20:27,283 DEBUG indexer.Indexer - Indexing 
[file://localhost/data/] with analyzer 
org.apache.nutch.analysis.en.englishanaly...@17c2891 (en)

Is there a way to tell Nutch to extract outlinks from directories without 
trying to index the directories themselves?
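
To make the question concrete, here is a rough sketch of the behaviour I am after. It is plain Python used only to show the logic, not Nutch code: `is_directory_url` and `urls_to_index` are hypothetical helper names, and the trailing-slash test is my assumption about how directory URLs look (as in the log above), not something Nutch guarantees.

```python
from urllib.parse import urlparse


def is_directory_url(url: str) -> bool:
    """Guess whether a file: URL is a directory.

    Assumption: directory URLs end with a trailing slash,
    e.g. file://localhost/data/
    """
    parsed = urlparse(url)
    return parsed.scheme == "file" and parsed.path.endswith("/")


def urls_to_index(fetched_urls):
    """Return the fetched URLs that should reach the indexer.

    Directories are still fetched and parsed for outlinks elsewhere;
    this step only keeps them out of the index.
    """
    return [url for url in fetched_urls if not is_directory_url(url)]
```

With this, `file://localhost/data/` would be parsed for outlinks but left out of the index, while a file such as `file://localhost/data/report.pdf` (a made-up example path) would be indexed as usual.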

Any help would be greatly appreciated.
Regards, 
Vincent
