Hello,
I am crawling my local file system using Nutch 1.0.
In my URL file, I put:
file://localhost/data/
During the crawl, Nutch parses the directories in order to find outlinks,
which is fine:
2009-04-29 15:17:56,468 INFO fetcher.OldFetcher - fetching
file://localhost/data/
2009-04-29 15:17:56,793 DEBUG file.File - fetching file://localhost/data/
2009-04-29 15:18:18,501 DEBUG parse.ParseUtil - Parsing
[file://localhost/data/] with [org.apache.nutch.parse.html.htmlpar...@e53220]
2009-04-29 15:18:18,511 DEBUG util.EncodingDetector - file://localhost/data/:
charset windows-1252 (detect,33% confidence)
2009-04-29 15:18:18,511 DEBUG util.EncodingDetector - file://localhost/data/:
charset windows-1252 (detect,32% confidence)
2009-04-29 15:18:18,512 DEBUG util.EncodingDetector - file://localhost/data/:
charset windows-1252 (detect,32% confidence)
2009-04-29 15:18:18,512 DEBUG util.EncodingDetector - file://localhost/data/:
charset windows-1252 (detect,29% confidence)
2009-04-29 15:18:18,512 DEBUG util.EncodingDetector - file://localhost/data/:
charset windows-1252 (detect,23% confidence)
2009-04-29 15:18:18,512 DEBUG util.EncodingDetector - file://localhost/data/:
charset windows-1252 (detect,20% confidence)
2009-04-29 15:18:18,513 DEBUG util.EncodingDetector - file://localhost/data/:
charset windows-1252 (detect,19% confidence)
2009-04-29 15:18:18,513 DEBUG util.EncodingDetector - file://localhost/data/:
charset iso-8859-2 (detect, 16% confidence)
2009-04-29 15:18:18,513 DEBUG util.EncodingDetector - file://localhost/data/:
charset windows-1252 (detect,16% confidence)
2009-04-29 15:18:18,514 DEBUG util.EncodingDetector - file://localhost/data/:
charset iso-8859-9 (detect, 14% confidence)
2009-04-29 15:18:18,514 DEBUG util.EncodingDetector - file://localhost/data/:
charset windows-1252 (detect,13% confidence)
2009-04-29 15:18:18,514 DEBUG util.EncodingDetector - file://localhost/data/:
charset windows-1252 (detect,13% confidence)
2009-04-29 15:18:18,515 DEBUG util.EncodingDetector - file://localhost/data/:
charset iso-8859-2 (detect, 11% confidence)
2009-04-29 15:18:18,515 DEBUG util.EncodingDetector - file://localhost/data/:
charset big5 (detect, 10% confidence)
2009-04-29 15:18:18,515 DEBUG util.EncodingDetector - file://localhost/data/:
charset x-windows-949 (detect, 10% confidence)
2009-04-29 15:18:18,516 DEBUG util.EncodingDetector - file://localhost/data/:
charset euc-jp (detect, 10% confidence)
2009-04-29 15:18:18,516 DEBUG util.EncodingDetector - file://localhost/data/:
charset gb18030 (detect, 10% confidence)
2009-04-29 15:18:18,516 DEBUG util.EncodingDetector - file://localhost/data/:
charset shift_jis (detect, 10% confidence)
2009-04-29 15:18:18,517 DEBUG util.EncodingDetector - file://localhost/data/:
charset utf-8 (detect, 10% confidence)
2009-04-29 15:18:18,517 DEBUG util.EncodingDetector - file://localhost/data/:
charset iso-8859-2 (detect, 8% confidence)
2009-04-29 15:18:18,517 DEBUG util.EncodingDetector - file://localhost/data/:
charset iso-8859-2 (detect, 4% confidence)
2009-04-29 15:18:18,517 DEBUG util.EncodingDetector - file://localhost/data/:
Choosing encoding: utf-8 (default)
2009-04-29 15:18:18,553 DEBUG parse.html - Meta tags for
file://localhost/data/: base=null, noCache=false, noFollow=false,
noIndex=false, refresh=false, refreshHref=null
2009-04-29 15:18:18,560 DEBUG parse.html - found 6 outlinks in
file://localhost/data/
However, Nutch is also trying to index the directory itself:
2009-04-29 15:20:27,283 DEBUG indexer.Indexer - Indexing
[file://localhost/data/] with analyzer
org.apache.nutch.analysis.en.englishanaly...@17c2891 (en)
Is there a way to tell Nutch to find outlinks from directories without trying
to index them?
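To make the intent concrete, here is a minimal sketch of the behavior I am after. It is plain Python and does not use any Nutch API; the function names, the in-memory `listing` dict, and the rule "a file: URL ending in / is a directory" are all assumptions made for illustration only.

```python
def is_directory_url(url: str) -> bool:
    # Assumption: a file: URL ending in '/' denotes a directory listing.
    return url.startswith("file:") and url.endswith("/")


def crawl(seed: str, listing: dict) -> list:
    """Follow outlinks from directories, but collect only non-directory URLs.

    `listing` is a hypothetical stand-in for the parser: it maps a
    directory URL to the outlinks found in that directory.
    """
    to_visit, seen, indexed = [seed], set(), []
    while to_visit:
        url = to_visit.pop()
        if url in seen:
            continue
        seen.add(url)
        if is_directory_url(url):
            # Directories are only used to discover outlinks.
            to_visit.extend(listing.get(url, []))
        else:
            # Only regular files would be handed to the indexer.
            indexed.append(url)
    return sorted(indexed)
```

In other words, file://localhost/data/ would be fetched and parsed for its outlinks, but only the files found beneath it would reach the indexer.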
Any help would be greatly appreciated.
Regards,
Vincent