Hello, I am crawling my local file system using Nutch 1.0.
In my URL file, I put:

file://localhost/data/

During the crawl, Nutch parses the directories in order to find outlinks, which is fine:

2009-04-29 15:17:56,468 INFO fetcher.OldFetcher - fetching file://localhost/data/
2009-04-29 15:17:56,793 DEBUG file.File - fetching file://localhost/data/
2009-04-29 15:18:18,501 DEBUG parse.ParseUtil - Parsing [file://localhost/data/] with [org.apache.nutch.parse.html.htmlpar...@e53220]
2009-04-29 15:18:18,511 DEBUG util.EncodingDetector - file://localhost/data/: charset windows-1252 (detect,33% confidence)
2009-04-29 15:18:18,511 DEBUG util.EncodingDetector - file://localhost/data/: charset windows-1252 (detect,32% confidence)
2009-04-29 15:18:18,512 DEBUG util.EncodingDetector - file://localhost/data/: charset windows-1252 (detect,32% confidence)
2009-04-29 15:18:18,512 DEBUG util.EncodingDetector - file://localhost/data/: charset windows-1252 (detect,29% confidence)
2009-04-29 15:18:18,512 DEBUG util.EncodingDetector - file://localhost/data/: charset windows-1252 (detect,23% confidence)
2009-04-29 15:18:18,512 DEBUG util.EncodingDetector - file://localhost/data/: charset windows-1252 (detect,20% confidence)
2009-04-29 15:18:18,513 DEBUG util.EncodingDetector - file://localhost/data/: charset windows-1252 (detect,19% confidence)
2009-04-29 15:18:18,513 DEBUG util.EncodingDetector - file://localhost/data/: charset iso-8859-2 (detect, 16% confidence)
2009-04-29 15:18:18,513 DEBUG util.EncodingDetector - file://localhost/data/: charset windows-1252 (detect,16% confidence)
2009-04-29 15:18:18,514 DEBUG util.EncodingDetector - file://localhost/data/: charset iso-8859-9 (detect, 14% confidence)
2009-04-29 15:18:18,514 DEBUG util.EncodingDetector - file://localhost/data/: charset windows-1252 (detect,13% confidence)
2009-04-29 15:18:18,514 DEBUG util.EncodingDetector - file://localhost/data/: charset windows-1252 (detect,13% confidence)
2009-04-29 15:18:18,515 DEBUG util.EncodingDetector - file://localhost/data/: charset iso-8859-2 (detect, 11% confidence)
2009-04-29 15:18:18,515 DEBUG util.EncodingDetector - file://localhost/data/: charset big5 (detect, 10% confidence)
2009-04-29 15:18:18,515 DEBUG util.EncodingDetector - file://localhost/data/: charset x-windows-949 (detect, 10% confidence)
2009-04-29 15:18:18,516 DEBUG util.EncodingDetector - file://localhost/data/: charset euc-jp (detect, 10% confidence)
2009-04-29 15:18:18,516 DEBUG util.EncodingDetector - file://localhost/data/: charset gb18030 (detect, 10% confidence)
2009-04-29 15:18:18,516 DEBUG util.EncodingDetector - file://localhost/data/: charset shift_jis (detect, 10% confidence)
2009-04-29 15:18:18,517 DEBUG util.EncodingDetector - file://localhost/data/: charset utf-8 (detect, 10% confidence)
2009-04-29 15:18:18,517 DEBUG util.EncodingDetector - file://localhost/data/: charset iso-8859-2 (detect, 8% confidence)
2009-04-29 15:18:18,517 DEBUG util.EncodingDetector - file://localhost/data/: charset iso-8859-2 (detect, 4% confidence)
2009-04-29 15:18:18,517 DEBUG util.EncodingDetector - file://localhost/data/: Choosing encoding: utf-8 (default)
2009-04-29 15:18:18,553 DEBUG parse.html - Meta tags for file://localhost/data/: base=null, noCache=false, noFollow=false, noIndex=false, refresh=false, refreshHref=null
2009-04-29 15:18:18,560 DEBUG parse.html - found 6 outlinks in file://localhost/data/

However, Nutch is also trying to index the directory itself:

2009-04-29 15:20:27,283 DEBUG indexer.Indexer - Indexing [file://localhost/data/] with analyzer org.apache.nutch.analysis.en.englishanaly...@17c2891 (en)

Is there a way to tell Nutch to find outlinks from directories without trying to index them?

Any help would be greatly appreciated.

Regards,
Vincent
