Thank you Dennis,

OK, I'll have a look at the webgraph tools, and I'll post the solution.

I don't use the crawl command; I use the recrawl script explained by Susam Pal 
(thanks to her too!).

Regards
Vincent.

----- Original Message -----
From: "Dennis Kubes" <[email protected]>
To: [email protected]
Sent: Thursday, 30 April 2009 15:42:49 GMT +01:00 Amsterdam / Berlin / Bern / 
Rome / Stockholm / Vienna
Subject: Re: Is it possible to avoid Nutch 1.0 from indexing local directories ?

Without fetching, no.  Without indexing, yes.  You can run the fetcher on 
these directories, then use the webgraph tools to find just the inlinks or 
outlinks.

It looks like below you are probably using the crawl command which 
performs the entire stack from fetching and parsing to indexing.  You 
can run the commands individually to avoid indexing if you like.
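
As a rough sketch of what running the steps individually could look like: 
the script below goes through inject, generate, fetch, updatedb and 
invertlinks, and simply leaves out the index step. The crawl/ and urls/ 
directory names, the DRY_RUN switch, and the run and fetch_without_indexing 
helpers are assumptions made for illustration and are not part of Nutch. It 
also assumes the fetcher parses while fetching; if parsing is turned off in 
your configuration, a separate parse step would be needed before updatedb.

```shell
#!/bin/sh
# Sketch only: directory names and helper functions are hypothetical.
NUTCH="${NUTCH:-bin/nutch}"        # path to the nutch launcher script
CRAWL_DIR="${CRAWL_DIR:-crawl}"    # assumed crawl directory
URL_DIR="${URL_DIR:-urls}"         # assumed directory holding the seed URL file
DRY_RUN="${DRY_RUN:-1}"            # 1 = only print the commands, do not run them

# Print a command, and execute it unless DRY_RUN is 1.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

fetch_without_indexing() {
  run "$NUTCH" inject "$CRAWL_DIR/crawldb" "$URL_DIR"
  run "$NUTCH" generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments"

  if [ "$DRY_RUN" = "1" ]; then
    # Placeholder name; a real run gets a timestamped segment directory.
    segment="$CRAWL_DIR/segments/SEGMENT"
  else
    # Pick the most recently generated segment.
    segment=$(ls -d "$CRAWL_DIR"/segments/* | tail -1)
  fi

  run "$NUTCH" fetch "$segment"
  run "$NUTCH" updatedb "$CRAWL_DIR/crawldb" "$segment"
  run "$NUTCH" invertlinks "$CRAWL_DIR/linkdb" -dir "$CRAWL_DIR/segments"
  # Deliberately no "index" step: nothing is written to a Lucene index.
}

fetch_without_indexing
```

With DRY_RUN left at 1 the script only prints the five commands, so the 
sequence can be checked before anything is run with DRY_RUN=0.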

Dennis

[email protected] wrote:
> Hello, 
> 
> I am crawling my local file system using Nutch 1.0.
> 
> In my URL file, I put
> file://localhost/data/
> 
> During the crawl, Nutch parses the directories in order to find outlinks, 
> which is all right:
> 
> 2009-04-29 15:17:56,468 INFO  fetcher.OldFetcher - fetching 
> file://localhost/data/
> 2009-04-29 15:17:56,793 DEBUG file.File - fetching file://localhost/data/
> 2009-04-29 15:18:18,501 DEBUG parse.ParseUtil - Parsing 
> [file://localhost/data/] with [org.apache.nutch.parse.html.htmlpar...@e53220]
> 2009-04-29 15:18:18,511 DEBUG util.EncodingDetector - file://localhost/data/: 
> charset windows-1252 (detect,33% confidence)
> 2009-04-29 15:18:18,511 DEBUG util.EncodingDetector - file://localhost/data/: 
> charset windows-1252 (detect,32% confidence)
> 2009-04-29 15:18:18,512 DEBUG util.EncodingDetector - file://localhost/data/: 
> charset windows-1252 (detect,32% confidence)
> 2009-04-29 15:18:18,512 DEBUG util.EncodingDetector - file://localhost/data/: 
> charset windows-1252 (detect,29% confidence)
> 2009-04-29 15:18:18,512 DEBUG util.EncodingDetector - file://localhost/data/: 
> charset windows-1252 (detect,23% confidence)
> 2009-04-29 15:18:18,512 DEBUG util.EncodingDetector - file://localhost/data/: 
> charset windows-1252 (detect,20% confidence)
> 2009-04-29 15:18:18,513 DEBUG util.EncodingDetector - file://localhost/data/: 
> charset windows-1252 (detect,19% confidence)
> 2009-04-29 15:18:18,513 DEBUG util.EncodingDetector - file://localhost/data/: 
> charset iso-8859-2 (detect, 16% confidence)
> 2009-04-29 15:18:18,513 DEBUG util.EncodingDetector - file://localhost/data/: 
> charset windows-1252 (detect,16% confidence)
> 2009-04-29 15:18:18,514 DEBUG util.EncodingDetector - file://localhost/data/: 
> charset iso-8859-9 (detect, 14% confidence)
> 2009-04-29 15:18:18,514 DEBUG util.EncodingDetector - file://localhost/data/: 
> charset windows-1252 (detect,13% confidence)
> 2009-04-29 15:18:18,514 DEBUG util.EncodingDetector - file://localhost/data/: 
> charset windows-1252 (detect,13% confidence)
> 2009-04-29 15:18:18,515 DEBUG util.EncodingDetector - file://localhost/data/: 
> charset iso-8859-2 (detect, 11% confidence)
> 2009-04-29 15:18:18,515 DEBUG util.EncodingDetector - file://localhost/data/: 
> charset big5 (detect, 10% confidence)
> 2009-04-29 15:18:18,515 DEBUG util.EncodingDetector - file://localhost/data/: 
> charset x-windows-949 (detect, 10% confidence)
> 2009-04-29 15:18:18,516 DEBUG util.EncodingDetector - file://localhost/data/: 
> charset euc-jp (detect, 10% confidence)
> 2009-04-29 15:18:18,516 DEBUG util.EncodingDetector - file://localhost/data/: 
> charset gb18030 (detect, 10% confidence)
> 2009-04-29 15:18:18,516 DEBUG util.EncodingDetector - file://localhost/data/: 
> charset shift_jis (detect, 10% confidence)
> 2009-04-29 15:18:18,517 DEBUG util.EncodingDetector - file://localhost/data/: 
> charset utf-8 (detect, 10% confidence)
> 2009-04-29 15:18:18,517 DEBUG util.EncodingDetector - file://localhost/data/: 
> charset iso-8859-2 (detect, 8% confidence)
> 2009-04-29 15:18:18,517 DEBUG util.EncodingDetector - file://localhost/data/: 
> charset iso-8859-2 (detect, 4% confidence)
> 2009-04-29 15:18:18,517 DEBUG util.EncodingDetector - file://localhost/data/: 
> Choosing encoding: utf-8 (default)
> 2009-04-29 15:18:18,553 DEBUG parse.html - Meta tags for 
> file://localhost/data/: base=null, noCache=false, noFollow=false, 
> noIndex=false, refresh=false, refreshHref=null
> 2009-04-29 15:18:18,560 DEBUG parse.html - found 6 outlinks in 
> file://localhost/data/
> 
> However, Nutch is also trying to index the directory itself:
> 
> 2009-04-29 15:20:27,283 DEBUG indexer.Indexer - Indexing 
> [file://localhost/data/] with analyzer 
> org.apache.nutch.analysis.en.englishanaly...@17c2891 (en)
> 
> Is there a way to tell Nutch to find outlinks from directories without 
> trying to index them?
> 
> Any help would be greatly appreciated.
> Regards, 
> Vincent
