Disable the HTML parser in nutch-site and keep only your own parser.
You can also add this to your URL filter file: -(htm|html)$
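
To show what that filter line does, here is a rough Python sketch. It is not Nutch code. It assumes the usual behaviour of a regex URL filter file, where "-" rejects, "+" accepts, and the first matching rule wins. The RULES list, the catch-all accept rule, the accepts() helper, and the example.com URLs are all made up for illustration.

```python
import re

# Hypothetical rule list modelled on a regex URL filter file.
# Assumption: '-' rejects, '+' accepts, and the first matching rule wins.
RULES = [
    ("-", re.compile(r"(htm|html)$")),  # the rule suggested in this reply
    ("+", re.compile(r".")),            # assumed catch-all accept rule
]


def accepts(url):
    """Return True if the first rule that matches the URL is an accept rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is rejected


if __name__ == "__main__":
    for url in ("http://example.com/index.html",
                "http://example.com/places.kml"):
        print(url, accepts(url))
```

With these rules, URLs ending in htm or html are rejected and the .kml URL goes through. Since the pattern has no leading dot, it rejects any URL that ends in those letters, not only ones with that file extension.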

thx



> Date: Mon, 26 Oct 2009 17:53:11 +0300
> Subject: How to index files only with specific type
> From: dfun...@gmail.com
> To: nutch-user@lucene.apache.org
> 
> Hi, I've created a parser and indexer for a specific file type (geo XML
> meta file, KML).
> I am trying to crawl a couple of sites and index only files of this type.
> I don't want to index html or anything else.
> How can I achieve this?
> Thanks.
                                          
