TDLN wrote:
I think the nutch readdb command only gives statistics for the crawldb
(crawled Pages) and not the index.


That's correct. You can use the Lucene API to retrieve the number of
documents in the index; it's quite simple.

Something like this (you can run it as a BeanShell (BSH) script, or
compile it as its own class). I'm writing this from memory, so there may
be some errors:

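import org.apache.lucene.index.IndexReader;

public class IndexStats {
   public static void main(String[] args) throws Exception {
      if (args.length != 1) {
         System.err.println("Usage: IndexStats <index directory>");
         System.exit(1);
      }
      // Open the Lucene index stored at the given path.
      IndexReader ir = IndexReader.open(args[0]);
      try {
         System.out.println("Number of documents: " + ir.numDocs());
      } finally {
         ir.close(); // release the index files even if counting fails
      }
   }
}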
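
One thing to keep in mind: numDocs() in the snippet above leaves out
documents that are marked as deleted, while IndexReader.maxDoc() returns
one more than the highest document number, so it still covers deleted
documents until their segments are merged. The difference between the two
is the number of deleted documents. Here is a minimal sketch of that
arithmetic, written without any Lucene dependency so it runs on its own.
The class IndexCounts, its methods, and the sample values 1000 and 950 are
made up for illustration; in a real run you would pass in ir.maxDoc() and
ir.numDocs().

```java
// Hypothetical helper for illustration; not part of Lucene or Nutch.
final class IndexCounts {

    // maxDoc:  the value returned by IndexReader.maxDoc()
    // numDocs: the value returned by IndexReader.numDocs()
    static int deletedDocs(int maxDoc, int numDocs) {
        if (numDocs < 0 || maxDoc < numDocs) {
            throw new IllegalArgumentException(
                "expected 0 <= numDocs <= maxDoc, got numDocs=" + numDocs
                + ", maxDoc=" + maxDoc);
        }
        return maxDoc - numDocs;
    }

    // One-line report of live and deleted document counts.
    static String summary(int maxDoc, int numDocs) {
        return "live=" + numDocs
            + ", deleted=" + deletedDocs(maxDoc, numDocs)
            + ", maxDoc=" + maxDoc;
    }

    public static void main(String[] args) {
        // Sample values only; substitute ir.maxDoc() and ir.numDocs().
        System.out.println(summary(1000, 950));
    }
}
```

If the two numbers are equal, the index currently holds no deleted
documents.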


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




