Re: [Nutch-general] How to determine the number of pages in the index?

Enzo Michelangeli Sat, 28 Jul 2007 04:01:20 -0700

Thanks. I was hoping for something already written, but I'm afraid I'll have 
to follow your suggestion...


By the way, at least in my case (all pages fetched over HTTP), Luke shows 
that the "Number of documents" is exactly equal to the frequency of the term 
"http" in the "url" field, so this also works as a workaround:

bin/nutch org.apache.nutch.searcher.NutchBean url:http \
| sed -n -e 's/Total hits: //p'
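
For use inside a shell script, the pipeline above could be wrapped roughly as 
follows. This is only a sketch: the function names parse_total_hits and 
count_indexed_pages are made up for illustration, and it assumes that 
NutchBean keeps printing a line of the form "Total hits: N" and that the 
script is run from the Nutch installation directory.

```shell
#!/bin/sh

# Hypothetical helper: reads NutchBean output on stdin and prints the number
# from the first "Total hits: N" line. Returns non-zero if no such line is
# present, so a calling script can detect a failed query.
parse_total_hits() {
    awk '/^Total hits: / { print $3; found = 1; exit } END { exit !found }'
}

# Hypothetical wrapper: runs the query from this message and prints the count.
# Assumption: every indexed page has the term "http" in its "url" field,
# which held for an index built from pages fetched over HTTP only.
count_indexed_pages() {
    bin/nutch org.apache.nutch.searcher.NutchBean url:http | parse_total_hits
}
```

A script would then call something like pages=$(count_indexed_pages) || exit 1. 
Note that the count is only as good as the url:http assumption; for an index 
with pages fetched over other protocols, the IndexReader.numDocs() approach 
suggested below is the safer route.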

Enzo

----- Original Message ----- 
From: "DES" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Saturday, July 28, 2007 5:43 PM
Subject: Re: How to determine the number of pages in the index?


> look at org.apache.lucene.index.IndexReader.numDocs() method. You can
> write a simple utility to run it in the shell.
>
> On 7/28/07, Enzo Michelangeli <[EMAIL PROTECTED]> wrote:
>> Is there a quick way of knowing how many pages are indexed (_not_ how many
>> are referenced in crawldb as fetched URLs)? I could use Luke to peek inside
>> the indexes and get the "Number of documents", but they are located on a
>> remote headless server with only SSH access... (OK, I actually did access
>> them using Sftpdrive, but I'd like to have a command line to invoke in a
>> shell script...)
>>
>> Enzo
>>
>>
> 


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general