On Jun 26, 2007, at 3:22 PM, rubdabadub wrote:
>>
>> I currently use Sami's SolrIndexer with the trunk solrj, and we have
>> a single Solr index of about 5m pages on a single 4GB machine, with
>> stored content. Although the indexing is fast and stable, complicated
>> full text queries are too slow for comfort (forget about MLT/faceting
>> etc.) We are currently looking into ways of partitioning this and we
>> may be of service in the future here.
>
> Brian, just wondering: wouldn't slow searching be more of a Solr
> issue? I know some Solr sites have more than 5m docs, no? Are you
> doing something special? I am very curious to know. We are looking
> into implementing Solr in production, and so far so good. However,
> we are only dealing with 10 fields and 3 million Lucene docs.
>

The Solr installations I know of with many millions of docs don't
have hundreds of KB of text per doc. The "special" thing I'm doing is
storing the parse text from the Nutch crawls (and other sources),
which we need for various reasons. We have an extraordinary number of
unique tokens, which turns Solr/Lucene into a disk seek speed test.
Full text search is certainly possible, even with stored content, but
I am seeing QTime (the milliseconds Solr reports to process and
return a query) degrade sharply after we crossed the 2-3m document
mark. It's currently ~200-1000ms for uncached single-term queries on
a very nice server with lots of heap. Not tenable for a real-time use
case (but we don't use it in this manner).
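
For reference, here's a minimal SolrJ sketch for sampling QTime on
uncached single-term queries. This assumes a current SolrJ client
(HttpSolrClient) and a hypothetical core URL, not the exact trunk
solrj API we run:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class QTimeSampler {
    public static void main(String[] args) throws Exception {
        // Hypothetical core URL; adjust to your installation.
        HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/nutch").build();

        String[] terms = {"apache", "crawler", "tokenizer"};
        for (String term : terms) {
            SolrQuery q = new SolrQuery("content:" + term);
            q.setRows(10);
            QueryResponse rsp = client.query(q);
            // getQTime() is Solr's server-side processing time in ms;
            // getElapsedTime() includes the network round trip.
            System.out.printf("%-12s qtime=%dms elapsed=%dms hits=%d%n",
                    term, rsp.getQTime(), rsp.getElapsedTime(),
                    rsp.getResults().getNumFound());
        }
        client.close();
    }
}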

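On the partitioning front mentioned above: one route is Solr's
distributed search via the shards request parameter (which landed in
Solr 1.3, after this thread). A rough sketch, assuming two
hypothetical shard cores both named "pages":

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ShardedQuery {
    public static void main(String[] args) throws Exception {
        // Any shard can coordinate the distributed request.
        HttpSolrClient client = new HttpSolrClient.Builder(
                "http://shard1:8983/solr/pages").build();

        SolrQuery q = new SolrQuery("content:nutch");
        // Hypothetical shard list; Solr fans the query out to every
        // shard and merges the ranked results before returning.
        q.set("shards",
              "shard1:8983/solr/pages,shard2:8983/solr/pages");
        QueryResponse rsp = client.query(q);
        System.out.println("merged hits: "
                + rsp.getResults().getNumFound());
        client.close();
    }
}

Note that distributed search only merges results at query time;
splitting the documents across shards at index time (e.g. by hashing
the URL) is still up to the indexer.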

