Scaling Lucene to 500 million+ documents - preferred architecture

2007-07-07 Thread muraalee
Hi Everybody, We are building a search infrastructure using Lucene to scale up to 500 million documents with search < 500 ms. Here is my rough math on the size of content and index: Total Documents = 500 million documents; Size / Document = 10k / document; Index Size / Million = 2 GB / million documen
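
The totals implied by the figures in this post can be checked with a few lines of arithmetic. This is only a rough sketch: it assumes "10k" means 10 KB per document and uses decimal units, neither of which the post states, and the variable names are made up for illustration.

```python
# Back-of-the-envelope totals from the figures quoted in the post.
# Assumptions (not stated in the post): "10k" = 10 KB, decimal units.
total_docs = 500_000_000          # 500 million documents
doc_size_kb = 10                  # assumed: 10 KB per document
index_gb_per_million_docs = 2     # 2 GB of index per million documents

# 1 TB = 1,000,000,000 KB in decimal units
content_tb = total_docs * doc_size_kb / 1_000_000_000
index_gb = (total_docs // 1_000_000) * index_gb_per_million_docs

print(f"raw content: {content_tb:.0f} TB")
print(f"index size:  {index_gb} GB")
```

Under those assumptions the raw content comes to about 5 TB and the index to about 1000 GB, which is the scale the architecture question is really about.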

Re: Related Article question

2007-07-07 Thread Ryan Ackley
I was playing around with MoreLikeThis and I noticed the problems you are talking about as well. One idea I thought of was for MoreLikeThis to focus only on proper nouns for the purposes of similarity or give a significant boost to those. Pretty much the same idea you had in #1. I found a list o

Re: Chinese words highlighting

2007-07-07 Thread Koji Sekiguchi
One possibility I can think of is that you are using CJKAnalyzer with Lucene 2.0 or an earlier version. That combination cannot highlight CJK keywords correctly. If this is your case, try StandardAnalyzer, or upgrade to Lucene 2.1/2.2 and its CJKAnalyzer and highlighter. Also check: http://

Re: Scaling up to several machines with Lucene

2007-07-07 Thread Chun Wei Ho
Thanks for your comments and suggestions everyone :) It looks like the general trend is in favour of (2) splitting the frontend web application and the searching application. Solr looks a lot like what we would have liked, but unfortunately we finished our application a while before Solr initia

Lucene index sizes and performance

2007-07-07 Thread Chun Wei Ho
We are currently running a search service with a single Lucene index of about 10 GB. We would like to find out: (a) What is the usual index size for everyone else? How large have Lucene indexes grown in production environments, and is there an optimal size that Lucene indexes should be? (b)

Re: Lucene index sizes and performance

2007-07-07 Thread Chris Lu
Not really suggestions, but some points to consider. (a) It depends greatly on your hardware, especially hard drive speed. (b) Do you do SortBy? Each SortBy field will need an array in memory. If you do not use SortBy, reserving memory for about 10~15% of the index size will be enough. (c) Maybe try to split by index c
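
As a rough illustration of point (b), the 10~15% rule of thumb can be applied to the 10 GB index mentioned in the original question. The per-field sort array estimate below is only a sketch: the document count and the 4 bytes per entry are assumptions made for illustration, not figures from the thread, and `sort_array_mb` is a hypothetical helper.

```python
# Memory estimate based on the rule of thumb in the reply above.
index_size_gb = 10  # index size given in the original question

# Without sorting: reserve roughly 10-15% of the index size.
reserve_low_gb = index_size_gb * 10 / 100
reserve_high_gb = index_size_gb * 15 / 100


def sort_array_mb(num_docs: int, bytes_per_entry: int = 4) -> float:
    """Size of one in-memory sort array with one entry per document.

    Both arguments are assumptions for illustration; the thread gives
    neither a document count nor an entry size.
    """
    return num_docs * bytes_per_entry / 1_000_000  # decimal MB


print(f"reserve without sorting: {reserve_low_gb}-{reserve_high_gb} GB")
# Hypothetical example: 10 million documents, one sort field.
print(f"one sort field: {sort_array_mb(10_000_000)} MB")
```

Each additional sort field adds another array of that size, which is why the reply asks about sorting before giving a memory figure.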

Re: problems with deleteDocuments

2007-07-07 Thread Nadav Har'El
On Wed, Jul 04, 2007, Erick Erickson wrote about "Re: problems with deleteDocuments":
> Consider what would happen otherwise. Say you have documents
> with the following values for a field (call it blah).
> some data
> some data I put in the index
> lots of data
> data
>
> Then I don't want delet