Hi Everybody,
We are building a search infrastructure using Lucene to scale up to 500
million documents with search times under 500 ms.
Here is my rough math on the size of the content and index:
Total Documents = 500 million documents
Size / Document = 10 KB / document
Index Size / Million = 2 GB / million documen
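
For what it's worth, the figures quoted above multiply out as in the small sketch below. It assumes "10k" means 10 KB per document and uses decimal units (1 GB = 1,000,000 KB); the class and method names are made up for illustration and are not part of Lucene.

```java
// Back-of-the-envelope sizing using only the figures quoted in the post.
// Assumption: decimal units, i.e. 1 GB = 1,000,000 KB.
final class IndexSizing {

    /** Raw content size in GB for the given document count and per-document size in KB. */
    static double contentGb(long documents, double kbPerDocument) {
        return documents * kbPerDocument / 1_000_000.0;
    }

    /** Index size in GB, given an index size per million documents. */
    static double indexGb(long documents, double gbPerMillionDocs) {
        return documents / 1_000_000.0 * gbPerMillionDocs;
    }

    public static void main(String[] args) {
        long docs = 500_000_000L;
        System.out.printf("content: %.0f GB%n", contentGb(docs, 10.0));
        System.out.printf("index:   %.0f GB%n", indexGb(docs, 2.0));
    }
}
```

So on these numbers the raw content is roughly 5 TB and the index roughly 1 TB, before any replication or headroom.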
I was playing around with MoreLikeThis and I noticed the problems you
are talking about as well.
One idea I thought of was for MoreLikeThis to focus only on proper
nouns for the purposes of similarity or give a significant boost to
those. Pretty much the same idea you had in #1.
I found a list o
One possibility I can think of is that you are using CJKAnalyzer with
Lucene 2.0 or an earlier version.
That combination cannot highlight CJK keywords correctly.
If this is your case, try StandardAnalyzer, or upgrade to Lucene 2.1/2.2
along with its CJKAnalyzer and highlighter.
Also check:
http://
Thanks for your comments and suggestions everyone :)
It looks like the general trend is to be in favour of (2) splitting
the frontend web application and the searching application.
Solr looks a lot like what we would have liked, but unfortunately we
finished our application a while before Solr initia
We are currently running a search service with a single Lucene index
of about 10 GB. We would like to find out:
(a) What is the usual index size for everyone else? How large have
Lucene indexes grown in production environments, and is there some sort
of optimal size that Lucene indexes should be?
(b)
Not really suggestions, but some points to consider.
(a) It depends greatly on your hardware, especially hard drive speed.
(b) Do you use SortBy? Each SortBy field will need an array in memory.
Without SortBy, reserving memory for about 10~15% of the index size will be enough.
(c) Maybe try to split by index c
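
To put rough numbers on point (b), here is a minimal sketch. It takes a 10 GB index as the example size and assumes one fixed-width slot per document for each sort field (4 bytes, as for an int-valued field). The 10 million document count is purely hypothetical, string sort fields will need more, and the class and method names are invented for illustration.

```java
// Rough memory estimate for the rules of thumb in (b).
// Assumptions: one fixed-width slot per document per sort field;
// the document count used in main() is hypothetical.
final class SearchMemoryEstimate {

    /** Bytes needed for one sort array with a fixed-width slot per document. */
    static long sortArrayBytes(long maxDoc, int bytesPerSlot) {
        return maxDoc * bytesPerSlot;
    }

    /** Low (10%) and high (15%) ends of the no-sort rule of thumb, in bytes. */
    static long[] baselineBytes(long indexBytes) {
        return new long[] { indexBytes / 10, indexBytes * 15 / 100 };
    }

    public static void main(String[] args) {
        long indexBytes = 10L * 1024 * 1024 * 1024; // 10 GB index
        long[] range = baselineBytes(indexBytes);
        System.out.println("baseline low  (bytes): " + range[0]);
        System.out.println("baseline high (bytes): " + range[1]);

        long maxDoc = 10_000_000L; // hypothetical document count
        System.out.println("one int sort field (bytes): " + sortArrayBytes(maxDoc, 4));
    }
}
```

On those assumptions the baseline comes to about 1 to 1.5 GB, and each int-valued sort field adds about 40 MB for 10 million documents.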
On Wed, Jul 04, 2007, Erick Erickson wrote about "Re: problems with
deleteDocuments":
> Consider what would happen otherwise. Say you have documents
> with the following values for a field (call it blah).
> some data
> some data I put in the index
> lots of data
> data
>
> Then I don't want delet