Not sure if this helps you, but some of the issues you are facing seem similar to those in the "real time" search threads.

Basically, their problem involves indexing Twitter and the blogosphere, and making Lucene work for very large data sets like that.

Perhaps some of the discussion in those threads could help. I'd imagine they covered massively distributed search and other topics that might be of interest to you.

Sorry I can't be of more help than that.

Matt

tsuraan wrote:
Out of curiosity, what is the size of your corpus?  How much and how
quickly do you expect it to grow?

In terms of Lucene documents, we tend to have somewhere in the 10M-100M range.
Currently we use merging to make larger indices from smaller ones, so
a single index can have a lot of documents in it, but merging takes a
long time so I'm trying to test out just using a ton of tiny indices
to see if the search penalty from doing that is worth the time savings
from not having to build and optimize large indices.

I'm just trying to make sure that we are all on the same page here ^^

I can see the benefits of doing what you are describing with a very
large corpus that is expected to grow at a quick rate, but if that's not
really your use case, then perhaps it might be worth investigating if a
simpler solution would serve you just as well.

The indices also grow pretty quickly.  We have some cases where we get
nearly a million new documents per day.  I haven't looked at those
machines for quite a while, but I guess they'd probably have well over
a hundred million documents, and still are growing.  We also don't
have a lot of simultaneous searches yet, but that's changing, so I'm
getting concerned about how well that's being handled.  We expect that
we will soon be dealing with tens to hundreds of searches being
executed simultaneously.

In the example you provided, you are only talking about searching
against 1M documents, which I can guarantee will search with VERY good
performance in a single properly set up Lucene index.

Now if we are talking more on the order of... 100M or more documents you
may be onto something.

But in any case, there isn't currently any framework for making
multiple searches simultaneously use an index in a coordinated
fashion.  I was pretty much just planning on adding it to my tests if
such a thing existed.  Since it doesn't, I guess I'll stick to
searching in parallel and hoping that the Linux VFS layer is smart
enough to keep things fast until I have time to try putting something
together myself.
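
For what it's worth, the "search many small indices in parallel, then merge the hits" approach described above can be sketched roughly as follows. This is a minimal illustration using only the Java standard library, not Lucene's API: `Shard`, `Hit`, and `searchAll` are hypothetical names, and each `Shard` merely stands in for one small index. It also assumes scores are comparable across indices, which may not hold in practice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ShardedSearch {

    /** A scored hit; docId is only meaningful within its own shard. */
    public static final class Hit {
        public final int shard;
        public final int docId;
        public final float score;

        public Hit(int shard, int docId, float score) {
            this.shard = shard;
            this.docId = docId;
            this.score = score;
        }
    }

    /** Hypothetical stand-in for one small index (not a Lucene type). */
    public interface Shard {
        List<Hit> search(String query, int topN);
    }

    /**
     * Runs the query against every shard on the shared pool, then keeps
     * the topN hits overall, ordered by descending score.
     */
    public static List<Hit> searchAll(List<Shard> shards, String query,
                                      int topN, ExecutorService pool)
            throws InterruptedException, ExecutionException {
        List<Callable<List<Hit>>> tasks = new ArrayList<>();
        for (Shard s : shards) {
            // Each shard only needs to return its own topN candidates.
            tasks.add(() -> s.search(query, topN));
        }

        List<Hit> merged = new ArrayList<>();
        // invokeAll blocks until every per-shard search has finished.
        for (Future<List<Hit>> f : pool.invokeAll(tasks)) {
            merged.addAll(f.get());
        }

        // Assumption: raw scores from different shards are comparable.
        merged.sort((a, b) -> Float.compare(b.score, a.score));
        return new ArrayList<>(merged.subList(0, Math.min(topN, merged.size())));
    }

    public static void main(String[] args) throws Exception {
        // Two fake shards returning canned hits, for demonstration only.
        Shard a = (q, n) -> List.of(new Hit(0, 1, 0.9f), new Hit(0, 2, 0.1f));
        Shard b = (q, n) -> List.of(new Hit(1, 7, 0.5f));

        // The pool size caps how many shard searches run at the same time.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            for (Hit h : searchAll(List.of(a, b), "lucene", 2, pool)) {
                System.out.println("shard=" + h.shard + " doc=" + h.docId
                        + " score=" + h.score);
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

The pool size is the knob to tune in a sketch like this: with tens to hundreds of simultaneous queries, sharing one bounded pool keeps the number of concurrent index reads fixed instead of letting it grow with the number of indices times the number of queries.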

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012


