Out of curiosity, what is the size of your corpus? How much and how quickly do you expect it to grow?

I'm just trying to make sure that we are all on the same page here ^^

I can see the benefits of what you are describing for a very large corpus that is expected to grow at a quick rate, but if that's not really your use case, it might be worth investigating whether a simpler solution would serve you just as well.

In the example you provided, you are only talking about searching against 1M documents, which I can guarantee will search with VERY good performance in a single, properly set up Lucene index.

Now, if we are talking more on the order of 100M or more documents, you may be onto something.

Well, those are my thoughts anyhow.

Matt

tsuraan wrote:
If you did this, wouldn't you be binding the processing of the results
of all queries to that of the slowest performing one within the collection?

I would imagine it would, but I haven't seen too much variance between
Lucene query speeds in our data.

I'm guessing you are trying for some sort of performance benefit by
batch processing, but I question whether or not you will actually get
more performance by performing your queries in a threaded type
environment, and then processing their results as they come in.
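For illustration, the threaded alternative mentioned above (run the queries concurrently and handle each result as soon as it arrives) could look roughly like the sketch below. It uses only java.util.concurrent. The QueryRunner interface and the searchAll method are hypothetical stand-ins for whatever actually executes a search; they are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class AsCompletedSearch {

    /** Hypothetical stand-in for a real search call; not a Lucene API. */
    public interface QueryRunner {
        List<String> run(String query) throws Exception;
    }

    /**
     * Runs every query on a fixed-size thread pool and collects the result
     * lists in the order the queries finish, not the order they were submitted.
     */
    public static List<List<String>> searchAll(List<String> queries, QueryRunner runner, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            CompletionService<List<String>> done = new ExecutorCompletionService<>(pool);
            for (String q : queries) {
                done.submit(() -> runner.run(q));
            }
            List<List<String>> results = new ArrayList<>();
            for (int i = 0; i < queries.size(); i++) {
                // take() blocks only until the next finished query, so a slow
                // query does not delay the handling of faster ones.
                results.add(done.take().get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

With this shape, a single slow query holds up only its own result; whether that is actually faster than batching would need measuring on the real indices.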

Could you give a bit more description of what you are actually trying
to accomplish? I'm sure this list could help better if we had more
information.

What I'd like to do is build lots of small indices (a few thousand
documents per index) and put them into HDFS for search distribution.
We already have our own map-reduce framework for searching, but HDFS
seems to be a really good fit as the actual storage mechanism.

My concern is that when we have one searcher using thousands of
HDFS-backed indices, the seeking might get a bit nasty. HDFS
apparently has pretty good seeking performance, but it really looks
like it was designed for streaming, so if I could make my searches use
sequential index access, I would expect better performance than having
a ton of simultaneous searches making HDFS seek all over the place.
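One possible reading of the batching idea, as a rough sketch: collect a batch of queries, then visit each small index exactly once and run the whole batch against it before moving on to the next index. The OpenIndex and IndexOpener interfaces are hypothetical placeholders for whatever opens and searches an HDFS-backed index; they are not Lucene or HDFS APIs, and the sketch makes no claim about how the underlying reads are laid out on disk.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BatchedIndexScan {

    /** Hypothetical handle on one opened small index; not a Lucene or HDFS API. */
    interface OpenIndex extends AutoCloseable {
        List<String> search(String query) throws Exception;

        @Override
        void close() throws Exception;
    }

    /** Hypothetical opener, e.g. something that loads one index from storage. */
    interface IndexOpener {
        OpenIndex open(String indexName) throws Exception;
    }

    /**
     * Visits the indices one at a time, in the given order, and runs every
     * query in the batch against the current index before opening the next.
     * Returns the accumulated hits keyed by query (queries assumed distinct).
     */
    static Map<String, List<String>> searchBatch(
            List<String> indexNames, List<String> queries, IndexOpener opener) throws Exception {
        Map<String, List<String>> hits = new LinkedHashMap<>();
        for (String q : queries) {
            hits.put(q, new ArrayList<>());
        }
        for (String name : indexNames) {
            // Each index is opened once per batch and closed before the next.
            try (OpenIndex index = opener.open(name)) {
                for (String q : queries) {
                    hits.get(q).addAll(index.search(q));
                }
            }
        }
        return hits;
    }
}
```

The trade-off is the one raised earlier in the thread: no query in the batch is finished until the last index has been visited, so every query waits for the whole pass.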

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012


