best practice: 1.4 billion documents

Luca Rondanini Sun, 21 Nov 2010 15:33:41 -0800

Hi everybody,

I really need some good advice! I need to index something like 1.4 billion
documents in Lucene. I have experience with Lucene, but I've never worked
with such a large number of documents. Also, this is just the number of docs
at "start-up": the collection is going to grow, and fast.


I don't have to tell you that I need the system to be fast and to support
real-time updates to the documents.

The first solution that came to my mind was to use ParallelMultiSearcher,
splitting the index into many sub-indexes (how many docs per index?
100,000?), but I don't have experience with it and I don't know how well it
will scale as the number of documents grows!
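
As a rough sizing check for the sub-index question above, here is a minimal
sketch in plain Java (no Lucene calls). The class name ShardPlan, the method
names, and the alternative shard sizes are hypothetical and chosen only for
illustration; nothing here is a recommended configuration. It shows how many
sub-indexes a given docs-per-index figure implies, and one simple way to send
every update for a document to the same sub-index.

```java
public final class ShardPlan {
    private ShardPlan() {}

    /** Sub-indexes needed to hold totalDocs at docsPerShard each (ceiling division). */
    public static long shardCount(long totalDocs, long docsPerShard) {
        if (totalDocs < 0 || docsPerShard <= 0) {
            throw new IllegalArgumentException("totalDocs must be >= 0 and docsPerShard > 0");
        }
        return (totalDocs + docsPerShard - 1) / docsPerShard;
    }

    /**
     * Maps a document key to a sub-index number in [0, numShards).
     * Hash-based routing is an assumption of this sketch, not something
     * ParallelMultiSearcher does for you.
     */
    public static int shardFor(String docKey, int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be > 0");
        }
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(docKey.hashCode(), numShards);
    }

    public static void main(String[] args) {
        long totalDocs = 1_400_000_000L; // figure from the post
        // 100,000 is the size floated in the post; the others are for comparison only.
        long[] candidateSizes = {100_000L, 10_000_000L, 100_000_000L};
        for (long perShard : candidateSizes) {
            System.out.println(perShard + " docs per sub-index -> "
                    + shardCount(totalDocs, perShard) + " sub-indexes");
        }
    }
}
```

At 100,000 docs per sub-index, 1.4 billion documents means 14,000 sub-indexes
to open and search, so the per-index size largely determines how many
searchers have to be managed.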

A more solid solution seems to be building some kind of integration with
Hadoop, but I didn't find much about Lucene and Hadoop integration.

Any ideas? Which direction should I go (pure Lucene or Hadoop)?

Thanks
Luca
