We're getting up there in terms of corpus size for our Lucene indexing 
application:
* 20 million documents
* all fields need to be stored
* 10 short fields / document 
* 1 long free text field / document (analyzed with a custom shingle-based
  analyzer; rough indexing sketch below)
* 140GB total index size
* Optimized into a single segment
* Must run over NFS due to our VMware setup
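
For context, the per-document indexing looks roughly like the sketch below.
Field names are simplified placeholders, and the contrib ShingleAnalyzerWrapper
here merely stands in for our custom analyzer (Lucene 2.4-style APIs assumed):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class DocBuilder {
    // stand-in for our custom shingle-based analyzer (shingles up to 3 tokens)
    static final Analyzer ANALYZER =
        new ShingleAnalyzerWrapper(new StandardAnalyzer(), 3);

    static Document build(String id, String[] shortValues, String freeText) {
        Document doc = new Document();
        // ~10 short fields per document, all stored, norms omitted
        doc.add(new Field("id", id, Field.Store.YES,
                          Field.Index.NOT_ANALYZED_NO_NORMS));
        for (int i = 0; i < shortValues.length; i++) {
            doc.add(new Field("meta" + i, shortValues[i], Field.Store.YES,
                              Field.Index.NOT_ANALYZED_NO_NORMS));
        }
        // the one long free-text field: stored, analyzed, no term vectors
        doc.add(new Field("body", freeText, Field.Store.YES,
                          Field.Index.ANALYZED, Field.TermVector.NO));
        return doc;
    }
}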

I think I've already taken the most common steps to reduce memory requirements
and improve search-side performance, including:
* omitting norms on all fields except two
* omitting term vectors
* indexing as few fields as possible
* reusing a single searcher
* splitting the index into N shards for ParallelMultiSearcher (sketch below)
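
The search side as it stands is roughly the sketch below - a single
ParallelMultiSearcher over local shards, opened once and reused for every
query (Lucene 2.4-style APIs assumed; shard paths are placeholders):

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.store.FSDirectory;

public class SearcherHolder {
    private static ParallelMultiSearcher searcher; // opened once, shared by all threads

    public static synchronized ParallelMultiSearcher get() throws Exception {
        if (searcher == null) {
            String[] shardPaths = { "/indexes/shard0", "/indexes/shard1", "/indexes/shard2" };
            Searchable[] shards = new Searchable[shardPaths.length];
            for (int i = 0; i < shardPaths.length; i++) {
                shards[i] = new IndexSearcher(FSDirectory.getDirectory(shardPaths[i]));
            }
            searcher = new ParallelMultiSearcher(shards);
        }
        return searcher;
    }
}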

The application runs with -Xmx10g, but with anything less it bails out; it
seems happier if we give it 12GB. Searches are starting to bog down a bit
(5-10 seconds for some queries)...

Our next step is to deploy the shards as RemoteSearchables behind the same
ParallelMultiSearcher over RMI (rough sketch below) - but before I do that,
I'm curious:
* are there other ways to get that memory usage down?
* are there performance optimizations that I haven't thought of?
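
For reference, the RMI deployment I have in mind would look roughly like this
(registry port, binding names, and host names below are placeholders; Lucene
2.4-style APIs assumed). Server side, one JVM per shard:

import java.rmi.Naming;
import java.rmi.registry.LocateRegistry;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.RemoteSearchable;
import org.apache.lucene.store.FSDirectory;

public class ShardServer {
    public static void main(String[] args) throws Exception {
        IndexSearcher local = new IndexSearcher(FSDirectory.getDirectory(args[0]));
        LocateRegistry.createRegistry(1099);            // default RMI registry port
        Naming.rebind("//localhost/shard", new RemoteSearchable(local));
    }
}

Client side, federating the remote shards with the same ParallelMultiSearcher:

import java.rmi.Naming;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.Searchable;

public class FederatedSearcher {
    public static void main(String[] args) throws Exception {
        String[] hosts = { "shard-host-1", "shard-host-2" };  // placeholder host names
        Searchable[] shards = new Searchable[hosts.length];
        for (int i = 0; i < hosts.length; i++) {
            shards[i] = (Searchable) Naming.lookup("//" + hosts[i] + "/shard");
        }
        ParallelMultiSearcher searcher = new ParallelMultiSearcher(shards);
        // run queries against 'searcher' exactly as in the local setup
    }
}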

Thanks,
-Chris

