Thanks Mark, I appreciate the help. I thought our memory might be low, but I wanted to verify whether there is any way to control memory usage. I think we'll likely upgrade the memory on the machines, but that may just delay the inevitable.
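
To sanity-check the numbers, here's a rough back-of-the-envelope sketch of the String FieldCache footprint, using the example figures from your reply below (the 300M docs, 50,000 unique terms, and ~6 chars per term are assumptions for illustration, not measurements from our index):

// Rough estimate of the FieldCache cost of one String sort field.
// The figures are the example numbers from Mark's reply, not measurements.
public class FieldCacheEstimate {
    public static void main(String[] args) {
        long numDocs = 300000000L;   // ~300M docs across all indices
        long uniqueTerms = 50000L;   // assumed number of unique sort terms
        long avgTermChars = 6L;      // assumed average term length

        // One 32-bit ord per document, plus the unique term values
        // themselves (Java chars are 16 bits; object overhead ignored).
        long ordBytes = numDocs * 4L;
        long termBytes = uniqueTerms * avgTermChars * 2L;

        double mb = (ordBytes + termBytes) / (1024.0 * 1024.0);
        System.out.println("~" + mb + " MB per String sort field");
        // Prints roughly 1145 MB, matching the ~1,144.98 MB figure below.
    }
}

With a couple of String sort fields per searcher, that alone gets close to our 4 GB heap, which lines up with the OutOfMemory errors we're seeing.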
Wondering if anyone else has encountered similar issues with indices of a similar size. I've been thinking we will need to move to a clustered solution and have been reading up on Hadoop, Nutch, Solr & Terracotta as possibilities for things like index sharding. Has anyone implemented a solution using Hadoop or Terracotta for a large-scale system? Just wondering about the pros and cons of the various approaches.

Thanks,

Todd

On Wed, Oct 29, 2008 at 6:07 PM, Mark Miller <[EMAIL PROTECTED]> wrote:
> The term/TermInfo IndexReader internals are probably on the low end
> compared to the size of your field caches (needed for sorting). If you are
> sorting by String, I think the space needed is 32 bits x number of docs plus
> an array to hold all of the unique terms. So taking 300 million docs (I know
> you are actually breaking it up smaller than that, but for example),
> ignoring things like String chars being variable byte lengths, storing the
> length, etc., and randomly picking 50,000 unique terms at 6 chars each:
>
> 32 bits x 300,000,000 + 50,000 x 6 x 16 bits = ~1,144.98 megabytes
>
> That's per field you're sorting on. If you are sorting on an int field it
> should be closer to 32 bits x num docs; for shorts, 16 bits x num docs, etc.
>
> So you have those field caches, plus the IndexReader's TermInfo and term
> data, plus whatever RAM your app needs beyond Lucene. 4 GB might just not
> *quite* cut it, is my guess.
>
> Todd Benge wrote:
>>
>> There are usually only a couple of sort fields and a bunch of terms in the
>> various indices. The terms are user-entered on various media, so the
>> number of terms is very large.
>>
>> Thanks for the help.
>>
>> Todd
>>
>> On 10/29/08, Todd Benge <[EMAIL PROTECTED]> wrote:
>>>
>>> Hi,
>>>
>>> I'm the lead engineer for search on a large website that uses Lucene for
>>> search.
>>>
>>> We're indexing about 300M documents in ~100 indices. The indices add
>>> up to ~60 GB.
>>>
>>> The indices are split across 4 different MultiSearchers, with the
>>> largest handling ~50 GB.
>>>
>>> The code is basically like the following:
>>>
>>> private static MultiSearcher searcher;
>>>
>>> public void init(File[] files) throws IOException {
>>>     IndexSearcher[] searchers = new IndexSearcher[files.length];
>>>     int i = 0;
>>>     for (File file : files) {
>>>         searchers[i++] = new IndexSearcher(FSDirectory.getDirectory(file));
>>>     }
>>>     searcher = new MultiSearcher(searchers);
>>> }
>>>
>>> public Searcher getSearcher() {
>>>     return searcher;
>>> }
>>>
>>> We're seeing a high cache rate with Term & TermInfo in Lucene 2.4.
>>> Performance is good, but servers are consistently hanging with
>>> OutOfMemory errors.
>>>
>>> We're allocating 4 GB of heap to each server.
>>>
>>> Is there any way to control the amount of memory Lucene consumes for
>>> caching? Any other suggestions on fixing the memory errors?
>>>
>>> Thanks,
>>>
>>> Todd
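
Mark - to make sure I follow the int-field suggestion, here's a minimal sketch of what I think you mean (Lucene 2.4; the "uploadDate" field name is hypothetical and assumes the field is indexed un-tokenized with values that parse as ints):

import java.io.File;
import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopFieldDocs;
import org.apache.lucene.store.FSDirectory;

// Sketch of sorting on an int field rather than a String field, so the
// FieldCache holds one int per doc instead of ords plus unique terms.
public class IntSortSketch {
    public static void main(String[] args) throws IOException {
        IndexSearcher searcher =
            new IndexSearcher(FSDirectory.getDirectory(new File(args[0])));

        // "uploadDate" is a made-up field name; it must contain
        // un-analyzed values that parse as ints.
        Sort sort = new Sort(new SortField("uploadDate", SortField.INT));
        TopFieldDocs docs = searcher.search(
                new TermQuery(new Term("title", "lucene")), null, 10, sort);

        System.out.println("hits: " + docs.totalHits);
        searcher.close();
    }
}

If that's right, the cache for that field should be roughly 32 bits x num docs, with no per-term String storage on top.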