On Sat, Dec 3, 2011 at 6:36 PM, Shawn Heisey <s...@elyograg.org> wrote:
> On 12/3/2011 2:25 PM, Ted Dunning wrote:
>
>> Things have changed since I last did this sort of thing seriously. My
>> guess is that this is a relatively small amount of memory to devote to
>> search. It used to be that the only way to do this effectively with Lucene
>> based systems was to keep the heap relatively small like you have here and
>> put the index into a tmpfs mount. I think better ways are now available
>> which would keep the index in memory in the search engine itself for better
>> speed. One customer that we have now has search engines with 128GB of
>> memory. He fills much of that with live index sharded about 10-fold.
>> In-memory indexes can run enough faster to be more cost effective than disk
>> based indexes because you need so many fewer machines to run the searches
>> in the required response time.
>>
>
> My servers (two for each chain, a total of four) are at their maximum
> memory size of 64GB. They have two quad-core Xeon processors (E54xx
> series) in them that are not hyperthreaded. With 8GB given to Solr, there
> is approximately 55GB available for the disk cache, which is smaller than
> the size of the three large indexes (20GB each) on each server, and the
> indexes are constantly getting bigger. I don't think in-memory indexes is
> an option for me.

Read the papers I referred to. They describe how to search a fairly
enormous corpus with an 8GB in-memory index (and no disk cache at all).

> I have 16 processor cores available for each index chain (two servers). If
> I set aside one for the distributed search itself and one for the
> incremental (that small 3.5 to 7 day shard), it sounds like my ideal
> numShards from Solr's perspective is 14. I have some fear that my database
> server will fall over under the load of 14 DB connections during a full
> index rebuild, though. Do you have any other thoughts for me?
>

Off-line indexing from a flat-file dump?
My guess is that you can dump to disk from the db faster than you can index, and a single dumping thread might be faster than many.
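
A minimal sketch of the flat-file idea, under stated assumptions: one connection streams the whole table to a tab-separated file, and the file is then split into one piece per shard so that each indexer reads only its own file and never touches the database. Everything named here is hypothetical and for illustration only: sqlite3 stands in for whatever database driver is actually in use, and `dump_table`, `split_dump`, `shard_for`, and the crc32-on-document-id partitioning are not taken from Solr or from the setup described above. The time-based incremental shard would need its own rule.

```python
import csv
import sqlite3  # stand-in driver; the real database is an assumption
import zlib


def dump_table(conn, query, path):
    """Stream the query results to a tab-separated file using one connection."""
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        for row in conn.execute(query):
            writer.writerow(row)
            count += 1
    return count


def shard_for(doc_id, num_shards):
    """Stable shard number for a document id (crc32 is stable across runs)."""
    return zlib.crc32(str(doc_id).encode("utf-8")) % num_shards


def split_dump(path, num_shards):
    """Split the dump into one file per shard, keyed on the first column."""
    shard_paths = [f"{path}.shard{i}" for i in range(num_shards)]
    files = [open(p, "w", newline="", encoding="utf-8") for p in shard_paths]
    try:
        writers = [csv.writer(f, delimiter="\t") for f in files]
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.reader(f, delimiter="\t"):
                writers[shard_for(row[0], num_shards)].writerow(row)
    finally:
        for f in files:
            f.close()
    return shard_paths
```

With the 14 shards discussed above, this would be one `dump_table(conn, "SELECT ...", "dump.tsv")` call followed by `split_dump("dump.tsv", 14)`, after which the 14 indexers can run in parallel against local files while the database sees a single connection.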