Re: Micro-Sharding

Ted Dunning Sat, 03 Dec 2011 23:42:03 -0800

On Sat, Dec 3, 2011 at 6:36 PM, Shawn Heisey <s...@elyograg.org> wrote:


> On 12/3/2011 2:25 PM, Ted Dunning wrote:
>
>> Things have changed since I last did this sort of thing seriously. My
>> guess is that this is a relatively small amount of memory to devote to
>> search. It used to be that the only way to do this effectively with Lucene
>> based systems was to keep the heap relatively small like you have here and
>> put the index into a tmpfs mount. I think better ways are now available
>> which would keep the index in memory in the search engine itself for better
>> speed. One customer that we have now has search engines with 128GB of
>> memory. He fills much of that with live index sharded about 10-fold.
>> In-memory indexes can run enough faster to be more cost effective than disk
>> based indexes because you need so many fewer machines to run the searches
>> in the required response time.
>>
>
> My servers (two for each chain, a total of four) are at their maximum
> memory size of 64GB.  They have two quad-core Xeon processors (E54xx
> series) in them that are not hyperthreaded.  With 8GB given to Solr, there
> is approximately 55GB available for the disk cache, which is smaller than
> the size of the three large indexes (20GB each) on each server, and the
> indexes are constantly getting bigger.  I don't think in-memory indexes are
> an option for me.


Read the papers I referred to.  They describe how to search a fairly enormous
corpus with an 8GB in-memory index (and no disk cache at all).

> I have 16 processor cores available for each index chain (two servers).  If
> I set aside one for the distributed search itself and one for the
> incremental (that small 3.5 to 7 day shard), it sounds like my ideal
> numShards from Solr's perspective is 14.  I have some fear that my database
> server will fall over under the load of 14 DB connections during a full
> index rebuild, though.  Do you have any other thoughts for me?
>

Off-line indexing from a flat-file dump?  My guess is that you can dump to
disk from the db faster than you can index and a single dumping thread
might be faster than many.
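
A minimal sketch of the dump-then-index idea, not a description of any actual setup in this thread. It assumes a hypothetical `docs(id, body)` table, uses an in-memory SQLite database as a stand-in for the real DB server, and picks a CRC32 hash of the id as the partitioning rule; all function names are illustrative. One connection streams the table to a flat file, the file is then split into 14 per-shard files, and the indexers would read those files instead of opening 14 DB connections.

```python
import csv
import os
import sqlite3
import tempfile
import zlib

NUM_SHARDS = 14  # the numShards figure from the quoted message


def dump_table(conn, path):
    """Stream every row to a tab-separated file over a single connection."""
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, delimiter="\t")
        # Hypothetical schema: docs(id, body).
        for row in conn.execute("SELECT id, body FROM docs ORDER BY id"):
            writer.writerow(row)
            count += 1
    return count


def split_dump(path, out_dir, num_shards=NUM_SHARDS):
    """Partition the dump into one file per shard using a stable hash of the id."""
    paths = [os.path.join(out_dir, f"shard-{i:02d}.tsv") for i in range(num_shards)]
    files = [open(p, "w", newline="", encoding="utf-8") for p in paths]
    try:
        writers = [csv.writer(f, delimiter="\t") for f in files]
        with open(path, newline="", encoding="utf-8") as src:
            for row in csv.reader(src, delimiter="\t"):
                # Assumed partitioning rule: CRC32 of the id, modulo shard count.
                shard = zlib.crc32(row[0].encode("utf-8")) % num_shards
                writers[shard].writerow(row)
    finally:
        for f in files:
            f.close()
    return paths


def demo(num_docs=1000):
    """Build a toy table, dump it once, split it, and count rows per shard."""
    conn = sqlite3.connect(":memory:")  # stand-in for the real database
    conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany(
        "INSERT INTO docs VALUES (?, ?)",
        ((i, f"document {i}") for i in range(num_docs)),
    )
    with tempfile.TemporaryDirectory() as tmp:
        dump_path = os.path.join(tmp, "dump.tsv")
        dumped = dump_table(conn, dump_path)
        per_shard = []
        for p in split_dump(dump_path, tmp):
            with open(p, newline="", encoding="utf-8") as f:
                per_shard.append(sum(1 for _ in csv.reader(f, delimiter="\t")))
    conn.close()
    return dumped, per_shard


if __name__ == "__main__":
    dumped, per_shard = demo()
    print(dumped, per_shard)
```

With this layout the database sees one sequential read during a full rebuild, and each indexing process only needs its own shard file. Whether a single dump really beats parallel reads depends on the database and would need to be measured.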
