On Sat, Jan 30, 2010 at 1:15 PM, Marvin Humphrey <[email protected]> wrote:
> On Sat, Jan 30, 2010 at 12:11:41PM -0800, Nathan Kurz wrote:
>
>> The window where this choice is beneficial is small: something like
>> 32-bit systems using 2-4 Gig indexes with multiple sortable fields
>> with unique values. Unless this is the use case that Eventful needs,
>
> Well, actually... yes, it is.
Then you should do it! As long as you are designing it around a real
need, it will probably be a good design choice.

> Indexes can actually grow larger than 2-4 GB on such systems and still
> maintain top performance. Because 32-bit operating systems can exploit the
> full RAM on a machine and use it for system IO cache, you can have indexes
> over 4 GB that stay fully RAM-resident.

Definitely right, but I'm most interested in cases that allow searching
for full quotes, hence no stop words. In my mind, once you can't map in
positions for the word 'the', you're done.

The obvious answer to this is that it's segment size, rather than index
size, that matters here. But isn't this true of sort caches as well?
They don't cross segments, do they?

> The problem with running out of address space is that there's no warning
> before catastrophic failure, and then no possibility of recovery short of
> rearchitecting your search infrastructure or installing a new operating
> system. It's a really serious glitch to hit. It would suck if Eventful hit
> it, but I really don't want anybody else to hit it either.

OK, but you can pretty well catch this at index creation time, can't
you? And even failing at run time with a clear error ("mmap failed: too
large to map") might be preferable to the sticky morass of a steeply
declining performance curve once you start to swap.

> I should specify that the extra calls to mmap() and munmap() occur on 32-bit
> systems only. For 64-bit systems, we mmap() the whole compound file the
> instant it gets opened, and InStream_Buf() is just a thin wrapper around some
> pointer math.

I had not realized that. This softens my position considerably. I'm all
for improving legacy performance so long as it doesn't complicate the
mainline architecture.

>> Sure, these systems will exist, but solve the problem in a way that
>> benefits everyone: shard it!
>
> Well, that sort of sharding is not within the scope of Lucy itself. It's a
> Solr-level solution.
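For what it's worth, the "clear error" approach is easy to implement: on a 32-bit system, mmap() of a file that exceeds the available address space fails immediately with MAP_FAILED and an errno (typically ENOMEM), so a wrapper can report it loudly instead of letting the process degrade. A minimal POSIX sketch, where map_whole_file() is a hypothetical helper for illustration and not Lucy's actual InStream API:

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map an entire file read-only, failing loudly
 * rather than degrading.  On 32-bit systems, a file larger than the
 * free address space makes mmap() return MAP_FAILED here, which we
 * turn into an explicit diagnostic. */
static void *map_whole_file(const char *path, size_t *len_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return NULL; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return NULL; }

    void *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after close() */
    if (base == MAP_FAILED) {
        fprintf(stderr, "mmap failed: %s\n", strerror(errno));
        return NULL;
    }
    *len_out = (size_t)st.st_size;
    return base;
}
```

On a 64-bit system this is essentially the "map the whole compound file at open" strategy described above, after which reads are just pointer math into `base`.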
Remind me again: what's the difference between multiple segments and
sequential sharding? And if you take that world-view, what stops you
from processing segments in parallel rather than sequentially? :)

Yes, you probably don't want to do all the cross-machine process
management, but designing the architecture so that it's possible to
aggregate and sort results from multiple queries seems well within
bounds.

Nathan Kurz
[email protected]
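[Editorial postscript: the aggregation step described above reduces to merging independently score-sorted hit lists from each shard or segment. A hedged sketch in C, where Hit and merge_hits are illustrative names rather than Lucy API, merging two descending-by-score result lists into a top-k list:]

```c
/* Illustrative types, not Lucy's actual structs. */
typedef struct {
    int   doc_id;
    float score;
} Hit;

/* Merge two hit lists, each already sorted by descending score (as a
 * per-shard search would produce), into dest.  Keeps at most k hits
 * and returns how many were written. */
static int merge_hits(const Hit *a, int na, const Hit *b, int nb,
                      Hit *dest, int k) {
    int i = 0, j = 0, n = 0;
    while (n < k && (i < na || j < nb)) {
        if (j >= nb || (i < na && a[i].score >= b[j].score)) {
            dest[n++] = a[i++];
        } else {
            dest[n++] = b[j++];
        }
    }
    return n;
}
```

With N shards the same idea generalizes to an N-way merge (e.g. a small heap keyed on score), which is exactly the aggregation a Solr-style distributed search layer performs over per-shard top-k lists.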
