On Sun, Sep 18, 2011 at 8:44 PM, Nathan Kurz <[email protected]> wrote:
> On Sat, Sep 17, 2011 at 12:47 PM, Marvin Humphrey
> <[email protected]> wrote:
>> On Sat, Sep 17, 2011 at 08:52:41AM +0200, goran kent wrote:
>>> I've been wondering (and I'll eventually get around to performing a
>>> comparative test sometime this weekend) about IO and search
>>> performance (i.e., ignore OS caching).
>
> As Marvin pointed out, while it's fine to ask what happens when you
> ignore OS caching, realize that OS caching is crucial to Lucy's
> performance.  We don't do our own caching; rather, we rely on the OS
> to do it for us.  A clear understanding of your operating system's
> virtual memory system will be very helpful in figuring out
> bottlenecks.
>
> If you're not already intimate with these details, this article is a
> good start: http://duartes.org/gustavo/blog/category/internals
>
>>> What's the biggest cause of search degradation when Lucy is chugging
>>> through its on-disk index?
>>>
>>> Physically *finding* data (i.e., seeking and thrashing around the
>>> disk), or waiting for data to *transfer* from the disk to the CPU?
>
> This is going to depend on your exact use case.  I think you can
> assume that all accesses that can be sequential off the disk will be,
> or can easily be made sequential by consolidating the index.  Thus if
> you are searching for a small number of common words coming from text
> documents, the search time will depend primarily on bulk transfer
> speed from your disk.  If, on the other hand, each query is for a
> list of hundreds of rare part numbers, seek time will dominate and an
> SSD might help a lot.
>
> As for the earlier question of rerunning queries until adequate
> coverage is achieved, this probably isn't as inefficient as you'd
> guess.  If you presume you'll be reading a bunch of data from disk,
> then once you have the data in the OS cache, running another 5
> queries probably doesn't even double your search time.
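That last point is easy to demonstrate.  Here's a rough, Linux-oriented
sketch of my own (not Lucy code) that times a read of a scratch file
after asking the kernel to evict it from the page cache, then times a
repeat read served from RAM:

```python
# Rough sketch of the page-cache effect: the first read after evicting
# a file from cache has to go to disk; repeats are served from RAM.
# Linux-oriented: posix_fadvise(DONTNEED) is a hint to drop cached
# pages, not a guarantee (and it has no effect if /tmp is tmpfs).
import os
import tempfile
import time

SIZE = 64 * 1024 * 1024  # 64 MB stand-in for an index file

def timed_read(path):
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(1 << 20):  # read in 1 MB chunks
            pass
    return time.perf_counter() - start

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(SIZE))
    f.flush()
    os.fsync(f.fileno())  # make the pages clean so they can be dropped
    path = f.name

fd = os.open(path, os.O_RDONLY)
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)  # evict cached pages
os.close(fd)

cold = timed_read(path)  # (mostly) disk-bound
warm = timed_read(path)  # page-cache hit
print(f"cold: {cold:.4f}s  warm: {warm:.4f}s")
os.unlink(path)
```

On a spinning disk the warm read is typically an order of magnitude
faster, which is why five more queries over the same postings data cost
far less than five times the first one.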
> Unless your index is so large that you can't even fit a single query
> into RAM, in which case you've got other problems.
>
>
>> Well, the projects I've been involved with have taken the approach
>> that there should always be enough RAM on the box to fit the
>> necessary index files.  "RAM is the new disk", as they say.
>>
>> I can tell you that once an index is in RAM, we're CPU bound.
>
> While it's probably technically true that we're CPU bound, I think the
> way to improve performance is not by shaving cycles but by figuring
> out better ways to take advantage of memory locality.  Currently we do
> a pretty good job of avoiding disk access.  Eventually we'll start
> doing better at avoiding RAM access, so we can do more operations in
> L3 cache.
>
> Sticking with Gustavo:
> http://duartes.org/gustavo/blog/post/what-your-computer-does-while-you-wait
>
> Ulrich's paper is a great intro as well:
> http://people.redhat.com/drepper/cpumemory.pdf
>
>>> I'm quite interested to know whether using an SSD, where seek time
>>> and other latency issues are almost zero, would dramatically improve
>>> search times.  I've seen vast improvements when using them in
>>> RDBMSs, but this may not translate as well here.
>>
>> I would speculate that with SSDs you'd get a more graceful
>> performance degradation as Lucy's RAM requirements start to exceed
>> what the box can provide.  But I have no numbers to back that up.
>
> It will depend on a lot of factors.  My instinct is that SSDs will
> help but won't be cost effective.  I think that you'd be better off
> spending all your budget on a motherboard that supports a lot of RAM
> (which probably means a dual Xeon:
> http://www.supermicro.com/products/motherboard/Xeon1333), as much ECC
> RAM as you can afford (144GB = $4k; 288GB = $10k), and then a cheap
> RAID of big, fast spinning disks.
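To make the locality point concrete, here's a toy benchmark of my own
(pure-Python overhead mutes the effect, but it's usually still
visible): summing the same array in sequential order versus a shuffled
order, where the shuffled walk defeats the CPU caches and the hardware
prefetcher.

```python
# Toy illustration of memory locality: the same 4M additions, but a
# shuffled visiting order defeats the CPU caches and prefetcher.
import array
import random
import time

N = 4_000_000
data = array.array("d", range(N))  # ~32 MB, far bigger than L3
seq_idx = list(range(N))
rand_idx = list(range(N))
random.shuffle(rand_idx)

def timed_sum(indices):
    start = time.perf_counter()
    total = 0.0
    for i in indices:
        total += data[i]
    return total, time.perf_counter() - start

s_seq, t_seq = timed_sum(seq_idx)
s_rand, t_rand = timed_sum(rand_idx)
print(f"sequential: {t_seq:.3f}s  shuffled: {t_rand:.3f}s")
```

Same work, different order, measurably different wall time.  The same
instinct is behind keeping postings contiguous on disk: access order
matters long before raw cycle counts do.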
>
> I don't know these guys, but they might be useful for quick ballpark
> prices: http://www.abmx.com/dual-xeon-server
>
>> My index is way too large to fit into RAM - yes, it's split across a
>> cluster, but there are physical space and cost constraints, so the
>> cluster cannot get much larger.  That's my reality, unfortunately.
>>
>> Hence my emphasis on IO and ways to address that with alternate tech
>> such as SSDs.
>
> Goran: you'll probably get better advice if you offer more details on
> these constraints, your anticipated usage, a real estimate of corpus
> size, and your best guess as to usage patterns.  "Way too large" can
> mean many things to many people.  There may be a sweet spot between
> 1TB and 10TB where an SSD RAID might make sense, but below that I
> think you're better off with RAM, and above that you're probably
> getting unwieldy.  Numbers would help.
The existing index size is about 2.2TB (using another system), which
will shrink to about 1.5TB after re-indexing with Lucy.  The ideal, of
course, is to match that with RAM, so we'll see.

> Equally, "physical space and cost constraints" have a lot of wiggle
> room.  Are you dumpster-diving for 286s, or trying to avoid
> custom-made motherboards?  Do you have only a single rack to work
> with, or are you trying to make something that can be worn as a wrist
> watch? :)
>
> Nathan Kurz
> [email protected]
>
> ps. One other comment that I haven't seen made: Lucy is optimized
> for a 64-bit OS.  Most of the development and testing has been done
> on Linux.  Thus if you are performance-obsessed, trying to run at a
> large scale, and want something that works out of the box, you
> probably want to be running on 64-bit Linux at this point.

Yes, this is the case.
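For anyone following along who wants to double-check their own boxes, a
couple of lines of Python will do it (this reports the interpreter
build's pointer width, which in practice matches the OS mode):

```python
# Quick sanity check that you're running 64-bit Linux: pointer width
# of the Python build, plus the kernel/architecture names.
import platform
import struct

print(platform.system(), platform.machine())  # e.g. Linux x86_64
print(struct.calcsize("P") * 8, "bit")        # 64 on a 64-bit build
```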
