On Sun, Sep 18, 2011 at 8:44 PM, Nathan Kurz <[email protected]> wrote:
> On Sat, Sep 17, 2011 at 12:47 PM, Marvin Humphrey
> <[email protected]> wrote:
>> On Sat, Sep 17, 2011 at 08:52:41AM +0200, goran kent wrote:
>>> I've been wondering (and I'll eventually get around to performing a
>>> comparative test sometime this weekend) about IO and search
>>> performance (i.e., ignore OS caching).
>
> As Marvin pointed out, while it's fine to ask what happens when you
> ignore OS caching, realize that OS caching is crucial to Lucy's
> performance.  We don't do our own caching; rather, we rely on the OS
> to do it for us.  A clear understanding of your operating system's
> virtual memory system will be very helpful in figuring out
> bottlenecks.
>
> If you're not already intimate with these details, this article is a
> good start: http://duartes.org/gustavo/blog/category/internals
>
>>> What's the biggest cause of search degradation when Lucy is chugging
>>> through its on-disk index?
>>>
>>> Physically *finding* data (i.e., seeking and thrashing around the
>>> disk), or waiting for data to *transfer* from the disk to the CPU?
>
> This is going to depend on your exact use case.  I think you can
> assume that all accesses that can be sequential off the disk will be,
> or can easily be made sequential by consolidating the index.  Thus if
> you are searching for a small number of common words coming from text
> documents, the search time will depend primarily on bulk transfer
> speed from your disk.  If, on the other hand, each query is for a
> list of hundreds of rare part numbers, seek time will dominate and an
> SSD might help a lot.
>
> As for the earlier question of rerunning queries until adequate
> coverage is achieved, this probably isn't as inefficient as you'd
> guess.  If you presume you'll be reading a bunch of data from disk,
> then once you have the data in the OS cache, running another 5
> queries probably doesn't even double your search time.
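That last point is easy to demonstrate.  Here's a rough, Linux-oriented
sketch of my own (not Lucy code) that times a read of a scratch file
after asking the kernel to evict it from the page cache, then times a
repeat read served from RAM:

```python
# Rough sketch of the page-cache effect: the first read after evicting
# a file from cache has to go to disk; repeats are served from RAM.
# Linux-oriented: posix_fadvise(DONTNEED) is a hint to drop cached
# pages, not a guarantee (and it has no effect if /tmp is tmpfs).
import os
import tempfile
import time

SIZE = 64 * 1024 * 1024  # 64 MB stand-in for an index file

def timed_read(path):
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(1 << 20):  # read in 1 MB chunks
            pass
    return time.perf_counter() - start

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(SIZE))
    f.flush()
    os.fsync(f.fileno())  # make the pages clean so they can be dropped
    path = f.name

fd = os.open(path, os.O_RDONLY)
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)  # evict cached pages
os.close(fd)

cold = timed_read(path)  # (mostly) disk-bound
warm = timed_read(path)  # page-cache hit
print(f"cold: {cold:.4f}s  warm: {warm:.4f}s")
os.unlink(path)
```

On a spinning disk the warm read is typically an order of magnitude
faster, which is why five more queries over the same postings data cost
far less than five times the first one.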
> Unless your index is so large that you can't even fit a single query
> into RAM, in which case you've got other problems.
>
>
>> Well, the projects I've been involved with have taken the approach
>> that there should always be enough RAM on the box to fit the
>> necessary index files.  "RAM is the new disk", as they say.
>>
>> I can tell you that once an index is in RAM, we're CPU bound.
>
> While it's probably technically true that we're CPU bound, I think the
> way to improve performance is not by shaving cycles but by figuring
> out better ways to take advantage of memory locality.  Currently we do
> a pretty good job of avoiding disk access.  Eventually we'll start
> doing better at avoiding RAM access, so we can do more operations in
> L3 cache.
>
> Sticking with Gustavo:
> http://duartes.org/gustavo/blog/post/what-your-computer-does-while-you-wait
>
> Ulrich's paper is a great intro as well:
> http://people.redhat.com/drepper/cpumemory.pdf
>
>>> I'm quite interested to know whether using an SSD, where seek time
>>> and other latency issues are almost zero, would dramatically improve
>>> search times.  I've seen vast improvements when using them in
>>> RDBMSs, but this may not translate as well here.
>>
>> I would speculate that with SSDs you'd get a more graceful
>> performance degradation as Lucy's RAM requirements start to exceed
>> what the box can provide.  But I have no numbers to back that up.
>
> It will depend on a lot of factors.  My instinct is that SSDs will
> help but won't be cost effective.  I think that you'd be better off
> spending all your budget on a motherboard that supports a lot of RAM
> (which probably means a dual Xeon:
> http://www.supermicro.com/products/motherboard/Xeon1333), as much ECC
> RAM as you can afford (144GB = $4k; 288GB = $10k), and then a cheap
> RAID of big, fast spinning disks.
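To make the locality point concrete, here's a toy benchmark of my own
(pure-Python overhead mutes the effect, but it's usually still
visible): summing the same array in sequential order versus a shuffled
order, where the shuffled walk defeats the CPU caches and the hardware
prefetcher.

```python
# Toy illustration of memory locality: the same 4M additions, but a
# shuffled visiting order defeats the CPU caches and prefetcher.
import array
import random
import time

N = 4_000_000
data = array.array("d", range(N))  # ~32 MB, far bigger than L3
seq_idx = list(range(N))
rand_idx = list(range(N))
random.shuffle(rand_idx)

def timed_sum(indices):
    start = time.perf_counter()
    total = 0.0
    for i in indices:
        total += data[i]
    return total, time.perf_counter() - start

s_seq, t_seq = timed_sum(seq_idx)
s_rand, t_rand = timed_sum(rand_idx)
print(f"sequential: {t_seq:.3f}s  shuffled: {t_rand:.3f}s")
```

Same work, different order, measurably different wall time.  The same
instinct is behind keeping postings contiguous on disk: access order
matters long before raw cycle counts do.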
>
> I don't know these guys, but they might be useful for quick ballpark
> prices: http://www.abmx.com/dual-xeon-server
>
>> My index is way too large to fit into RAM - yes, it's split across a
>> cluster, but there are physical space and cost constraints, so the
>> cluster cannot get much larger.  That's my reality, unfortunately.
>>
>> Hence my emphasis on IO and ways to address that with alternate tech
>> such as SSDs.
>
> Goran: you'll probably get better advice if you offer more details on
> these constraints, your anticipated usage, a real estimate of corpus
> size, and your best guess as to usage patterns.  "Way too large" can
> mean many things to many people.  There may be a sweet spot between
> 1TB and 10TB where an SSD RAID might make sense, but below that I
> think you're better off with RAM, and above that you're probably
> getting unwieldy.  Numbers would help.
The existing index size is about 2.2TB (using another system), which
will shrink to about 1.5TB after re-indexing with Lucy.  The ideal, of
course, is to match that with RAM, so we'll see.

> Equally, "physical space and cost constraints" have a lot of wiggle
> room.  Are you dumpster-diving for 286s, or trying to avoid
> custom-made motherboards?  Do you have only a single rack to work
> with, or are you trying to make something that can be worn as a wrist
> watch? :)
>
> Nathan Kurz
> [email protected]
>
> ps. One other comment that I haven't seen made: Lucy is optimized
> for a 64-bit OS.  Most of the development and testing has been done
> on Linux.  Thus if you are performance-obsessed, trying to run at a
> large scale, and want something that works out of the box, you
> probably want to be running on 64-bit Linux at this point.

Yes, this is the case.
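For anyone following along who wants to double-check their own boxes, a
couple of lines of Python will do it (this reports the interpreter
build's pointer width, which in practice matches the OS mode):

```python
# Quick sanity check that you're running 64-bit Linux: pointer width
# of the Python build, plus the kernel/architecture names.
import platform
import struct

print(platform.system(), platform.machine())  # e.g. Linux x86_64
print(struct.calcsize("P") * 8, "bit")        # 64 on a 64-bit build
```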
