I have improved date-sorted search performance pretty dramatically by replacing the two-step "search then sort" operation with a one-step "use the date as the score" algorithm. The main gotcha was making sure not to affect which results get counted as hits in boolean searches, but overall I only spent about a week on the project and got a 60x speed improvement on the target set (from minutes to seconds). YMMV, however, since my app requires collecting the complete set of results for analysis.
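In sketch form, the trick is to keep the original boolean query for matching and let a numeric publishedDate doc value supply the score, so the normal score-descending collection already comes back newest-first. Something like the following, written against a recent Lucene FunctionScoreQuery API rather than my actual code; the field name, the date encoding (epoch days), and the index path are just placeholders:

import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queries.function.FunctionScoreQuery;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DoubleValuesSource;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class DateAsScoreSearch {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("/path/to/index")))) {   // placeholder path
            IndexSearcher searcher = new IndexSearcher(reader);

            // The boolean match query stays exactly as before, so the hit set is unchanged.
            Query match = new BooleanQuery.Builder()
                    .add(new TermQuery(new Term("title", "iphone")), BooleanClause.Occur.SHOULD)
                    .add(new TermQuery(new Term("description", "iphone")), BooleanClause.Occur.SHOULD)
                    .build();

            // Replace the relevance score with the publishedDate doc value (a
            // NumericDocValuesField holding e.g. epoch days, which fits safely in a
            // float score). The default score-descending collection then returns the
            // newest documents first, with no separate sort pass over the result set.
            Query dateAsScore = new FunctionScoreQuery(
                    match, DoubleValuesSource.fromLongField("publishedDate"));

            TopDocs top = searcher.search(dateAsScore, 10);
            System.out.println("total hits: " + top.totalHits);
        }
    }
}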
- andy g

On Mon, Jun 29, 2009 at 12:47 AM, Marcus Herou <marcus.he...@tailsweep.com> wrote:
> Thanks for the answer.
>
> Don't you think that part 1 of the email would give you a hint of the
> nature of the index?
>
> Index size (and growing): 16G x 8 = 128G
> Doc size (data): 20k
> Num docs: 90M
> Num users: a few hundred, but most critical is the admin staff, which is
> using the index all day long.
> Query types: Example: title:"Iphone" OR description:"Iphone" sorted by
> publishedDate... = Very simple, no fuzzy searches etc. However, since the
> dataset is large, it will consume memory on sorting, I guess.
>
> Could one not draw any conclusions about best practice in terms of hardware
> given the above "specs"?
>
> Basically I would like to know if I really need 8 cores, since machines with
> dual-CPU support are the most expensive and I would not like to throw away
> money, so getting it right is a matter of economy.
>
> I mean it is very simple: let's say someone gives me a budget of 50,000 USD
> and I then want to get the most bang for the buck for my workload.
> Should I go for
> X machines with quad-core 3.0GHz, 4 disks RAID1+0, 8G RAM, costing me 1200 USD
> a piece (giving me 40 machines: 160 disks, 160 cores, 320G RAM)
> or
> X machines with dual quad-core 2.0GHz, 4 disks RAID1+0, 36G RAM, costing me
> 3400 USD a piece (giving me 15 machines: 60 disks, 120 cores, 540G RAM)?
>
> Basically I would like to know what factors make the workload IO-bound vs
> CPU-bound.
>
> //Marcus
>
> On Mon, Jun 29, 2009 at 8:53 AM, Eric Bowman <ebow...@boboco.ie> wrote:
>
> > There is no single answer -- this is always application specific.
> >
> > Without knowing anything about what you are doing:
> >
> > 1. Disk I/O is probably the most critical. Go SSD or even RAM disk if
> > you can, if performance is absolutely critical.
> > 2. Sometimes CPU can become an issue, but 8 cores is probably enough
> > unless you are doing especially CPU-bound searches.
> >
> > Unless you are doing something with hard performance requirements, or
> > really quite unusual, buying "good" kit is probably good enough, and you
> > won't really know for sure until you measure. Lucene is a general
> > enough tool that there isn't a terribly universal answer to this. We
> > were a bit surprised to end up CPU-bound instead of disk I/O-bound, for
> > instance, but we ended up taking an unusual path. YMMV.
> >
> > Marcus Herou wrote:
> > > Hi. I think I need to be more specific.
> > >
> > > What I am trying to find out is whether I should aim for:
> > >
> > > CPU (2x4 cores, 2.0-3.0GHz)? Or perhaps just 4 cores is enough.
> > > Fast disk IO: 8 disks, RAID1+0? Or perhaps 2 disks is enough...
> > > RAM - if the index does not fit into RAM, how much RAM should I then buy?
> > >
> > > Please, any hints would be appreciated, since I am going to invest soon.
> > >
> > > //Marcus
> > >
> > > On Sat, Jun 27, 2009 at 12:00 AM, Marcus Herou
> > > <marcus.he...@tailsweep.com> wrote:
> > >
> > >> Hi.
> > >>
> > >> I currently have an index which is 16GB per machine (8 machines = 128GB)
> > >> (data is stored externally, not in the index), and it is growing like
> > >> crazy (we are indexing blogs, which is crazy by nature). I have only
> > >> allocated 2GB per machine to the Lucene app, since we are running some
> > >> other stuff there in parallel.
> > >>
> > >> Each doc should be roughly the size of a blog post, no more than 20k.
> > >>
> > >> We currently have about 90M documents, and it is increasing rapidly, so
> > >> getting into the G+ document range is not going to be too far away.
> > >>
> > >> Now, due to search performance, I think I need to move these instances
> > >> to dedicated index/search machines (or index on some machines and search
> > >> on others). Anyway, I would like to get some feedback about two things:
> > >>
> > >> 1. What is the most important hardware aspect when it comes to adding
> > >> documents to the index and optimizing it?
> > >> 1.1 Is it disk I/O write throughput? (sequential or random I/O?)
> > >> 1.2 Is it RAM?
> > >> 1.3 Is it CPU?
> > >>
> > >> My guess would be disk I/O. Right, wrong?
> > >>
> > >> 2. What is the most important hardware aspect when it comes to searching
> > >> documents in my setup? (the result set is limited to return only the top
> > >> 10 matches, with page handling)
> > >> 2.1 Is it disk read throughput? (sequential or random I/O?)
> > >> 2.2 Is it RAM?
> > >> 2.3 Is it CPU?
> > >>
> > >> I have no clue, since the data might not fit into memory. What is then
> > >> the most important factor? Read performance while scanning the index?
> > >> CPU while comparing fields and collecting results?
> > >>
> > >> What I'm trying to find out is what I can do to get the most bang for
> > >> the buck with a limited (aren't we all limited?) budget.
> > >>
> > >> Kindly
> > >>
> > >> //Marcus
> > >>
> > >> --
> > >> Marcus Herou CTO and co-founder Tailsweep AB
> > >> +46702561312
> > >> marcus.he...@tailsweep.com
> > >> http://www.tailsweep.com/
> > >
> > >
> >
> > --
> > Eric Bowman
> > Boboco Ltd
> > ebow...@boboco.ie
> > http://www.boboco.ie/ebowman/pubkey.pgp
> > +35318394189 / +353872801532
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.he...@tailsweep.com
> http://www.tailsweep.com/
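For comparison, the conventional two-step pattern described in the quoted thread (run the boolean query, then sort the hits by publishedDate) looks roughly like the sketch below in a recent Lucene; the field name, date encoding, and index path are assumptions, not code from this thread:

import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopFieldDocs;
import org.apache.lucene.store.FSDirectory;

public class SortByDateSearch {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("/path/to/index")))) {   // placeholder path
            IndexSearcher searcher = new IndexSearcher(reader);

            // title:"Iphone" OR description:"Iphone" (terms lowercased by the analyzer)
            Query query = new BooleanQuery.Builder()
                    .add(new TermQuery(new Term("title", "iphone")), BooleanClause.Occur.SHOULD)
                    .add(new TermQuery(new Term("description", "iphone")), BooleanClause.Occur.SHOULD)
                    .build();

            // Sort by publishedDate, newest first. The field must carry doc values
            // (e.g. a NumericDocValuesField holding the date as a long); the sort
            // consults those values for the hits it compares, which is where the
            // extra memory and time go when the result set is large.
            Sort byDate = new Sort(new SortField("publishedDate", SortField.Type.LONG, true));

            TopFieldDocs top = searcher.search(query, 10, byDate);
            System.out.println("total hits: " + top.totalHits);
        }
    }
}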