Hi, early adopter here.
I'm considering Lucy for a new project (and I must say, the docs are nice and it's Perl/C, which is always welcome in this day and age).

So... I gather from the mailing list that it's production ready, but officially API-unstable. Does API-unstable mean the index format may change any time soon, e.g. before the first stable release?

The environment is distributed search across a cluster, with the intent of keeping search time sub-second, 3s at most (folks are spoilt by the elephant in the industry, so they lose interest if the page does not return in that time). I see from the docs that distributed search is supported, else it would be a non-starter.

Ranking
-------
I need to sort results based on a floating point value (actually several). I see Lucy supports this. By how much does custom sorting impact search performance?

What about term proximity in documents? Will a matching document rank higher than another if two (or however many are being searched for) terms are physically located closer together? Or is ranking based only on a term count, ignoring positional info? What if the matching terms are physically closer to the TOP of the document?

Does Lucy consider the relative importance of the search terms themselves? For example, searching for [a b c d] would imply that those terms' importance declines from left to right, with 'a' being the most important, etc. I think there was a Page/Brin paper on this somewhere on the 'tubes.

Phrase searches
---------------
I see this is supported. Hard to quantify, I know, but by what factor is phrase searching slower than an equivalent term search?

Spelling suggestions
--------------------
I may have missed this one in the docs: does Lucy support suggested spelling (à la Google)? One could always use a dictionary, but it would be nice if Lucy built up a dictionary based on the terms encountered during indexing.

Merging/optimization
--------------------
Merging multiple indexes into larger ones is supported. I see there is also an 'optimize' for faster searching; can one update an index with newer pages after such an optimization, or is it a one-way street?

Index checking/verification
---------------------------
In a cluster environment all kinds of things go wrong on a weekly basis. When this happens during indexing or merging, indexes can be left in a broken state, leading to problems in batch processing. Does Lucy have an index verifier (à la fsck) to scan an index and report errors (not fix, just check and report)?

Which version?
--------------
With index format stability being important, which version should I consider using? 0.2.x incubating, or trunk?

Language/binding
----------------
I see Perl can be used during indexing/searching; how about PHP on the search side? Presumably PHP bindings (search-related ones at least) are on the horizon/done? Not that important, just wondering.

Scale
-----
Anyone using Lucy on a sizeable index split across nodes in a cluster? By sizeable I mean > 1-2TB. If so, how are your search times (yes, I know, it depends on caching/memory/IO/CPUs/#nodes)?

Any comments would be appreciated.

g
