On Thu, Sep 8, 2011 at 6:53 AM, goran kent <[email protected]> wrote:
> Hi,
>
> Early-adopter here.
>
> I'm considering Lucy for a new project (and I must say, the docs are
> nice and it's Perl/C which is always welcome in this day and age).
Hi Goran, and welcome to Lucy!

I'll jump in to give you some preliminary answers. I'm familiar with the code base, but out of touch with some of the recent development, so I should be able to give you answers that are basically right, though I might mangle some important details due to recent changes. I trust that someone else will jump in to correct those parts soon enough.

> So,... I gather from the mailing list that it's production ready, but
> officially API-unstable. Does API-unstable mean the index format may
> change any time soon, eg, before the first stable release?

No -- I'm not sure what the official claim is, but it's unlikely that the index format will change enough to hurt you. Rather than having a single monolithic index format, Lucy allows multiple formats to coexist. The index format should be stable through release, and due to the class structure, even if the mainline changes, it should be possible to have back compatibility for as long as you need it.

> The environment is distributed search across a cluster with the intent
> of keeping search-time sub-second - 3s at most (folks are spoilt by
> the elephant in the industry, so they lose interest if the page does
> not return in that time).
>
> I see from the docs that distributed search is supported, else it
> would be a non-starter.

This excites me too, but I don't know that anyone is pushing its limits yet. Architecturally, though, I think it's well designed to allow really fast clusters of in-RAM search. Talking about 3 seconds makes it sound like you're willing to hit disk: you might need some intense tuning here, depending on how you deal with really common stopwords. Also, there are some limitations with custom sort ordering and the like: clusters are going to deal better with floating point than with alphabetical ordering, for example, and excerpts might be a little clunky to retrieve. Currently it's just a doc ID and a score that get returned efficiently.
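For concreteness, the cluster wiring looks roughly like this in the Perl bindings. Treat it as an untested sketch: the LucyX::Remote constructor details have shifted between releases (some want a password, and the port has moved between new() and serve()), and the host names, port, and paths here are invented.

```perl
# --- on each shard node (sketch; API details vary by release) ---
use Lucy::Search::IndexSearcher;
use LucyX::Remote::SearchServer;

my $searcher = Lucy::Search::IndexSearcher->new( index => '/path/to/shard' );
my $server   = LucyX::Remote::SearchServer->new(
    searcher => $searcher,
    port     => 7890,      # in some releases, passed to serve() instead
);
$server->serve;

# --- on the front-end box ---
use Lucy::Search::PolySearcher;
use LucyX::Remote::SearchClient;

# Same Schema the shards were built with.
my @clients = map {
    LucyX::Remote::SearchClient->new(
        schema       => $schema,
        peer_address => "$_:7890",
    )
} qw( node1 node2 node3 );    # invented host names

my $poly = Lucy::Search::PolySearcher->new(
    schema    => $schema,
    searchers => \@clients,
);
my $hits = $poly->hits( query => $query, num_wanted => 10 );
```

Each shard node serves its own IndexSearcher, and the PolySearcher on the front end merges the per-shard hits as if they came from a single index -- which is where the "doc ID and score" limitation I mentioned shows up.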
> Ranking
> -------
> I need to sort results based on a floating point value (actually
> several). I see Lucy supports this. By how much does custom sorting
> impact search performance?

It depends on the complexity of your algorithm, but probably minimally. If you are touching disk, that will dominate everything. If not, you will be memory-I/O bound but have lots of excess CPU available to crunch the numbers.

> What about term proximity in documents? Will a matching document rank
> higher than another if two (or whatever being searched for) terms are
> physically located closer together? Or is ranking based only on a
> term count ignoring positional info?
>
> What if the matching terms are physically closer to the TOP of the document?

It doesn't do this particularly well currently: TF/IDF scoring is pretty much the be-all and end-all of the built-in ranking. But adding or changing to the approach you mention has long been a design goal, should be supported by the architecture, and is just waiting for the right person with the right itch. You'll probably have to write your own scoring algorithms, but they should slot in quite well once you understand the way Lucy works.

> Does Lucy consider the relative importance of the search terms
> themselves? For example, searching for [a b c d] would imply that
> those terms' importance declines from left to right, with 'a' being
> the most important, etc. I think there was a Page/Brin paper on this
> somewhere on the 'tubes.

Not by default. It does support term weighting, though, and it's very easy to change the way that Queries are produced from the input text. In general, creating custom scorers is involved and will require a fairly deep understanding of the architecture, but changing the way that Queries are parsed is easy and self-contained. It might be a good first project with Lucy.

> Phrase searches
> ---------------
> I see this is supported.
> Hard to quantify, I know, but by what factor
> is phrase-searching slower than an equivalent term search?

Hard to quantify, but fast. Probably even "really fast", since it short-circuits efficiently. Depending on how common the terms are, searching for "A B" as a phrase is probably 25% more efficient than searching for (A B) as independent terms. Adding proximity other than "next word" would slow things considerably, but I wouldn't be afraid of pushing phrases hard.

> Spelling suggestions
> --------------------
> I may have missed this one in the docs: does Lucy support suggested
> spelling (a-la Google). One could always use a dictionary, but it
> would be nice if Lucy built up a dictionary based on the terms
> encountered during indexing.

The dictionary exists, but I don't think there is a good standard API for accessing it for this purpose. This is a case where the warning about API changes might matter. But it's easy enough to access that once you have a good way to do it, you might be able to propose it as part of the standard. Or maybe this has already been done?

> Merging/optimization
> --------------------
> Merging multiple indexes into larger ones is supported. I see there
> is also an 'optimize' for faster searching; can one update an index
> with newer pages after such an optimization, or is it a one-way
> street?

I think currently everything can still be updated. Lucy makes it easy to treat multiple index segments as one, and then to later merge them to actually be one. Adding another segment after the merge for new data should be fine, and I think you can rinse and repeat as desired.

> Index checking/verification
> ---------------------------
> In a cluster environment all kinds of things go wrong on a weekly
> basis - when this happens during indexing or merging indexes can be
> left in a broken state leading to problems in batch processing. Does
> Lucy have an index-verifier (a-la fsck) to scan an index and report
> errors (not fix, just check and report)?
I don't think there is a built-in, but I'm pretty sure it exists at the testing level. It should be easy to rig up what you want out of the parts if you don't find it in the box.

> Which version?
> --------------
> With index format stability being important, which version should I
> consider using? 0.2.x incubating, or trunk?

Ask Marvin to be sure, but I'd suggest trunk. I don't think the index format is going to have any major upheavals before release, and there may be some things you do that are worth contributing back to the main line. This project is really quite conservative, and rarely are there API changes that affect anyone who sticks to the official interface.

> Language/binding
> ----------------
> I see Perl can be used during indexing/searching, how about PHP on the
> search side? Presumably PHP bindings (for search-related bindings at
> least) are on the horizon/done? Not that important, just wondering.

On the horizon, and waiting for someone with PHP/C integration experience to take it on. It's on a par with the plans for Ruby and Python support: anticipated, with no major hurdles foreseen, but it will take considerable work by someone familiar with both Lucy and the language in question.

> Scale
> -----
> Anyone using Lucy on a sizeable index split across nodes in a cluster?
> By sizeable I mean > 1-2TB. If so, how's your search times (yes, I
> know, it depends on caching/memory/IO/CPUs/#nodes)?

I've thought it through a few times, but haven't actually run anything. My guess is that 1-2TB is easily achievable. At this level, I think you could probably get sub-second response times by just putting the full index on a single machine with 128G of RAM and some SSDs, then replicating that setup as needed until you have the throughput you need. Or build a cluster with 2-4TB of RAM shared between the nodes, never touch disk, and the cluster should fly.
Lucy is designed to be very memory efficient, so if you can come anywhere close to fitting your index in RAM (whether on one machine or many) you should be as fast as (faster than?) anything out there. It's when you get closer to the PB scale, when you are trying to trade off seconds of disk-transfer latency against hundreds of requests per second, that you'll really need to get creative.

Now for someone to step in and correct my gaffes,

Nathan Kurz
[email protected]
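P.S. Since sorting on a floating point field was your lead question, here's roughly what that looks like in the Perl bindings. Untested sketch: 'rank' is an invented field name, and it would need to be declared sortable in your Schema for this to work.

```perl
use Lucy::Search::IndexSearcher;
use Lucy::Search::SortSpec;
use Lucy::Search::SortRule;

my $searcher  = Lucy::Search::IndexSearcher->new( index => '/path/to/index' );
my $sort_spec = Lucy::Search::SortSpec->new(
    rules => [
        # Primary: descending on the numeric 'rank' field.
        Lucy::Search::SortRule->new( field => 'rank', reverse => 1 ),
        # Tie-break on the usual relevance score.
        Lucy::Search::SortRule->new( type => 'score' ),
    ],
);
my $hits = $searcher->hits(
    query     => 'some query',
    sort_spec => $sort_spec,
);
```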
