On Thu, Sep 8, 2011 at 6:53 AM, goran kent <[email protected]> wrote:
> Hi,
>
> Early-adopter here.
>
> I'm considering Lucy for a new project (and I must say, the docs are
> nice and it's Perl/C which is always welcome in this day and age).
Hi Goran, and welcome to Lucy!

I'll jump in to give you some preliminary answers. I'm familiar with the code base, but out of touch with some of the recent development, so I should be able to give you answers that are basically right, though I might mangle some important details due to recent changes. I trust that someone else will jump in to correct those parts soon enough.

> So,... I gather from the mailing list that it's production ready, but
> officially API-unstable. Does API-unstable mean the index format may
> change any time soon, eg, before the first stable release?

No -- I'm not sure what the official claim is, but it's unlikely that the index format will change enough to hurt you. Rather than having a single monolithic index format, Lucy allows multiple formats to coexist. The index format should be stable through release, and due to the class structure, even if the mainline changes, it should be possible to have back compatibility for as long as you need it.

> The environment is distributed search across a cluster with the intent
> of keeping search-time sub-second - 3s at most (folks are spoilt by
> the elephant in the industry, so they lose interest if the page does
> not return in that time).
>
> I see from the docs that distributed search is supported, else it
> would be a non-starter.

This excites me too, but I don't know that anyone is pushing its limits yet. Architecturally, though, I think it's well designed to allow really fast clusters of in-RAM search. Talking about 3 seconds makes it sound like you're willing to hit disk: you might need some intense tuning here, depending on how you deal with really common stopwords. Also, there are some limitations with custom sort ordering and the like: clusters are going to deal better with floating point than with alphabetical ordering, for example, and excerpts might be a little clunky to retrieve. Currently it's just a doc ID and a score that get returned efficiently.
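For concreteness, the cluster wiring looks roughly like this in the Perl bindings. Treat it as an untested sketch: the LucyX::Remote constructor details have shifted between releases (some want a password, and the port has moved between new() and serve()), and the host names, port, and paths here are invented.

```perl
# --- on each shard node (sketch; API details vary by release) ---
use Lucy::Search::IndexSearcher;
use LucyX::Remote::SearchServer;

my $searcher = Lucy::Search::IndexSearcher->new( index => '/path/to/shard' );
my $server   = LucyX::Remote::SearchServer->new(
    searcher => $searcher,
    port     => 7890,      # in some releases, passed to serve() instead
);
$server->serve;

# --- on the front-end box ---
use Lucy::Search::PolySearcher;
use LucyX::Remote::SearchClient;

# Same Schema the shards were built with.
my @clients = map {
    LucyX::Remote::SearchClient->new(
        schema       => $schema,
        peer_address => "$_:7890",
    )
} qw( node1 node2 node3 );    # invented host names

my $poly = Lucy::Search::PolySearcher->new(
    schema    => $schema,
    searchers => \@clients,
);
my $hits = $poly->hits( query => $query, num_wanted => 10 );
```

Each shard node serves its own IndexSearcher, and the PolySearcher on the front end merges the per-shard hits as if they came from a single index -- which is where the "doc ID and score" limitation I mentioned shows up.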
> Ranking
> -------
> I need to sort results based on a floating point value (actually
> several). I see Lucy supports this. By how much does custom sorting
> impact search performance?

It depends on the complexity of your algorithm, but probably minimally. If you are touching disk, that will dominate everything. If not, you will be memory-I/O bound but have lots of excess CPU available to crunch the numbers.

> What about term proximity in documents? Will a matching document rank
> higher than another if two (or whatever being searched for) terms are
> physically located closer together? Or is ranking based only on a
> term count ignoring positional info?
>
> What if the matching terms are physically closer to the TOP of the document?

It doesn't do this particularly well currently: TF/IDF scoring is pretty much the be-all and end-all of the built-in ranking. But adding or changing to the approach you mention has long been a design goal, should be supported by the architecture, and is just waiting for the right person with the right itch. You'll probably have to write your own scoring algorithms, but they should slot in quite well once you understand the way Lucy works.

> Does Lucy consider the relative importance of the search terms
> themselves? For example, searching for [a b c d] would imply that
> those terms' importance declines from left to right, with 'a' being
> the most important, etc. I think there was a Page/Brin paper on this
> somewhere on the 'tubes.

Not by default. It does support term weighting, though, and it's very easy to change the way that Queries are produced from the input text. In general, creating custom scorers is involved and will require a fairly deep understanding of the architecture, but changing the way that Queries are parsed is easy and self-contained. It might be a good first project with Lucy.

> Phrase searches
> ---------------
> I see this is supported.
> Hard to quantify, I know, but by what factor
> is phrase-searching slower than an equivalent term search?

Hard to quantify, but fast. Probably even "really fast", since it short-circuits efficiently. Depending on how common the terms are, searching for "A B" as a phrase is probably 25% more efficient than searching for (A B) as independent terms. Adding proximity other than "next word" would slow things considerably, but I wouldn't be afraid of pushing phrases hard.

> Spelling suggestions
> --------------------
> I may have missed this one in the docs: does Lucy support suggested
> spelling (a-la Google). One could always use a dictionary, but it
> would be nice if Lucy built up a dictionary based on the terms
> encountered during indexing.

The dictionary exists, but I don't think there is a good standard API for accessing it for this purpose. This is a case where the warning about API changes might matter. But it's easy enough to access that once you have a good way to do it, you might be able to propose it as part of the standard. Or maybe this has already been done?

> Merging/optimization
> --------------------
> Merging multiple indexes into larger ones is supported. I see there
> is also an 'optimize' for faster searching; can one update an index
> with newer pages after such an optimization, or is it a one-way
> street?

I think currently everything can still be updated. Lucy makes it easy to treat multiple index segments as one, and then to later merge them to actually be one. Adding another segment after the merge for new data should be fine, and I think you can rinse and repeat as desired.

> Index checking/verification
> ---------------------------
> In a cluster environment all kinds of things go wrong on a weekly
> basis - when this happens during indexing or merging indexes can be
> left in a broken state leading to problems in batch processing. Does
> Lucy have an index-verifier (a-la fsck) to scan an index and report
> errors (not fix, just check and report)?
I don't think there is a built-in, but I'm pretty sure it exists at the testing level. It should be easy to rig up what you want out of the parts if you don't find it in the box.

> Which version?
> --------------
> With index format stability being important, which version should I
> consider using? 0.2.x incubating, or trunk?

Ask Marvin to be sure, but I'd suggest trunk. I don't think the index format is going to have any major upheavals before release, and there may be some things you do that are worth contributing back to the main line. This project is really quite conservative, and rarely are there API changes that affect anyone who sticks to the official interface.

> Language/binding
> ----------------
> I see Perl can be used during indexing/searching, how about PHP on the
> search side? Presumably PHP bindings (for search-related bindings at
> least) are on the horizon/done? Not that important, just wondering.

On the horizon, and waiting for someone with PHP/C integration experience to take it on. It's on a par with the plans for Ruby and Python support: anticipated, with no major hurdles foreseen, but it will take considerable work by someone familiar with both Lucy and the language in question.

> Scale
> -----
> Anyone using Lucy on a sizeable index split across nodes in a cluster?
> By sizeable I mean > 1-2TB. If so, how's your search times (yes, I
> know, it depends on caching/memory/IO/CPUs/#nodes)?

I've thought it through a few times, but haven't actually run anything. My guess is that 1-2TB is easily achievable. At this level, I think you could probably get sub-second response times by just putting the full index on a single machine with 128G of RAM and some SSDs, then replicating that setup as needed until you have the throughput you need. Or build a cluster with 2-4TB of RAM shared between the nodes, never touch disk, and the cluster should fly.
Lucy is designed to be very memory efficient, so if you can come anywhere close to fitting your index in RAM (whether on one machine or many) you should be as fast as (faster than?) anything out there. It's when you get closer to the PB scale, when you are trying to trade off seconds of disk-transfer latency against hundreds of requests per second, that you'll really need to get creative.

Now for someone to step in and correct my gaffes,

Nathan Kurz
[email protected]
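P.S. Since sorting on a floating point field was your lead question, here's roughly what that looks like in the Perl bindings. Untested sketch: 'rank' is an invented field name, and it would need to be declared sortable in your Schema for this to work.

```perl
use Lucy::Search::IndexSearcher;
use Lucy::Search::SortSpec;
use Lucy::Search::SortRule;

my $searcher  = Lucy::Search::IndexSearcher->new( index => '/path/to/index' );
my $sort_spec = Lucy::Search::SortSpec->new(
    rules => [
        # Primary: descending on the numeric 'rank' field.
        Lucy::Search::SortRule->new( field => 'rank', reverse => 1 ),
        # Tie-break on the usual relevance score.
        Lucy::Search::SortRule->new( type => 'score' ),
    ],
);
my $hits = $searcher->hits(
    query     => 'some query',
    sort_spec => $sort_spec,
);
```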
