Re: [PROPOSAL] index server project

Doug Cutting Mon, 30 Oct 2006 15:54:17 -0800

Yonik Seeley wrote:

On 10/18/06, Doug Cutting <[EMAIL PROTECTED]> wrote:

We assume that, within an index, a file with a given name is written
only once.


Is this necessary, and will we need the lockless patch (that avoids
renaming or rewriting *any* files), or is Lucene's current index
behavior sufficient?

It's not strictly required, but it would make index synchronization a lot simpler. Yes, I was assuming the lockless patch would be committed to Lucene before this project gets very far. Something more than that would be required in order to keep old versions, but this could be as simple as a Directory subclass that refuses to remove files for a time.
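
A minimal sketch of the "refuses to remove files for a time" idea, in plain Java rather than against the real Lucene Directory API. The class name DeferredDeleter, its methods, and the injectable clock are all hypothetical; a Directory subclass could call requestDelete() where it would normally remove a file, and a background task could really delete whatever drainExpired() returns.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical helper: keeps "deleted" files around for a grace period. */
class DeferredDeleter {
    private final long graceMillis;
    private final LongSupplier clock; // injectable clock (assumption, eases testing)
    // file name -> time at which its deletion was first requested
    private final Map<String, Long> pending = new HashMap<>();

    DeferredDeleter(long graceMillis, LongSupplier clock) {
        this.graceMillis = graceMillis;
        this.clock = clock;
    }

    /** Record the request instead of removing the file immediately. */
    synchronized void requestDelete(String name) {
        pending.putIfAbsent(name, clock.getAsLong());
    }

    /** Return (and forget) the files whose grace period has elapsed. */
    synchronized List<String> drainExpired() {
        long now = clock.getAsLong();
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (now - e.getValue() >= graceMillis) {
                expired.add(e.getKey());
                it.remove();
            }
        }
        return expired;
    }

    /** True while the file is scheduled for deletion but still kept. */
    synchronized boolean isPending(String name) {
        return pending.containsKey(name);
    }
}
```

The grace period would need to be at least as long as the slowest replica takes to copy an old version.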

The search side seems straightforward enough, but I haven't totally
figured out how the update side should work.

The master should be out of the loop as much as possible. One approach is that clients randomly assign documents to indexes and send the updates directly to the indexing node. Alternately, clients might index locally, then ship the updates to a node packaged as an index. That was the intent of the addIndex method.

One potential problem is a document overwrite implemented as a delete
then an add.
More than one client doing this for the same document could result in
0 or 2 documents, instead of 1.  I guess clients will just need to be
relatively coordinated in their activities.

Good point. Either the two clients must coordinate, to make sure that they're not updating the same document at the same time, or use a strategy where updates are routed to the slave that contained the old version of the document. That would require a broadcast query to figure out which slave that is.

It's unfortunate the master needs to be involved on every document add.

That should not normally be the case. Clients can cache the set of writable index locations and directly submit new documents without involving the master.

If deletes were broadcast, and documents could go to any partition,
that would be one way around it (with the downside of a less powerful
master that could implement certain distribution policies).
Another way to lessen the master-in-the-middle cost is to make sure
one can aggregate small requests:
   IndexLocation[] getUpdateableIndex(String[] id);

I'd assumed that the updateable version of an index does not move around very often. Perhaps a lease mechanism is required. For example, a call to getUpdateableIndex might be valid for ten minutes.
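
A rough sketch of how a client might honor such a lease, assuming the master's answer is wrapped in a Supplier. The class LeasedValue and its constructor parameters are hypothetical; only the getUpdateableIndex name and the ten-minute figure come from this thread.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical client-side cache: asks the master again only after the lease expires. */
class LeasedValue<T> {
    private final Supplier<T> master;  // stands in for a call such as getUpdateableIndex
    private final long leaseMillis;    // e.g. ten minutes
    private final LongSupplier clock;  // injectable clock (assumption, eases testing)
    private T value;
    private long expiresAt;
    private boolean loaded = false;

    LeasedValue(Supplier<T> master, long leaseMillis, LongSupplier clock) {
        this.master = master;
        this.leaseMillis = leaseMillis;
        this.clock = clock;
    }

    /** Return the cached value, contacting the master only on first use or after expiry. */
    synchronized T get() {
        long now = clock.getAsLong();
        if (!loaded || now >= expiresAt) {
            value = master.get();
            expiresAt = now + leaseMillis;
            loaded = true;
        }
        return value;
    }
}
```

With a lease like this, the master is contacted at most once per lease period per client, regardless of how many documents that client adds.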

We might consider a delete() on the master interface too. That way it
could

 1) hide the delete policy (broadcast or direct-to-server-that-has-doc)
 2) potentially do some batching of deletes
 3) simply do the delete locally if there is a single index partition
and this is a combination master/searcher

I'm reluctant to put any frequently-made call on the master. I'd prefer to keep the master only involved at an executive level, with all per-document and per-query traffic going directly from client to slave.

It seems like the master might want to be involved in commits too, or
maybe we just rely on the slave-to-master heartbeat to kick off
immediately after a commit so that index replication can be initiated?

I like the latter approach. New versions are only published as frequently as clients poll the master for updated IndexLocations. Clients keep a cache of both readable and updatable index locations that are periodically refreshed.

I was not imagining a real-time system, where the next query after a document is added would always include that document. Is that a requirement? That's harder.

At this point I'm mostly trying to see if this functionality would meet the needs of Solr, Nutch and others.

Must we include a notion of document identity and/or document version in the mechanism? Would that facilitate updates and coherency?

In Nutch a typical case is that you have a bunch of URLs with content that may or may not have been previously indexed. The approach I'm currently leaning towards is that we'd broadcast the deletions of all of these to all slaves, then add them to randomly assigned indexes. In Nutch multiple clients would naturally be coordinated, since each URL is represented only once in each update cycle.
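
The broadcast-delete-then-random-add cycle could be sketched as follows. This is a toy model only: each slave index is represented by an in-memory Set of URLs, and the class UpdateCycle and its methods are hypothetical names, not part of any proposed interface.

```java
import java.util.List;
import java.util.Random;
import java.util.Set;

/** Hypothetical toy model of one update cycle over several slave indexes. */
class UpdateCycle {
    /** Broadcast deletions of all urls to every slave, then add each url to one random slave. */
    static void apply(List<Set<String>> slaves, List<String> urls, Random random) {
        for (Set<String> slave : slaves) {
            slave.removeAll(urls); // delete is broadcast: any old copy disappears
        }
        for (String url : urls) {
            slaves.get(random.nextInt(slaves.size())).add(url); // random assignment
        }
    }

    /** Count how many slaves currently hold the given url. */
    static int copies(List<Set<String>> slaves, String url) {
        int n = 0;
        for (Set<String> slave : slaves) {
            if (slave.contains(url)) {
                n++;
            }
        }
        return n;
    }
}
```

As long as each URL appears only once per cycle and cycles do not overlap, every URL ends up in exactly one index, wherever its old version lived.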


Doug
