RE: Solr updateRequestHandler and performance vs. atomicity

karl.wright Mon, 24 May 2010 15:37:27 -0700

The reason for this is simple.  LCF keeps track of which documents it has 
handed off to Solr, and has a fairly involved mechanism for making sure that 
every document LCF *thinks* got there actually did.  It even uses a mechanism 
akin to a two-phase commit to make sure that its internal records and those of 
the downstream index are never out of sync.
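
A minimal sketch of the kind of bookkeeping described above, not LCF's actual 
code: the names `DeliveryLedger`, `deliver`, and `needs_retry`, and the `send` 
callback, are hypothetical.  Each document is recorded as pending before it is 
sent and marked confirmed only after the index acknowledges it, so anything 
left pending can be retried on the next run.

```python
from enum import Enum


class State(Enum):
    PENDING = "pending"      # recorded locally, not yet acknowledged downstream
    CONFIRMED = "confirmed"  # downstream index acknowledged receipt


class DeliveryLedger:
    """Hypothetical stand-in for the crawler's internal delivery records."""

    def __init__(self):
        self.state = {}

    def deliver(self, doc_id, send):
        # Phase 1: record the intent before anything leaves the crawler.
        self.state[doc_id] = State.PENDING
        # Phase 2: flip to CONFIRMED only after the index acknowledges.
        if send(doc_id):
            self.state[doc_id] = State.CONFIRMED

    def needs_retry(self):
        # Anything still PENDING is a candidate for the next job run.
        return sorted(d for d, s in self.state.items() if s is State.PENDING)
```

The scheme is only as strong as the acknowledgment: if `send` reports success 
for a document that the index later loses, the ledger still says CONFIRMED and 
nothing triggers a retry.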

Now, along comes Solr, and the system loses a good deal of its resilience, 
because there is a chance that somebody or something will kick Solr (a restart, 
for example) after a document (or a set of documents) has been transmitted to 
it but before it has been committed.  LCF will have no awareness of this 
situation at all, and will thus never try to fix the problem on the next job 
run (or whatever).  So instead of automatic resilience, you are left with one 
of two workarounds:

(1) Manual intervention.  Somebody has to manually inform LCF of the Solr 
hiccup, and LCF then has to invalidate all documents it ever sent to Solr 
(because it doesn't know which documents could have been affected).
(2) A Solr commit on every post.  This slows down LCF significantly, because 
each document post takes something like 10x as long.
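
To put the 10x figure in perspective, here is a back-of-the-envelope 
calculation.  The document count and the 50 ms per-post time are made-up 
numbers for illustration; only the 10x slowdown comes from the estimate above.

```python
def total_post_ms(num_docs, ms_per_post, commit_slowdown=1):
    """Estimated milliseconds spent posting num_docs documents one at a time."""
    return num_docs * ms_per_post * commit_slowdown


# Assumed inputs, for illustration only: 100,000 documents at 50 ms per post.
without_commit_ms = total_post_ms(100_000, 50)
with_commit_ms = total_post_ms(100_000, 50, commit_slowdown=10)

MS_PER_HOUR = 3_600_000
print(without_commit_ms / MS_PER_HOUR, "hours without a commit per post")
print(with_commit_ms / MS_PER_HOUR, "hours with a commit per post")
```

With these assumed inputs, a crawl that would spend about 1.4 hours posting 
spends about 13.9 hours instead.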

Does this help?
Karl

-----Original Message-----
From: ext Mark Miller [mailto:markrmil...@gmail.com] 
Sent: Monday, May 24, 2010 4:40 PM
To: dev@lucene.apache.org
Subject: Re: Solr updateRequestHandler and performance vs. atomicity

Indexing a doc won't be as fast as raw disk IO. But you won't be doing 
just raw disk IO to guarantee acceptance. And that will have a cost and 
complexity that really makes me wonder if it's worth the speed advantage. 
For very large documents with complex analyzers...perhaps. But it's not 
going to be an easily implementable feature (if it's a true guarantee). 
And it's still got to involve logs and/or fsync and all that.
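
As a rough illustration of what "logs and/or fsync" implies, here is a minimal 
sketch of an append-only acceptance log.  It is not how Solr or Lucene handle 
updates; the function names, the JSON-lines record format, and the log path are 
assumptions made for the example.  A document counts as accepted only once 
`os.fsync` has returned, and the log can be replayed later to do the actual 
indexing.

```python
import json
import os


def accept_document(log_path, doc_id, text):
    """Append one document to the log and return only after it is synced."""
    # json.dumps escapes newlines, so each record stays on a single line.
    record = json.dumps({"id": doc_id, "text": text})
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(record + "\n")
        log.flush()             # push Python's buffer to the OS
        os.fsync(log.fileno())  # ask the OS to push it to the storage device


def replay(log_path):
    """Yield (doc_id, text) pairs in arrival order, e.g. at commit time."""
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            record = json.loads(line)
            yield record["id"], record["text"]
```

The fsync on every call is where the cost comes from: each accepted document 
waits on the disk, which is the overhead being weighed against a full commit.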

The reasoning for this is not ringing a bell - can you elaborate on the 
motivations?

Is this so that you can commit on every doc? Every few docs?

I can def see how this would be desirable in general, but just to be 
clear on your motivations.

- Mark

On 5/24/10 10:03 PM, karl.wri...@nokia.com wrote:
> Hi Mark,
>
> Unfortunately, indexing performance *is* of concern, otherwise I'd already be 
> committing on every post.
>
> If your guess is correct, you are basically saying that adding a document to 
> an index in Solr/Lucene is just as fast as writing that file directly to the 
> disk.  Because, obviously, if we want guaranteed delivery, that's what we'd 
> have to do.  But I think this is worth the experiment - Solr/Lucene may be 
> fast, but I have doubts that it can perform as well as raw disk I/O and still 
> manage to do anything in the way of document analysis or (heaven forbid) text 
> extraction.
>
>
>
> -----Original Message-----
> From: ext Mark Miller [mailto:markrmil...@gmail.com]
> Sent: Monday, May 24, 2010 3:33 PM
> To: dev@lucene.apache.org
> Subject: Re: Solr updateRequestHandler and performance vs. atomicity
>
> On 5/24/10 3:10 PM, karl.wri...@nokia.com wrote:
>> Hi all,
>> It seems to me that the "commit" logic in the Solr updateRequestHandler
>> (or wherever the logic is actually located) conflates two different
>> semantics. One semantic is what you need to do to make the index process
>> perform well. The other semantic is guaranteed atomicity of document
>> reception by Solr.
>> In particular, it would be nice to be able to post documents in such a
>> way that you can guarantee that the document is permanently in Solr's
>> queue, safe in the event of a Solr restart, etc., even if the document
>> has not yet been "committed".
>> This issue came up in the LCF talk that I gave, and I initially thought
>> that separating the two kinds of events would necessarily be an LCF
>> change, but the more I thought about it the more I realized that other
>> Solr indexing clients may also benefit from such a separation.
>> Does anyone agree? Where should this logic properly live?
>> Thanks,
>> Karl
>
> It's an interesting idea - but I think you would likely pay a similar
> cost to guarantee reception as you would to commit (also, I'm not sure
> Lucene guarantees it - it works for consistency, but I'm not so sure it
> achieves durability).
>
> I can think of two things offhand -
>
> Perhaps store the text and use fsync to quasi guarantee acceptance -
> then index from the store on the commit.
>
> Another, simpler idea if only the separation is important and not the
> performance - index to a separate side index, taking advantage of Lucene's
> current commit functionality, and then use addIndexes to merge into the main
> index on commit.
>
> Just spitballing though.
>
> I think this would obviously need to be an optional mode.
>

-- 
- Mark

http://www.lucidimagination.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

