Re: Solr updateRequestHandler and performance vs. atomicity

2010-06-02 Thread Otis Gospodnetic
While preparing material for 
http://blog.sematext.com/2010/06/02/lucene-digest-may-2010-3/ I came across 
something that looks relevant:

https://issues.apache.org/jira/browse/LUCENE-2456

...where the author wrote this:

"In conclusion, this directory attempts to marry the rich search-based 
query language of Lucene with the distributed fault-tolerant database 
that is Cassandra. By delegating the responsibilities of replication, 
durability and elasticity to the directory, we free the layers above 
from such non-functional concerns. Our hope is that users will choose to make 
their large-scale indices instantly scalable by seamlessly 
migrating them to this type of directory (using 
Directory#copyTo(Directory))."
 Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Yonik Seeley 
> To: dev@lucene.apache.org
> Sent: Tue, May 25, 2010 8:59:29 AM
> Subject: Re: Solr updateRequestHandler and performance vs. atomicity
> 
> On Mon, May 24, 2010 at 9:10 AM, karl.wri...@nokia.com wrote:
> > In particular, it would be nice to be able to post documents in such a way
> > that you can guarantee that the document is permanently in Solr’s queue,
> > safe in the event of a Solr restart, etc., even if the document has not yet
> > been “committed”.
> 
> Yep, this is a longer term goal of SolrCloud.
> And to be truly safe, committing to stable storage is not enough -
> that still might crash and never recover.  One needs to write to
> multiple nodes.
> 
> -Yonik
> http://www.lucidimagination.com


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Solr updateRequestHandler and performance vs. atomicity

2010-05-25 Thread Paul Elschot
Sounds like a distributed two phase commit is needed.
Would http://activemq.apache.org/ do the job?
If it does, camel (split off of activemq) has a lucene component
that could be of interest, too.
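
For readers unfamiliar with the term, here is a minimal sketch of a two-phase commit coordinator. It is a toy model under stated assumptions: `Participant` and `two_phase_commit` are hypothetical names, nothing here is the ActiveMQ or Camel API, and a real coordinator would also need durable logs and timeouts.

```python
class Participant:
    """Hypothetical participant (e.g. the crawler's database, or an index)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.staged = None
        self.committed = []

    def prepare(self, doc_id):
        # Phase 1: stage the change and vote yes, or vote no.
        if self.can_commit:
            self.staged = doc_id
        return self.can_commit

    def commit(self):
        # Phase 2 (success path): make the staged change permanent.
        self.committed.append(self.staged)
        self.staged = None

    def rollback(self):
        # Phase 2 (failure path): discard anything that was staged.
        self.staged = None


def two_phase_commit(participants, doc_id):
    # all() stops at the first "no" vote; later participants are never prepared.
    if all(p.prepare(doc_id) for p in participants):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()   # harmless for participants that staged nothing
    return False


crawler = Participant("crawler-db")
index = Participant("index")
ok = two_phase_commit([crawler, index], "doc1")

broken = Participant("index", can_commit=False)
failed = two_phase_commit([crawler, broken], "doc2")
print(ok, failed, crawler.committed)
```

The point of the sketch is that the crawler's record of "doc2" is only made permanent if the index also agrees to commit it.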

Regards,
Paul Elschot

On Tuesday 25 May 2010 14:59:29, Yonik Seeley wrote:
> On Mon, May 24, 2010 at 9:10 AM, karl.wri...@nokia.com wrote:
> > In particular, it would be nice to be able to post documents in such a way
> > that you can guarantee that the document is permanently in Solr’s queue,
> > safe in the event of a Solr restart, etc., even if the document has not yet
> > been “committed”.
> 
> Yep, this is a longer term goal of SolrCloud.
> And to be truly safe, committing to stable storage is not enough -
> that still might crash and never recover.  One needs to write to
> multiple nodes.
> 
> -Yonik
> http://www.lucidimagination.com
> 



Re: Solr updateRequestHandler and performance vs. atomicity

2010-05-25 Thread Yonik Seeley
On Mon, May 24, 2010 at 9:10 AM, karl.wri...@nokia.com wrote:
> In particular, it would be nice to be able to post documents in such a way
> that you can guarantee that the document is permanently in Solr’s queue,
> safe in the event of a Solr restart, etc., even if the document has not yet
> been “committed”.

Yep, this is a longer term goal of SolrCloud.
And to be truly safe, committing to stable storage is not enough -
that still might crash and never recover.  One needs to write to
multiple nodes.
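
A minimal sketch of the "write to multiple nodes" idea, assuming a toy in-memory model: `Node`, `replicated_write` and `min_acks` are hypothetical names for illustration and are not SolrCloud APIs. A write is only acknowledged to the client once enough nodes have accepted it.

```python
class Node:
    """Toy replica: keeps accepted documents in a list."""

    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive
        self.log = []

    def write(self, doc_id):
        if not self.alive:
            raise ConnectionError(self.name + " is down")
        self.log.append(doc_id)


def replicated_write(nodes, doc_id, min_acks):
    """Send the document to every node; succeed only with enough acks."""
    acks = 0
    for node in nodes:
        try:
            node.write(doc_id)
            acks += 1
        except ConnectionError:
            continue   # a dead node simply does not count as an ack
    return acks >= min_acks


nodes = [Node("a"), Node("b", alive=False), Node("c")]
two_of_three = replicated_write(nodes, "doc1", min_acks=2)
all_three = replicated_write(nodes, "doc2", min_acks=3)
print(two_of_three, all_three)
```

With one of three nodes down, a requirement of two acks is met and a requirement of three is not, so the client knows whether the document survives the loss of any single node.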

-Yonik
http://www.lucidimagination.com




RE: Solr updateRequestHandler and performance vs. atomicity

2010-05-25 Thread karl.wright
I created SOLR-1924.  Let me know if it's clear enough, or if you'd like me to 
modify the ticket in any way.
Thanks,
Karl

From: ext Mark Miller [markrmil...@gmail.com]
Sent: Tuesday, May 25, 2010 5:20 AM
To: dev@lucene.apache.org
Subject: Re: Solr updateRequestHandler and performance vs. atomicity

Okay, makes sense.

Perhaps one easier way to explore this is the aux index idea, but only
use stored fields - that gets us Lucene's commit stuff for free,
without analysis.  Then there is just the more difficult part of
ensuring transfer from this mini index to the main index for indexing
on commit.

I'd def open a jira issue for this functionality. You will still pay
for committing so often (frequent fsync is costly, especially on some
fs) but I'm sure you can pay a lot less than currently.

On Tuesday, May 25, 2010, karl.wri...@nokia.com wrote:
> The reason for this is simple.  LCF keeps track of which documents it has 
> handed off to Solr, and has a fairly involved mechanism for making sure that 
> every document LCF *thinks* got there, actually does.  It even uses a 
> mechanism akin to a 2-phase commit to make sure that its internal records and 
> those of the downstream index are never out of synch.
>
> Now, along comes Solr, and the system loses a good deal of its resilience, 
> because there is a chance that somebody or something will kick Solr after a 
> document (or a set of documents) has been transmitted to it, but LCF will 
> have no awareness of this situation at all, and will thus never try to fix 
> the problem on the next job run (or whatever).  So instead of automatic 
> resilience, you get one of two possible solutions:
>
> (1) Manual intervention.  Somebody has to manually inform LCF of the Solr 
> hiccup, and LCF thus will have to invalidate all documents it ever sent to 
> Solr (because it doesn't know what documents could have been affected).
> (2) A solr commit on every post.  This slows down LCF significantly, because 
> each document post takes something like 10x as long to do.
>
> Does this help?
> Karl
>
> -Original Message-
> From: ext Mark Miller [mailto:markrmil...@gmail.com]
> Sent: Monday, May 24, 2010 4:40 PM
> To: dev@lucene.apache.org
> Subject: Re: Solr updateRequestHandler and performance vs. atomicity
>
> Indexing a doc won't be as fast as raw disk IO. But you won't be doing
> just raw disk IO to guarantee acceptance. And that will have a cost and
> complexity that really makes me wonder if it's worth the speed advantage.
> For very large documents with complex analyzers...perhaps. But it's not
> going to be an easily implementable feature (if it's a true guarantee).
> And it's still got to involve logs and/or fsync and all that.
>
> The reasoning for this is not ringing a bell - can you elaborate on the
> motivations?
>
> Is this so that you can commit on every doc? Every few docs?
>
> I can def see how this would be desirable in general, but just to be
> clear on your motivations.
>
>
> - Mark
>
> On 5/24/10 10:03 PM, karl.wri...@nokia.com wrote:
>> Hi Mark,
>>
>> Unfortunately, indexing performance *is* of concern, otherwise I'd already 
>> be committing on every post.
>>
>> If your guess is correct, you are basically saying that adding a document to 
>> an index in Solr/Lucene is just as fast as writing that file directly to the 
>> disk.  Because, obviously, if we want guaranteed delivery, that's what we'd 
>> have to do.  But I think this is worth the experiment - Solr/Lucene may be 
>> fast, but I have doubts that it can perform as well as raw disk I/O and 
>> still manage to do anything in the way of document analysis or (heaven 
>> forbid) text extraction.
>>
>>
>>
>> -Original Message-
>> From: ext Mark Miller [mailto:markrmil...@gmail.com]
>> Sent: Monday, May 24, 2010 3:33 PM
>> To: dev@lucene.apache.org
>> Subject: Re: Solr updateRequestHandler and performance vs. atomicity
>>
>> On 5/24/10 3:10 PM, karl.wri...@nokia.com wrote:
>>> Hi all,
>>> It seems to me that the "commit" logic in the Solr updateRequestHandler
>>> (or wherever the logic is actually located) conflates two different
>>> semantics. One semantic is what you need to do to make the index process
>>> perform well. The other semantic is guaranteed atomicity of document
>>> reception by Solr.
>>> In particular, it would be nice to be able to post documents in such a
>>> way that you can guarantee that the document is permanently in Solr's
>>> queue, safe in the event of a Solr restart, etc., even if the document
>>> has not yet been "committed".

RE: Solr updateRequestHandler and performance vs. atomicity

2010-05-25 Thread karl.wright
Hi Simon,

I think you are on the right track.

I believe it is not even possible to write a middleware-style layer that stores 
documents and performs periodic commits on its own, because the update request 
handler never ACKs individual documents on a commit, but merely everything it 
has seen since the last time Solr bounced.  So you have this potential scenario:

- middleware layer receives document 1, saves it
- middleware layer receives document 2, saves it
Now it's time for the commit, so:
- middleware layer sends document 1 to updateRequestHandler
- solr is restarted, dropping all uncommitted documents on the floor
- middleware layer sends document 2 to updateRequestHandler
- middleware layer sends COMMIT to updateRequestHandler, but solr adds only 
document 2 to the index
- middleware believes incorrectly that it has successfully committed both 
documents

If I were any kind of mathematician, I suspect I could even prove that the 
current API has this inherent race condition built into its semantics.
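
The scenario above can be replayed with a toy model. This is only a sketch: `ToySolr` is a hypothetical stand-in that mimics the behaviour described in this message (uncommitted documents are lost on restart, and the commit reply does not say which documents it covered); it is not Solr's actual API.

```python
class ToySolr:
    """Toy stand-in for a Solr node: pending docs live only in memory."""

    def __init__(self):
        self.pending = []     # received but not yet committed
        self.index = set()    # committed documents

    def add(self, doc_id):
        self.pending.append(doc_id)

    def restart(self):
        # Uncommitted documents are dropped; the committed index survives.
        self.pending = []

    def commit(self):
        # The reply is a bare "OK"; it does not list the documents covered.
        self.index.update(self.pending)
        self.pending = []
        return "OK"


def run_scenario():
    solr = ToySolr()
    sent = ["doc1", "doc2"]   # what the middleware believes it delivered

    solr.add("doc1")
    solr.restart()            # happens without the middleware noticing
    solr.add("doc2")
    reply = solr.commit()

    believed = set(sent) if reply == "OK" else set()
    return believed, solr.index


believed, actual = run_scenario()
print("middleware believes committed:", sorted(believed))
print("actually in the index:", sorted(actual))
print("silently lost:", sorted(believed - actual))
```

Because the only signal the middleware gets back is the commit reply, nothing in this exchange lets it distinguish the run above from one where no restart happened.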

I never claimed this was going to be easy :-).  But it does seem to be 
valuable, perhaps critically so.

Karl


From: ext Simon Willnauer [simon.willna...@googlemail.com]
Sent: Monday, May 24, 2010 4:29 PM
To: dev@lucene.apache.org
Subject: Re: Solr updateRequestHandler and performance vs. atomicity

Hi Karl,

What you are describing seems to be a good use case for something like
a message queue, where you push a document or record to a queue that
guarantees persistence. I look at this from a slightly different
perspective: in a distributed environment you would have to guarantee
delivery not to a single Solr instance but to several, or at least n,
instances, but that is a different story.

From a Solr point of view this sounds like a need for a write-ahead
log that guarantees durability and atomicity. I like this idea as it
might also solve lots of problems in distributed environments (Solr
Cloud) etc.

Very interesting topic - should investigate more in this direction


simon


On Mon, May 24, 2010 at 10:03 PM, karl.wri...@nokia.com wrote:
> Hi Mark,
>
> Unfortunately, indexing performance *is* of concern, otherwise I'd already be 
> committing on every post.
>
> If your guess is correct, you are basically saying that adding a document to 
> an index in Solr/Lucene is just as fast as writing that file directly to the 
> disk.  Because, obviously, if we want guaranteed delivery, that's what we'd 
> have to do.  But I think this is worth the experiment - Solr/Lucene may be 
> fast, but I have doubts that it can perform as well as raw disk I/O and still 
> manage to do anything in the way of document analysis or (heaven forbid) text 
> extraction.
>
>
>
> -Original Message-
> From: ext Mark Miller [mailto:markrmil...@gmail.com]
> Sent: Monday, May 24, 2010 3:33 PM
> To: dev@lucene.apache.org
> Subject: Re: Solr updateRequestHandler and performance vs. atomicity
>
> On 5/24/10 3:10 PM, karl.wri...@nokia.com wrote:
>> Hi all,
>> It seems to me that the "commit" logic in the Solr updateRequestHandler
>> (or wherever the logic is actually located) conflates two different
>> semantics. One semantic is what you need to do to make the index process
>> perform well. The other semantic is guaranteed atomicity of document
>> reception by Solr.
>> In particular, it would be nice to be able to post documents in such a
>> way that you can guarantee that the document is permanently in Solr's
>> queue, safe in the event of a Solr restart, etc., even if the document
>> has not yet been "committed".
>> This issue came up in the LCF talk that I gave, and I initially thought
>> that separating the two kinds of events would necessarily be an LCF
>> change, but the more I thought about it the more I realized that other
>> Solr indexing clients may also benefit from such a separation.
>> Does anyone agree? Where should this logic properly live?
>> Thanks,
>> Karl
>
> It's an interesting idea - but I think you would likely pay a similar
> cost to guarantee reception as you would to commit (also, I'm not sure
> Lucene guarantees it - it works for consistency, but I'm not so sure it
> achieves durability).
>
> I can think of two things offhand -
>
> Perhaps store the text and use fsync to quasi guarantee acceptance -
> then index from the store on the commit.
>
> Another simpler idea if only the separation is important and not the
> performance - index to another side index, taking advantage of Lucene's
> current commit functionality, and then use addIndex to merge to the main
> index on commit.
>
> Just spit balling though.
>
> I think this would obviously need to be an optional mode.

Re: Solr updateRequestHandler and performance vs. atomicity

2010-05-25 Thread Mark Miller
Okay, makes sense.

Perhaps one easier way to explore this is the aux index idea, but only
use stored fields - that gets us Lucene's commit stuff for free,
without analysis.  Then there is just the more difficult part of
ensuring transfer from this mini index to the main index for indexing
on commit.

I'd def open a jira issue for this functionality. You will still pay
for committing so often (frequent fsync is costly, especially on some
fs) but I'm sure you can pay a lot less than currently.

On Tuesday, May 25, 2010, karl.wri...@nokia.com wrote:
> The reason for this is simple.  LCF keeps track of which documents it has 
> handed off to Solr, and has a fairly involved mechanism for making sure that 
> every document LCF *thinks* got there, actually does.  It even uses a 
> mechanism akin to a 2-phase commit to make sure that its internal records and 
> those of the downstream index are never out of synch.
>
> Now, along comes Solr, and the system loses a good deal of its resilience, 
> because there is a chance that somebody or something will kick Solr after a 
> document (or a set of documents) has been transmitted to it, but LCF will 
> have no awareness of this situation at all, and will thus never try to fix 
> the problem on the next job run (or whatever).  So instead of automatic 
> resilience, you get one of two possible solutions:
>
> (1) Manual intervention.  Somebody has to manually inform LCF of the Solr 
> hiccup, and LCF thus will have to invalidate all documents it ever sent to 
> Solr (because it doesn't know what documents could have been affected).
> (2) A solr commit on every post.  This slows down LCF significantly, because 
> each document post takes something like 10x as long to do.
>
> Does this help?
> Karl
>
> -Original Message-
> From: ext Mark Miller [mailto:markrmil...@gmail.com]
> Sent: Monday, May 24, 2010 4:40 PM
> To: dev@lucene.apache.org
> Subject: Re: Solr updateRequestHandler and performance vs. atomicity
>
> Indexing a doc won't be as fast as raw disk IO. But you won't be doing
> just raw disk IO to guarantee acceptance. And that will have a cost and
> complexity that really makes me wonder if it's worth the speed advantage.
> For very large documents with complex analyzers...perhaps. But it's not
> going to be an easily implementable feature (if it's a true guarantee).
> And it's still got to involve logs and/or fsync and all that.
>
> The reasoning for this is not ringing a bell - can you elaborate on the
> motivations?
>
> Is this so that you can commit on every doc? Every few docs?
>
> I can def see how this would be desirable in general, but just to be
> clear on your motivations.
>
>
> - Mark
>
> On 5/24/10 10:03 PM, karl.wri...@nokia.com wrote:
>> Hi Mark,
>>
>> Unfortunately, indexing performance *is* of concern, otherwise I'd already 
>> be committing on every post.
>>
>> If your guess is correct, you are basically saying that adding a document to 
>> an index in Solr/Lucene is just as fast as writing that file directly to the 
>> disk.  Because, obviously, if we want guaranteed delivery, that's what we'd 
>> have to do.  But I think this is worth the experiment - Solr/Lucene may be 
>> fast, but I have doubts that it can perform as well as raw disk I/O and 
>> still manage to do anything in the way of document analysis or (heaven 
>> forbid) text extraction.
>>
>>
>>
>> -Original Message-
>> From: ext Mark Miller [mailto:markrmil...@gmail.com]
>> Sent: Monday, May 24, 2010 3:33 PM
>> To: dev@lucene.apache.org
>> Subject: Re: Solr updateRequestHandler and performance vs. atomicity
>>
>> On 5/24/10 3:10 PM, karl.wri...@nokia.com wrote:
>>> Hi all,
>>> It seems to me that the "commit" logic in the Solr updateRequestHandler
>>> (or wherever the logic is actually located) conflates two different
>>> semantics. One semantic is what you need to do to make the index process
>>> perform well. The other semantic is guaranteed atomicity of document
>>> reception by Solr.
>>> In particular, it would be nice to be able to post documents in such a
>>> way that you can guarantee that the document is permanently in Solr's
>>> queue, safe in the event of a Solr restart, etc., even if the document
>>> has not yet been "committed".
>>> This issue came up in the LCF talk that I gave, and I initially thought
>>> that separating the two kinds of events would necessarily be an LCF
>>> change, but the more I thought about it the more I realized that other
>>> Solr indexing clients may also benefit from such a separation.
>>> 

RE: Solr updateRequestHandler and performance vs. atomicity

2010-05-24 Thread karl.wright
The reason for this is simple.  LCF keeps track of which documents it has 
handed off to Solr, and has a fairly involved mechanism for making sure that 
every document LCF *thinks* got there, actually does.  It even uses a mechanism 
akin to a 2-phase commit to make sure that its internal records and those of 
the downstream index are never out of synch.

Now, along comes Solr, and the system loses a good deal of its resilience, 
because there is a chance that somebody or something will kick Solr after a 
document (or a set of documents) has been transmitted to it, but LCF will have 
no awareness of this situation at all, and will thus never try to fix the 
problem on the next job run (or whatever).  So instead of automatic resilience, 
you get one of two possible solutions:

(1) Manual intervention.  Somebody has to manually inform LCF of the Solr 
hiccup, and LCF thus will have to invalidate all documents it ever sent to Solr 
(because it doesn't know what documents could have been affected).
(2) A solr commit on every post.  This slows down LCF significantly, because 
each document post takes something like 10x as long to do.
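
To make the trade-off in option (2) concrete, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that a plain post costs 1 unit and that a commit adds a fixed 9 units regardless of batch size (so that post plus commit matches the "something like 10x" figure above); `avg_cost_per_doc` is a hypothetical helper, not a measurement.

```python
def avg_cost_per_doc(batch_size, post_cost=1.0, commit_cost=9.0):
    """Average cost per document when committing once every batch_size posts."""
    return post_cost + commit_cost / batch_size


for n in (1, 10, 100, 1000):
    print(n, round(avg_cost_per_doc(n), 3))
```

Under these assumptions the overhead falls off quickly as the batch grows, but every document in an uncommitted batch is exposed to the restart problem described above, which is why batching alone does not remove the need for option (1).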

Does this help?
Karl

-Original Message-
From: ext Mark Miller [mailto:markrmil...@gmail.com] 
Sent: Monday, May 24, 2010 4:40 PM
To: dev@lucene.apache.org
Subject: Re: Solr updateRequestHandler and performance vs. atomicity

Indexing a doc won't be as fast as raw disk IO. But you won't be doing 
just raw disk IO to guarantee acceptance. And that will have a cost and 
complexity that really makes me wonder if it's worth the speed advantage. 
For very large documents with complex analyzers...perhaps. But it's not 
going to be an easily implementable feature (if it's a true guarantee). 
And it's still got to involve logs and/or fsync and all that.

The reasoning for this is not ringing a bell - can you elaborate on the 
motivations?

Is this so that you can commit on every doc? Every few docs?

I can def see how this would be desirable in general, but just to be 
clear on your motivations.


- Mark

On 5/24/10 10:03 PM, karl.wri...@nokia.com wrote:
> Hi Mark,
>
> Unfortunately, indexing performance *is* of concern, otherwise I'd already be 
> committing on every post.
>
> If your guess is correct, you are basically saying that adding a document to 
> an index in Solr/Lucene is just as fast as writing that file directly to the 
> disk.  Because, obviously, if we want guaranteed delivery, that's what we'd 
> have to do.  But I think this is worth the experiment - Solr/Lucene may be 
> fast, but I have doubts that it can perform as well as raw disk I/O and still 
> manage to do anything in the way of document analysis or (heaven forbid) text 
> extraction.
>
>
>
> -Original Message-
> From: ext Mark Miller [mailto:markrmil...@gmail.com]
> Sent: Monday, May 24, 2010 3:33 PM
> To: dev@lucene.apache.org
> Subject: Re: Solr updateRequestHandler and performance vs. atomicity
>
> On 5/24/10 3:10 PM, karl.wri...@nokia.com wrote:
>> Hi all,
>> It seems to me that the "commit" logic in the Solr updateRequestHandler
>> (or wherever the logic is actually located) conflates two different
>> semantics. One semantic is what you need to do to make the index process
>> perform well. The other semantic is guaranteed atomicity of document
>> reception by Solr.
>> In particular, it would be nice to be able to post documents in such a
>> way that you can guarantee that the document is permanently in Solr's
>> queue, safe in the event of a Solr restart, etc., even if the document
>> has not yet been "committed".
>> This issue came up in the LCF talk that I gave, and I initially thought
>> that separating the two kinds of events would necessarily be an LCF
>> change, but the more I thought about it the more I realized that other
>> Solr indexing clients may also benefit from such a separation.
>> Does anyone agree? Where should this logic properly live?
>> Thanks,
>> Karl
>
> It's an interesting idea - but I think you would likely pay a similar
> cost to guarantee reception as you would to commit (also, I'm not sure
> Lucene guarantees it - it works for consistency, but I'm not so sure it
> achieves durability).
>
> I can think of two things offhand -
>
> Perhaps store the text and use fsync to quasi guarantee acceptance -
> then index from the store on the commit.
>
> Another simpler idea if only the separation is important and not the
> performance - index to another side index, taking advantage of Lucene's
> current commit functionality, and then use addIndex to merge to the main
> index on commit.
>
> Just spit balling though.
>
>

Re: Solr updateRequestHandler and performance vs. atomicity

2010-05-24 Thread Mark Miller
Indexing a doc won't be as fast as raw disk IO. But you won't be doing 
just raw disk IO to guarantee acceptance. And that will have a cost and 
complexity that really makes me wonder if it's worth the speed advantage. 
For very large documents with complex analyzers...perhaps. But it's not 
going to be an easily implementable feature (if it's a true guarantee). 
And it's still got to involve logs and/or fsync and all that.


The reasoning for this is not ringing a bell - can you elaborate on the 
motivations?


Is this so that you can commit on every doc? Every few docs?

I can def see how this would be desirable in general, but just to be 
clear on your motivations.



- Mark

On 5/24/10 10:03 PM, karl.wri...@nokia.com wrote:

Hi Mark,

Unfortunately, indexing performance *is* of concern, otherwise I'd already be 
committing on every post.

If your guess is correct, you are basically saying that adding a document to an 
index in Solr/Lucene is just as fast as writing that file directly to the disk. 
 Because, obviously, if we want guaranteed delivery, that's what we'd have to 
do.  But I think this is worth the experiment - Solr/Lucene may be fast, but I 
have doubts that it can perform as well as raw disk I/O and still manage to do 
anything in the way of document analysis or (heaven forbid) text extraction.



-Original Message-
From: ext Mark Miller [mailto:markrmil...@gmail.com]
Sent: Monday, May 24, 2010 3:33 PM
To: dev@lucene.apache.org
Subject: Re: Solr updateRequestHandler and performance vs. atomicity

On 5/24/10 3:10 PM, karl.wri...@nokia.com wrote:

Hi all,
It seems to me that the "commit" logic in the Solr updateRequestHandler
(or wherever the logic is actually located) conflates two different
semantics. One semantic is what you need to do to make the index process
perform well. The other semantic is guaranteed atomicity of document
reception by Solr.
In particular, it would be nice to be able to post documents in such a
way that you can guarantee that the document is permanently in Solr's
queue, safe in the event of a Solr restart, etc., even if the document
has not yet been "committed".
This issue came up in the LCF talk that I gave, and I initially thought
that separating the two kinds of events would necessarily be an LCF
change, but the more I thought about it the more I realized that other
Solr indexing clients may also benefit from such a separation.
Does anyone agree? Where should this logic properly live?
Thanks,
Karl


It's an interesting idea - but I think you would likely pay a similar
cost to guarantee reception as you would to commit (also, I'm not sure
Lucene guarantees it - it works for consistency, but I'm not so sure it
achieves durability).

I can think of two things offhand -

Perhaps store the text and use fsync to quasi guarantee acceptance -
then index from the store on the commit.

Another simpler idea if only the separation is important and not the
performance - index to another side index, taking advantage of Lucene's
current commit functionality, and then use addIndex to merge to the main
index on commit.

Just spit balling though.

I think this would obviously need to be an optional mode.




--
- Mark

http://www.lucidimagination.com




Re: Solr updateRequestHandler and performance vs. atomicity

2010-05-24 Thread Simon Willnauer
Hi Karl,

What you are describing seems to be a good use case for something like
a message queue, where you push a document or record to a queue that
guarantees persistence. I look at this from a slightly different
perspective: in a distributed environment you would have to guarantee
delivery not to a single Solr instance but to several, or at least n,
instances, but that is a different story.

From a Solr point of view this sounds like a need for a write-ahead
log that guarantees durability and atomicity. I like this idea as it
might also solve lots of problems in distributed environments (Solr
Cloud) etc.

Very interesting topic - should investigate more in this direction
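
A minimal sketch of what such a write-ahead log could look like, assuming a single append-only file with length- and checksum-prefixed records. `wal_append`, `wal_replay` and the file name are hypothetical; this is not an existing Solr feature, only an illustration of the durability and atomicity properties mentioned above.

```python
import os
import struct
import tempfile
import zlib

HEADER = struct.Struct(">II")  # payload length, CRC32 of payload


def wal_append(path, payload):
    """Append one record and fsync before returning (the point of the ack)."""
    record = HEADER.pack(len(payload), zlib.crc32(payload)) + payload
    with open(path, "ab") as f:
        f.write(record)
        f.flush()
        os.fsync(f.fileno())


def wal_replay(path):
    """Return all intact records; stop at the first torn or corrupt one."""
    records = []
    if not os.path.exists(path):
        return records
    with open(path, "rb") as f:
        data = f.read()
    pos = 0
    while pos + HEADER.size <= len(data):
        length, crc = HEADER.unpack_from(data, pos)
        start = pos + HEADER.size
        payload = data[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break              # partial write from a crash: ignore the tail
        records.append(payload)
        pos = start + length
    return records


with tempfile.TemporaryDirectory() as d:
    log = os.path.join(d, "updates.wal")
    wal_append(log, b"doc1")
    wal_append(log, b"doc2")
    with open(log, "ab") as f:
        f.write(b"\x00\x00\x00\x09par")   # simulated crash mid-record
    recovered = wal_replay(log)
print(recovered)
```

Each record is either fully recovered or ignored, so a client that received an acknowledgement for a document can rely on it being replayed into the index after a restart.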


simon


On Mon, May 24, 2010 at 10:03 PM, karl.wri...@nokia.com wrote:
> Hi Mark,
>
> Unfortunately, indexing performance *is* of concern, otherwise I'd already be 
> committing on every post.
>
> If your guess is correct, you are basically saying that adding a document to 
> an index in Solr/Lucene is just as fast as writing that file directly to the 
> disk.  Because, obviously, if we want guaranteed delivery, that's what we'd 
> have to do.  But I think this is worth the experiment - Solr/Lucene may be 
> fast, but I have doubts that it can perform as well as raw disk I/O and still 
> manage to do anything in the way of document analysis or (heaven forbid) text 
> extraction.
>
>
>
> -Original Message-
> From: ext Mark Miller [mailto:markrmil...@gmail.com]
> Sent: Monday, May 24, 2010 3:33 PM
> To: dev@lucene.apache.org
> Subject: Re: Solr updateRequestHandler and performance vs. atomicity
>
> On 5/24/10 3:10 PM, karl.wri...@nokia.com wrote:
>> Hi all,
>> It seems to me that the "commit" logic in the Solr updateRequestHandler
>> (or wherever the logic is actually located) conflates two different
>> semantics. One semantic is what you need to do to make the index process
>> perform well. The other semantic is guaranteed atomicity of document
>> reception by Solr.
>> In particular, it would be nice to be able to post documents in such a
>> way that you can guarantee that the document is permanently in Solr's
>> queue, safe in the event of a Solr restart, etc., even if the document
>> has not yet been "committed".
>> This issue came up in the LCF talk that I gave, and I initially thought
>> that separating the two kinds of events would necessarily be an LCF
>> change, but the more I thought about it the more I realized that other
>> Solr indexing clients may also benefit from such a separation.
>> Does anyone agree? Where should this logic properly live?
>> Thanks,
>> Karl
>
> It's an interesting idea - but I think you would likely pay a similar
> cost to guarantee reception as you would to commit (also, I'm not sure
> Lucene guarantees it - it works for consistency, but I'm not so sure it
> achieves durability).
>
> I can think of two things offhand -
>
> Perhaps store the text and use fsync to quasi guarantee acceptance -
> then index from the store on the commit.
>
> Another simpler idea if only the separation is important and not the
> performance - index to another side index, taking advantage of Lucene's
> current commit functionality, and then use addIndex to merge to the main
> index on commit.
>
> Just spit balling though.
>
> I think this would obviously need to be an optional mode.
>
> --
> - Mark
>
> http://www.lucidimagination.com
>



RE: Solr updateRequestHandler and performance vs. atomicity

2010-05-24 Thread karl.wright
Hi Mark,

Unfortunately, indexing performance *is* of concern, otherwise I'd already be 
committing on every post.

If your guess is correct, you are basically saying that adding a document to an 
index in Solr/Lucene is just as fast as writing that file directly to the disk. 
 Because, obviously, if we want guaranteed delivery, that's what we'd have to 
do.  But I think this is worth the experiment - Solr/Lucene may be fast, but I 
have doubts that it can perform as well as raw disk I/O and still manage to do 
anything in the way of document analysis or (heaven forbid) text extraction.



-Original Message-
From: ext Mark Miller [mailto:markrmil...@gmail.com] 
Sent: Monday, May 24, 2010 3:33 PM
To: dev@lucene.apache.org
Subject: Re: Solr updateRequestHandler and performance vs. atomicity

On 5/24/10 3:10 PM, karl.wri...@nokia.com wrote:
> Hi all,
> It seems to me that the "commit" logic in the Solr updateRequestHandler
> (or wherever the logic is actually located) conflates two different
> semantics. One semantic is what you need to do to make the index process
> perform well. The other semantic is guaranteed atomicity of document
> reception by Solr.
> In particular, it would be nice to be able to post documents in such a
> way that you can guarantee that the document is permanently in Solr's
> queue, safe in the event of a Solr restart, etc., even if the document
> has not yet been "committed".
> This issue came up in the LCF talk that I gave, and I initially thought
> that separating the two kinds of events would necessarily be an LCF
> change, but the more I thought about it the more I realized that other
> Solr indexing clients may also benefit from such a separation.
> Does anyone agree? Where should this logic properly live?
> Thanks,
> Karl

It's an interesting idea - but I think you would likely pay a similar 
cost to guarantee reception as you would to commit (also, I'm not sure 
Lucene guarantees it - it works for consistency, but I'm not so sure it 
achieves durability).

I can think of two things offhand -

Perhaps store the text and use fsync to quasi guarantee acceptance - 
then index from the store on the commit.

Another, simpler idea, if only the separation is important and not the 
performance - index to a separate side index, taking advantage of Lucene's 
current commit functionality, and then use addIndexes to merge into the main 
index on commit.

Just spitballing though.

I think this would obviously need to be an optional mode.

-- 
- Mark

http://www.lucidimagination.com

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org





Re: Solr updateRequestHandler and performance vs. atomicity

2010-05-24 Thread Mark Miller

On 5/24/10 3:10 PM, karl.wri...@nokia.com wrote:

> Hi all,
> It seems to me that the “commit” logic in the Solr updateRequestHandler
> (or wherever the logic is actually located) conflates two different
> semantics. One semantic is what you need to do to make the index process
> perform well. The other semantic is guaranteed atomicity of document
> reception by Solr.
> In particular, it would be nice to be able to post documents in such a
> way that you can guarantee that the document is permanently in Solr’s
> queue, safe in the event of a Solr restart, etc., even if the document
> has not yet been “committed”.
> This issue came up in the LCF talk that I gave, and I initially thought
> that separating the two kinds of events would necessarily be an LCF
> change, but the more I thought about it the more I realized that other
> Solr indexing clients may also benefit from such a separation.
> Does anyone agree? Where should this logic properly live?
> Thanks,
> Karl


It's an interesting idea - but I think you would likely pay a similar 
cost to guarantee reception as you would to commit (also, I'm not sure 
Lucene guarantees it - it works for consistency, but I'm not so sure it 
achieves durability).


I can think of two things offhand -

Perhaps store the text and use fsync to quasi guarantee acceptance - 
then index from the store on the commit.
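
A minimal sketch of the store-and-fsync idea, using only the Java standard
library. The class name FsyncJournal and its methods accept and replay are
hypothetical, not part of Solr or Lucene. It assumes one append-only file of
length-prefixed records, and that FileChannel.force(true) is an acceptable
stand-in for fsync. It only "quasi" guarantees acceptance: it does not sync
the parent directory, and it does nothing if the disk itself is lost.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a Solr or Lucene class): an append-only journal
// that forces each accepted document to disk before acknowledging it.
final class FsyncJournal implements AutoCloseable {
    private final FileChannel channel;

    FsyncJournal(Path file) throws IOException {
        channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.APPEND);
    }

    // Append one record (length-prefixed id and text), then force it to the
    // storage device. Only after this returns would the caller acknowledge
    // the post to the client.
    void accept(String id, String text) throws IOException {
        byte[] idBytes = id.getBytes(StandardCharsets.UTF_8);
        byte[] textBytes = text.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(8 + idBytes.length + textBytes.length);
        buf.putInt(idBytes.length).put(idBytes);
        buf.putInt(textBytes.length).put(textBytes);
        buf.flip();
        while (buf.hasRemaining()) {
            channel.write(buf);
        }
        channel.force(true); // the fsync step; this is where the cost is paid
    }

    // Read back every complete record, e.g. at commit time or after a
    // restart, so the documents can be handed to the indexer. A torn record
    // at the end of the file is ignored.
    static List<String[]> replay(Path file) throws IOException {
        List<String[]> docs = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(Files.readAllBytes(file));
        while (true) {
            String id = readString(buf);
            String text = (id == null) ? null : readString(buf);
            if (text == null) {
                break;
            }
            docs.add(new String[] {id, text});
        }
        return docs;
    }

    // Returns null when the remaining bytes do not hold a complete field.
    private static String readString(ByteBuffer buf) {
        if (buf.remaining() < 4) {
            return null;
        }
        int len = buf.getInt();
        if (len < 0 || len > buf.remaining()) {
            return null;
        }
        byte[] bytes = new byte[len];
        buf.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```

On commit, the indexer would call replay, add each document to the index,
commit the index, and only then truncate or delete the journal. Forcing once
per batch of posts rather than once per document would reduce the cost, at
the price of delaying the acknowledgement.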


Another, simpler idea, if only the separation is important and not the 
performance - index to a separate side index, taking advantage of Lucene's 
current commit functionality, and then use addIndexes to merge into the main 
index on commit.


Just spitballing though.

I think this would obviously need to be an optional mode.

--
- Mark

http://www.lucidimagination.com




Re: Solr updateRequestHandler and performance vs. atomicity

2010-05-24 Thread Peter Wolanin
We use autocommit with Solr and I've had this worry too - apparently,
if you get a hard crash, Solr will roll back the not-yet-committed
docs.

I don't think it's happened more than once in a year, but it's still possible.

-Peter

On Mon, May 24, 2010 at 9:10 AM, <karl.wri...@nokia.com> wrote:
> Hi all,
>
> It seems to me that the “commit” logic in the Solr updateRequestHandler (or
> wherever the logic is actually located) conflates two different semantics.
> One semantic is what you need to do to make the index process perform well.
> The other semantic is guaranteed atomicity of document reception by Solr.
>
> In particular, it would be nice to be able to post documents in such a way
> that you can guarantee that the document is permanently in Solr’s queue,
> safe in the event of a Solr restart, etc., even if the document has not yet
> been “committed”.
>
> This issue came up in the LCF talk that I gave, and I initially thought that
> separating the two kinds of events would necessarily be an LCF change, but
> the more I thought about it the more I realized that other Solr indexing
> clients may also benefit from such a separation.
>
> Does anyone agree?  Where should this logic properly live?
>
> Thanks,
> Karl
>
>
>
>



-- 
Peter M. Wolanin, Ph.D.
Momentum Specialist, Acquia, Inc.
peter.wola...@acquia.com
