Re: Lucene 2.9.0 Near Real Time Indexing and Service Crashes/restarts

jchang Wed, 13 Jan 2010 11:08:53 -0800

I don't specifically need a cluster of servers writing indexes.  Actually, at
the moment, I only have one server, but multiple message consuming threads,
so I still land back at the same problem of contention for the index lock.
Why do I have multiple message consumers?  Speed...I wanted to dequeue my
items to be indexed fast.  However, I'm getting the impression that may have
been a foolish effort.  I find that only having one writer thread is not
much slower than having 20, which makes sense if they are all waiting on one
file.  If a single writer thread is fast enough (which also gets rid of the
timeout exceptions that I asked about in a different thread), then that is
good enough for me.
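
To make the single-writer idea concrete, here is a rough sketch of what I have in mind: the message consumers only enqueue, and one thread drains the queue in batches and is the only one that ever touches the index. This is not working Compass or Lucene code. The class name SingleWriterIndexer, the DocSink interface (a stand-in for whatever actually calls the index writer), and the batch size are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Sketch: many consumer threads enqueue, exactly one thread writes. */
class SingleWriterIndexer {

    /** Hypothetical stand-in for the code that really writes to the index. */
    interface DocSink {
        void write(List<String> batch) throws Exception;
    }

    // Shutdown marker, compared by identity so no real document can match it.
    private static final String POISON = new String("__STOP__");

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Thread writer;
    private volatile Exception failure;

    SingleWriterIndexer(DocSink sink, int maxBatch) {
        writer = new Thread(() -> drain(sink, maxBatch), "index-writer");
        writer.start();
    }

    /** Called by any number of consumer threads; never touches the index. */
    void submit(String doc) {
        queue.add(doc);
    }

    /** Flushes what is queued, stops the writer thread, rethrows its error. */
    void close() throws Exception {
        queue.add(POISON);
        writer.join();
        if (failure != null) {
            throw failure;
        }
    }

    private void drain(DocSink sink, int maxBatch) {
        try {
            while (true) {
                String first = queue.take(); // block until work arrives
                boolean stop = (first == POISON);
                if (!stop) {
                    List<String> batch = new ArrayList<>();
                    batch.add(first);
                    // Take whatever else is already waiting, up to maxBatch.
                    while (batch.size() < maxBatch) {
                        String next = queue.poll();
                        if (next == null) {
                            break;
                        }
                        if (next == POISON) {
                            stop = true;
                            break;
                        }
                        batch.add(next);
                    }
                    sink.write(batch); // the only place the index is written
                }
                if (stop) {
                    return;
                }
            }
        } catch (Exception e) {
            failure = e; // surfaced to the caller by close()
        }
    }
}
```

With this shape nothing ever waits on the index lock except the one writer thread, so the lock timeouts should not be able to occur. Whether the batching helps at all is something I would still have to measure.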


Do you know what kind of index writes per second I can hope to hit with one
writer thread?  I guess it depends on many factors.  

Also, I know 2.9.0 is faster than 2.4.0 (which I'm on), but I'm not sure I
can move up to 2.9.0 really easily because all my Lucene usage is wrapped in
Compass, which does not yet support 2.9.0.  I think I'd have to rewrite my
service to use straight Lucene, which might be a good idea, but I can't do
that quickly.  We don't use Solr.

Thanks for your help thus far and thanks in advance for any more responses.



Jake Mannix wrote:
> 
> On Tue, Jan 12, 2010 at 8:15 PM, Otis Gospodnetic <
> [email protected]> wrote:
> 
>> John, you should have a look at Zoie.  I just finished adding LinkedIn's
>> case study about Zoie to Lucene in Action 2, so this is fresh in my mind.
> 
> :)
>>
> 
> Yep, Zoie ( http://zoie.googlecode.com ) will handle the server restart
> part, in that while yes, you lose what is in RAM, Zoie keeps track of an
> "index version" on disk alongside the Lucene index which it uses to decide
> where it must reindex from to "catch up" if there have been incoming
> indexing events while the server was out of commission.
> 
> Zoie does not support multiple servers using the same index, because each
> zoie instance has IndexWriter instances, and you'll get locking problems
> trying to do that.  You could have one Zoie instance effectively as the
> "master/writer/realtime reader", and a bunch of raw Lucene "slaves" which
> could read off of that index, but as you say, could not get access to the
> RAMDirectory information until it was flushed to disk.
> 
> Why do you need a "cluster" of servers hitting the same index?  Are they
> different applications (with different search logic, so they need to be
> different instances), or is it just to try and utilize your hardware
> efficiently?  If it's for performance reasons, you might find you get better
> use of your CPU cores by just sharding your one index into smaller ones,
> each having their own Zoie instance, and putting a "broker" on top of them
> searching across all and mergesorting the results.  Often even this isn't
> necessary, because Zoie will be opening the disk-backed IndexReader in
> readonly mode, and thus all the synchronized blocks are gone, and one single
> Zoie instance will easily saturate your cpu cores by simple multi-threading
> by your appserver.
> 
> If you really needed to do many different kinds of writes (from different
> applications) and also have applications not involved in the writing also
> seeing (in real-time) these writes, then you could still do it with Zoie,
> but it would take some interesting architectural juggling (write your own
> StreamDataProvider class which takes input from a variety of sources and
> merges them together to feed to one Zoie instance, then a broker on top of
> zoie which serves out IndexReaders to different applications living on top
> which can wrap them up in their own business logic as they saw fit... as
> long as it was ok to have all the applications in the same JVM, of course).
> 
>   -jake
> 
> 
>>
>>  Otis
>> --
>> Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
>>
>>
>>
>> ----- Original Message ----
>> > From: jchang <[email protected]>
>> > To: [email protected]
>> > Sent: Tue, January 12, 2010 6:10:56 PM
>> > Subject: Lucene 2.9.0 Near Real Time Indexing and Service Crashes/restarts
>> >
>> >
>> > Lucene 2.9.0 has near real time indexing, writing to a RAMDir which gets
>> > flushed to disk when you do a search.
>> >
>> > Does anybody know how this works out with service restarts (both orderly
>> > shutdown and a crash)?  If the service goes down while indexed items are in
>> > RAMDir but not on disk, are they lost?  Or is there some kind of log
>> > recovery?
>> >
>> > Also, does anybody know the impact of this with clustered lucene servers?
>> > If you have numerous servers running off one index, I assume there is no way
>> > for the other services to pick up the newly indexed items until they are
>> > flushed to disk, correct?  I'd be happy if that is not so, but I suspect it
>> > is so.
>> >
>> > Thanks,
>> > John
>> > --
>> > View this message in context:
>> > http://old.nabble.com/Lucene-2.9.0-Near-Real-Time-Indexing-and-Service-Crashes-restarts-tp27136539p27136539.html
>> > Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>> >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: [email protected]
>> > For additional commands, e-mail: [email protected]
>>
> 
> 

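For reference, the "broker" merge step Jake describes in the quoted message could look roughly like the sketch below. It assumes each shard returns its hits already sorted by descending score and that scores from different shards can be compared directly, which may not hold exactly when each shard has its own term statistics. ShardBroker, Hit, and mergeTopK are names invented for this sketch, not Zoie or Lucene API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Sketch of a broker that merges per-shard results into one top-k list. */
class ShardBroker {

    /** Hypothetical minimal hit: a document id plus its score. */
    static final class Hit {
        final String docId;
        final float score;

        Hit(String docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /**
     * Merges shard result lists, each sorted by descending score, into the
     * global top k. Only the current head of each shard sits in the heap.
     */
    static List<Hit> mergeTopK(List<List<Hit>> shardHits, int k) {
        // Heap entries are {shard index, position in that shard's list},
        // ordered so the entry with the highest score comes out first.
        PriorityQueue<int[]> heads = new PriorityQueue<>((a, b) ->
            Float.compare(shardHits.get(b[0]).get(b[1]).score,
                          shardHits.get(a[0]).get(a[1]).score));

        for (int s = 0; s < shardHits.size(); s++) {
            if (!shardHits.get(s).isEmpty()) {
                heads.add(new int[] {s, 0});
            }
        }

        List<Hit> merged = new ArrayList<>();
        while (merged.size() < k && !heads.isEmpty()) {
            int[] head = heads.poll();
            List<Hit> shard = shardHits.get(head[0]);
            merged.add(shard.get(head[1]));
            // Advance within the shard we just consumed from.
            if (head[1] + 1 < shard.size()) {
                heads.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }
}
```

Each shard would be searched on its own thread and the lists handed to mergeTopK, so the merge itself stays cheap compared with the searches.
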
-- 
View this message in context: 
http://old.nabble.com/Lucene-2.9.0-Near-Real-Time-Indexing-and-Service-Crashes-restarts-tp27136539p27148813.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.