Re: Distributed Indexes

Ruslan Sivak Mon, 11 Feb 2008 12:33:57 -0800

Basically, the index is big because there is a large number of documents, but each individual document is very small. There is also a lot of redundancy, which I believe is also why the index size is fairly small.

Basically, I am using the index to store the user's profile information, and then using the similarity search to find similar profiles. The update rate can be fairly high, depending on how many users decide to create/update their profiles at the same time.

The problem arises when many people decide to update their profiles and the Windows File Replication Service doesn't get a chance to synchronize the folders between updates. (It's possible for two people to make updates at the same time, on different servers.)

I'm thinking of maybe subclassing Directory or FSDirectory and overriding it to read and write from a database. Would this work, or is there a better solution?


Russ




Grant Ingersoll wrote:

Solr has a strategy using rsync that makes it relatively easy to copy an index around to other servers. It uses rsync to just copy the diffs, so you could easily mirror this in your application.

There is no SQL backend for Lucene, but at 4mb you could certainly serialize it as a blob to a SQL db, but I don't see how that would make it faster.

Also, you said it takes a long time to create a 4mb index. Does this mean you are doing something really, really complex during analysis? I guess what I am missing, and I think others have hinted at, is the big picture isn't quite clear in our minds, because the size of the index seems almost trivially small in Lucene terms, so we would think that a) it would be really fast to create the index (seconds, not minutes) and b) such a small index could be easily held entirely in memory and should easily handle a very, very high query rate given reasonable hardware, which it sounds like you have. The other piece that doesn't fit in my mind is it sounds like you have a fairly high update rate, since you are getting writes before your 4mb index can be copied on a local network, right? This implies that you must also have a lot of deletes, otherwise your index would be growing significantly.

Thus, more details on what you are doing, how you are creating your index, how your CF app talks to Lucene, etc. would be good.
-Grant
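
The "just copy the diffs" idea above can be sketched in plain Java. This is only an illustration under stated assumptions, not what rsync or Solr's scripts actually do: the class name IndexMirror and its sync method are hypothetical, it compares file sizes and bytes instead of rsync's block checksums, and it assumes nothing is writing to the source directory while the copy runs.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not part of Lucene or Solr.
final class IndexMirror {

    // Lists the regular files directly inside dir, sorted by path.
    private static List<Path> regularFiles(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
    }

    // True when dst is missing, or its size or bytes differ from src.
    private static boolean differs(Path src, Path dst) throws IOException {
        if (!Files.isRegularFile(dst) || Files.size(src) != Files.size(dst)) {
            return true;
        }
        return !Arrays.equals(Files.readAllBytes(src), Files.readAllBytes(dst));
    }

    // Makes replica match source and returns the names of the files that were copied.
    static List<String> sync(Path source, Path replica) throws IOException {
        Files.createDirectories(replica);
        Set<String> sourceNames = new TreeSet<>();
        List<String> copied = new ArrayList<>();
        for (Path src : regularFiles(source)) {
            String name = src.getFileName().toString();
            sourceNames.add(name);
            Path dst = replica.resolve(name);
            if (differs(src, dst)) {
                Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
                copied.add(name);
            }
        }
        // Remove files the source no longer has (for example, segments that were merged away).
        for (Path dst : regularFiles(replica)) {
            if (!sourceNames.contains(dst.getFileName().toString())) {
                Files.delete(dst);
            }
        }
        return copied;
    }
}
```

Calling IndexMirror.sync(source, replica) a second time without any change on the source copies nothing, which is the property that makes frequent syncs of a small index cheap. Reading both copies to compare bytes is only sensible for local or very small indexes; across a network a checksum exchange would be the usual substitute.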


On Feb 10, 2008, at 12:55 PM, Ruslan Sivak wrote:
So nobody's run into anything like this before? The need to share the index between many copies of the app possibly running on multiple servers?
Russ

Ruslan Sivak wrote:
The app does other things than search the index. I'm basically using ColdFusion for the website and have four instances running on two servers for load balancing. Each app does the searches, and the search times are small, the index is small, but it takes a long time to fully create the index (several minutes), and I would like the index to always be up to date (which is why I replicate the changes).

I basically cache the index for several minutes in a RAMDirectory, which works quite well for performance. If I could store the index in a SQL table or something, I could have a single place where the index lives and atomic updates.

Is there a SQL backend for the index, or should I just take the RAMDirectory, serialize it and store it as a BLOB?
Russ
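
The "cache the index for several minutes" approach described above can be sketched as a small time-based holder. This is a minimal illustration, not code from the thread: ExpiringCache is a hypothetical name, the loader stands in for whatever opens the on-disk index into memory, and the clock is passed in only so the behavior can be checked without waiting.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical helper for illustration; not a Lucene class.
final class ExpiringCache<T> {
    private final Supplier<T> loader;       // assumed to build a fresh in-memory copy
    private final long ttlMillis;           // how long a loaded copy is served
    private final LongSupplier clockMillis; // e.g. System::currentTimeMillis in production
    private T value;
    private long loadedAt;
    private boolean loaded;

    ExpiringCache(Supplier<T> loader, long ttlMillis, LongSupplier clockMillis) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
        this.clockMillis = clockMillis;
    }

    // Returns the cached copy, reloading it first if it is missing or older than the TTL.
    synchronized T get() {
        long now = clockMillis.getAsLong();
        if (!loaded || now - loadedAt >= ttlMillis) {
            value = loader.get();
            loadedAt = now;
            loaded = true;
        }
        return value;
    }

    // Forces the next get() to reload, for example right after a local write.
    synchronized void invalidate() {
        loaded = false;
    }
}
```

With a TTL of a few minutes, searches on each instance see updates from the other instances at most one TTL late, which is the trade-off this kind of cache makes against always reading the shared copy.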

Erick Erickson wrote:
With an index that small, I wonder why you bother with so many copies? What kind of load are you hitting it with and how complex are the queries? Because unless you have a *very* high query rate, I'd look at why my queries were taking so long before complexifying things this way.

Best
Erick

On Feb 7, 2008 4:52 PM, Ruslan Sivak <[EMAIL PROTECTED]> wrote:
My index is only 4mb.  Is there a SQL backend for Lucene?

Russ

Michael McCandless wrote:
If you're able to tell Windows FRS which specific files to copy, then SnapshotDeletionPolicy (in 2.3) should work for this.

It basically protects a consistent snapshot of your index, ensuring those files will not be deleted, while not blocking further updates to the index.

Mike

Ruslan Sivak wrote:
I'm wondering if this is a problem that Lucene users have already tackled. I have four copies of the application using a Lucene index. They are located on two physical servers with two copies on each server accessing two copies of the Lucene index. I use Windows FRS (File Replication Service) to replicate the index between the two servers.

Things work well most of the time, but sometimes, I believe under load, the index doesn't get a chance to propagate before another write takes place and it gets corrupted.

What would you recommend I use to keep the index in sync between the four copies of the app?

Russ
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
http://www.lucenebootcamp.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ




