We're already using Solr in our implementation (e.g. 
http://me.edu.au/public/search?q=blog&category=BLOG or 
http://me.edu.au/public/search?q=blog&category=BLOG_ENTRY).

We have Solr deployed in an external web app, not inside Roller itself. We have 
an index request table, which contains index requests for inserts, updates & 
deletes. That table is built using database triggers, and read using a batch 
process which sends updates to Solr.

I have two concerns with the linked proposal. Neither is a deal breaker, though.

1) The database updates and the web service call are not covered by a single 
transaction, so there is potential for inconsistency between the search index 
and the database. There are workarounds for that (keeping the transaction open 
and somehow passing the connection to the listeners so they can roll it back 
in the event of a problem), but they do have performance penalties.

2) The performance of Solr on updates is much better if you can batch your 
updates. We found that a batch size of 20 Solr updates gave up to 90% better 
performance than sending single updates (even when commits are only sent 
periodically).
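
To illustrate what batching means here, a minimal sketch in Python. The names 
batched, send_in_batches and post_batch are made up for this example; 
post_batch stands in for whatever call actually posts a list of documents to 
Solr in one request, and the default of 20 is simply the batch size mentioned 
above.

```python
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items from `items`."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def send_in_batches(docs, post_batch, batch_size=20):
    """Send docs in groups instead of one request per document.

    post_batch is a hypothetical callable that sends one list of
    documents to the search server. Returns the number of requests made.
    """
    requests_made = 0
    for batch in batched(docs, batch_size):
        post_batch(batch)
        requests_made += 1
    return requests_made
```

With 45 pending documents this makes three requests (20, 20 and 5 documents) 
rather than 45.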

I'd propose a slight modification of the proposal:
1) Anything which needs to be indexed also writes a row to an index request 
table
2) We use the listener as proposed to fire off an index request, which reads 
any unindexed rows from that table, sends those updates to Solr, and updates 
the table to show they have been successfully indexed.
3) On startup, we use similar code to index anything in that table which hasn't 
already been indexed.
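
The three steps above could look roughly like the following. This is only a 
sketch under assumptions, not our actual code: it uses Python and SQLite for 
brevity, the table and column names (entry, index_request, indexed) are 
invented for the example, and send_to_search is a placeholder for the real 
call to Solr. Reading the pending rows in chunks of batch_size is an extra, 
not part of the steps as listed.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical tables: one for content, one for pending index requests.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entry (id INTEGER PRIMARY KEY, title TEXT)"
    )
    conn.execute(
        """CREATE TABLE IF NOT EXISTS index_request (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               entry_id INTEGER NOT NULL,
               action TEXT NOT NULL,
               indexed INTEGER NOT NULL DEFAULT 0)"""
    )


def save_entry(conn, entry_id, title):
    # Step 1: the content change and its index request are written in the
    # same database transaction, so either both are stored or neither is.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO entry (id, title) VALUES (?, ?)",
            (entry_id, title),
        )
        conn.execute(
            "INSERT INTO index_request (entry_id, action) VALUES (?, 'update')",
            (entry_id,),
        )


def process_pending(conn, send_to_search, batch_size=20):
    # Steps 2 and 3: called from the listener and again on startup.
    # Rows are marked as indexed only after send_to_search returns; if it
    # raises, they stay pending and are picked up by the next run.
    total = 0
    while True:
        rows = conn.execute(
            "SELECT id, entry_id, action FROM index_request "
            "WHERE indexed = 0 ORDER BY id LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            return total
        send_to_search([(entry_id, action) for _, entry_id, action in rows])
        with conn:
            conn.executemany(
                "UPDATE index_request SET indexed = 1 WHERE id = ?",
                [(row[0],) for row in rows],
            )
        total += len(rows)
```

One consequence of this design is that a request can be sent to the search 
server more than once (for example, if the process dies after sending but 
before marking the rows), so the indexing operation needs to be safe to repeat.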

This proposal doesn't address my issue (2) above (although there are some easy 
optimizations to do that), but it does mean we'd avoid transactional problems.

I'm happy to donate the code we have if people are interested. The fact that 
it relies on triggers will pretty much rule it out as a general solution as it 
stands, but there might be some useful code there.

Nick

-----Original Message-----
From: Dave [mailto:[EMAIL PROTECTED]
Sent: Thursday, 11 December 2008 12:14 AM
To: [email protected]
Subject: Proposal - Clusterable Search Via Solr

Roller uses Lucene for search and Lucene stores its search index on
disk. So, if you have multiple Roller instances running you can either
1) have both Lucene instances write to the same disk file which will
fail 2) have two inconsistent search indexes which will be extremely
irritating or 3) turn off Roller's built-in search and use some
external spider.

For those who are not happy with those choices, I offer this proposal
to use a) Apache Solr and b) some improvements to Roller's plug-in
infrastructure to enable a cluster-able search implementation in
Roller.

   http://cwiki.apache.org/confluence/x/_5kB

Feedback is welcome.

- Dave
