[ 
http://issues.apache.org/jira/browse/JCR-169?page=comments#action_12431999 ] 
            
Ian Boston commented on JCR-169:
--------------------------------


Search -
I assume this refers to the Lucene indexes?
In case you haven't got to it already...

I'm interested in this issue because I also want to run in a DB cluster. 
I've just finished implementing a search engine based on Lucene in such a 
cluster, where the only thing shared is the DB. It's in production in one or 
two places with ~10 GB of index segments on 3+ cluster nodes. The 
implementation is not that great (compared to Nutch), but here is what I found 
along the way.

Lucene segments in the DB only work in Oracle (and perhaps other DBs), where 
seek performance on BLOBs is reasonable. MySQL (for instance) is hopeless at 
BLOB seeks. Indexes on a shared filesystem generate lots of network traffic. 
NDFS (the MapReduce file system) is great but a complete pain to set up, as is 
an rsync-based strategy for segment distribution. I found the best indexing 
strategy was to keep local copies of segments on each node, with the master 
copies stored centrally. When a node in the cluster performs an index 
operation, a new master segment is created and the other nodes sync from the 
master segments. 
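
The sync step could look roughly like the sketch below. This is only an 
illustration, not the code we run in production: the class name 
SegmentSyncPlanner, the Plan type, and the idea of tracking a numeric version 
per segment (e.g. a timestamp column next to the master copy) are all 
assumptions made for the example.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical helper: decides what a node must do to match the master segments. */
final class SegmentSyncPlanner {

    /** Result of comparing this node's local segment copies against the central masters. */
    static final class Plan {
        final SortedSet<String> toFetch = new TreeSet<>();
        final SortedSet<String> toDelete = new TreeSet<>();
    }

    /**
     * @param master segment name -> version of the central master copy (assumed numeric)
     * @param local  segment name -> version of the copy on this node's disk
     */
    static Plan plan(Map<String, Long> master, Map<String, Long> local) {
        Plan plan = new Plan();
        for (Map.Entry<String, Long> e : master.entrySet()) {
            Long localVersion = local.get(e.getKey());
            // Fetch segments this node has never seen, or whose master copy is newer.
            if (localVersion == null || localVersion < e.getValue()) {
                plan.toFetch.add(e.getKey());
            }
        }
        for (String name : local.keySet()) {
            // A local segment with no master was merged away or removed by another node.
            if (!master.containsKey(name)) {
                plan.toDelete.add(name);
            }
        }
        return plan;
    }
}
```

Each node would run something like this periodically, copy down the segments 
in toFetch, drop those in toDelete, and then reopen its index reader.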

In the search application, the speed at which segment updates propagate is not 
that critical; you probably have a different requirement in JCR.

The only point in this strategy that requires a distributed lock is when 
segments are merged (which has to be done to reduce the number of open files) 
or when documents are deleted from the Lucene index.
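
To make that concrete, here is a minimal sketch of guarding a merge or delete 
with a cluster-wide lock. The ClusterLock interface and the other names are 
hypothetical. In a setup where only the DB is shared, the lock would 
presumably be backed by a row in that DB; the in-memory version here is just a 
stand-in so the example runs on its own.

```java
/** Hypothetical cluster-wide lock; in a shared-DB cluster it might be backed by a DB row. */
interface ClusterLock {
    boolean tryLock(String nodeId);
    void unlock(String nodeId);
}

/** In-memory stand-in so the sketch runs without a database. */
final class InMemoryClusterLock implements ClusterLock {
    private String owner;

    public synchronized boolean tryLock(String nodeId) {
        if (owner != null) {
            return false; // some node already holds the lock
        }
        owner = nodeId;
        return true;
    }

    public synchronized void unlock(String nodeId) {
        // Only the current owner may release the lock.
        if (nodeId.equals(owner)) {
            owner = null;
        }
    }
}

final class GuardedIndexOperation {
    /** Runs a merge or delete only if this node obtains the lock; returns whether it ran. */
    static boolean runExclusive(ClusterLock lock, String nodeId, Runnable mergeOrDelete) {
        if (!lock.tryLock(nodeId)) {
            return false; // another node is merging or deleting; the caller can retry later
        }
        try {
            mergeOrDelete.run();
            return true;
        } finally {
            lock.unlock(nodeId);
        }
    }
}
```

Plain indexing of new documents would not go through this path, since it only 
creates a new master segment.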

As I said, the strategy works in production for 50 x 200 MB segments on 3+ 
cluster nodes, without excessive network traffic. If there was an easy NDFS 
setup that could be coded in Java, that would probably be a better solution.

The project is www.sakaiproject.org, where I would also like to use 
Jackrabbit :)

> Make Jackrabbit clusterable
> ---------------------------
>
>                 Key: JCR-169
>                 URL: http://issues.apache.org/jira/browse/JCR-169
>             Project: Jackrabbit
>          Issue Type: New Feature
>          Components: core
>            Reporter: Marcel Reutegger
>            Priority: Minor
>
> This jira issue discusses the technical implications on the current design of 
> Jackrabbit to introduce clustering.
> Particularly the following areas require thorough investigation:
> - SharedItemStateManager and its cache
>     - cache integrity
>     - cache design: look aside, write through?
>     - hook for distributed cache, interface?
>     - isolation level
>     - transaction integrity within Jackrabbit, interaction with transient 
> layer
> - VirtualItemStateProvider
>     - same strategy as SharedItemStateManager?
> - Search index
>     - single or per cluster node index?
> - Observation
> Please state more areas if needed.

        
