Re: [jira] Commented: (JCR-169) Make Jackrabbit clusterable
Marcel Reutegger wrote:
> Ian Boston wrote:
>> So, if you have 50x200MB of Lucene index... for example and wanted that to be accessible in a cluster environment, would Jackrabbit be a good place to put those segments ?
>
> just to clarify, would this lucene index be 'application data', which is stored like regular content through the JCR api? Or do you mean the jackrabbit internal lucene segments?

This is application data from JCR's point of view.

>> The big killer for Lucene is the ability to seek efficiently on the central blob (I think), but presumably by choosing the right Binary storage strategy that comes partially for free ?
>
> Jackrabbit always copies a binary to a temp file or into memory when the property value is accessed. That is, the seek would always be local. But as I already mentioned in another thread, JCR does not support random access on binary properties. A binary property returns a plain InputStream.

understood.

>> If this is the case, I could replace my, slightly odd, segment distribution mechanism with Jackrabbit.
>
> yes, you certainly get a couple of goodies you otherwise don't have. e.g. observation on the index files ;)
>
>> Last question, Is JCR-169 being actively worked on ?
>
> It doesn't have a high priority, but we are working on it on a conceptual level. discussions during coffee breaks, etc. Basically how the problems stated in JCR-169 can be solved and what needs to be changed in the core to implement the feature blocks in a clustered environment.
>
>> Is there an area where another pair of hands would help... I would like to be able to deploy Jackrabbit in a cluster.
>
> One major area is how changes from one cluster node are distributed to other cluster nodes. Giota implemented something like a prototype, but I'm not sure what the current state is. See also this discussion: http://thread.gmane.org/gmane.comp.apache.jackrabbit.devel/6935

Thank you for the pointer. I'll read it. There has been some use of JGroups for cluster wide distribution of events but it might not make sense here.

> Or any other area mentioned in JCR-169, you can simply pick one ;)

Ok, when I get pressure to make it work in a cluster, I'll jump in.

Ian

> regards
> marcel
Re: [jira] Commented: (JCR-169) Make Jackrabbit clusterable
Ian Boston wrote:
> So, if you have 50x200MB of Lucene index... for example and wanted that to be accessible in a cluster environment, would Jackrabbit be a good place to put those segments ?

just to clarify, would this lucene index be 'application data', which is stored like regular content through the JCR api? Or do you mean the jackrabbit internal lucene segments?

> The big killer for Lucene is the ability to seek efficiently on the central blob (I think), but presumably by choosing the right Binary storage strategy that comes partially for free ?

Jackrabbit always copies a binary to a temp file or into memory when the property value is accessed. That is, the seek would always be local. But as I already mentioned in another thread, JCR does not support random access on binary properties. A binary property returns a plain InputStream.

> If this is the case, I could replace my, slightly odd, segment distribution mechanism with Jackrabbit.

yes, you certainly get a couple of goodies you otherwise don't have. e.g. observation on the index files ;)

> Last question, Is JCR-169 being actively worked on ?

It doesn't have a high priority, but we are working on it on a conceptual level. discussions during coffee breaks, etc. Basically how the problems stated in JCR-169 can be solved and what needs to be changed in the core to implement the feature blocks in a clustered environment.

> Is there an area where another pair of hands would help... I would like to be able to deploy Jackrabbit in a cluster.

One major area is how changes from one cluster node are distributed to other cluster nodes. Giota implemented something like a prototype, but I'm not sure what the current state is. See also this discussion: http://thread.gmane.org/gmane.comp.apache.jackrabbit.devel/6935

Or any other area mentioned in JCR-169, you can simply pick one ;)

regards
marcel
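As an illustration of the point about binary properties: since a binary property only hands back a plain InputStream, an application that needs seeks (as Lucene does on its segment files) would have to spool the stream to local storage first and seek there. The following is only a sketch of that idea in plain Java. The class and method names (LocalBinaryCopy, spoolToTempFile, readAt) are made up for this example and are not part of Jackrabbit or the JCR API; any InputStream stands in for the stream a binary property would return.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration only; not Jackrabbit or JCR API.
final class LocalBinaryCopy {

    // Spools a sequential stream into a local temp file so that it can be
    // accessed randomly afterwards. The caller owns (and should delete) the file.
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("segment-", ".bin");
        try {
            // createTempFile already created the file, so allow overwriting it.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            Files.deleteIfExists(tmp);
            throw e;
        }
        return tmp;
    }

    // Reads exactly `length` bytes starting at `offset` with a local seek.
    // Throws EOFException if the file is shorter than offset + length.
    static byte[] readAt(Path file, long offset, int length) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset);
            byte[] buf = new byte[length];
            raf.readFully(buf);
            return buf;
        }
    }
}
```

The cost of this approach is one full sequential read per binary before the first seek, which is why it suits files that are written once and read many times.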
Re: [jira] Commented: (JCR-169) Make Jackrabbit clusterable
Marcel,

I'm replying to the list rather than Jira, since this is OT wrt JCR-169.

So, if you have 50x200MB of Lucene index... for example and wanted that to be accessible in a cluster environment, would Jackrabbit be a good place to put those segments ?

The big killer for Lucene is the ability to seek efficiently on the central blob (I think), but presumably by choosing the right Binary storage strategy that comes partially for free ?

If this is the case, I could replace my, slightly odd, segment distribution mechanism with Jackrabbit.

Last question, is JCR-169 being actively worked on ? Is there an area where another pair of hands would help... I would like to be able to deploy Jackrabbit in a cluster.

Ian

Marcel Reutegger (JIRA) wrote:
> [ http://issues.apache.org/jira/browse/JCR-169?page=comments#action_12432083 ]
>
> Marcel Reutegger commented on JCR-169:
> --
> Ian, thanks a lot for your comments. Here are my current thoughts on clustering the search index in jackrabbit:
>
> I think the preferred approach is to put the index into the repository itself. See: http://article.gmane.org/gmane.comp.apache.jackrabbit.devel/8530 and following messages
>
> This would also allow us to distribute index updates to cluster nodes using the repository internal observation mechanism. e.g. the update of a deleted documents file or new index segments.
>
>> I found the best indexing strategy was to have local copies of segments, stored centrally as masters.
>
> I agree. Specifically the design of lucene where index files are only created but never modified supports this approach very nicely.
>
>> In the search application, speed of update of segments is not that critical, you probably have a different requirement in JCR.
>
> JCR is more restrictive in that respect, at least if we want to be compliant with the specification. As soon as a node is created in the workspace it must be searchable using a query. For most real life systems this is not a hard requirement though. E.g. when a document is added to a repository, it usually doesn't matter if it is retrievable by query only after a couple of seconds and not right away.
>
>> Make Jackrabbit clusterable
>> ---
>>
>> Key: JCR-169
>> URL: http://issues.apache.org/jira/browse/JCR-169
>> Project: Jackrabbit
>> Issue Type: New Feature
>> Components: core
>> Reporter: Marcel Reutegger
>> Priority: Minor
>>
>> This jira issue discusses the technical implications on the current design of Jackrabbit to introduce clustering.
>> Particularly the following areas require thorough investigation:
>> - SharedItemStateManager and its cache
>> - cache integrity
>> - cache design: look aside, write through?
>> - hook for distributed cache, interface?
>> - isolation level
>> - transaction integrity within Jackrabbit, interaction with transient layer
>> - VirtualItemStateProvider
>> - same strategy as SharedItemStateManager?
>> - Search index
>> - single or per cluster node index?
>> - Observation
>> Please state more areas if needed.
[jira] Commented: (JCR-169) Make Jackrabbit clusterable
[ http://issues.apache.org/jira/browse/JCR-169?page=comments#action_12432083 ]

Marcel Reutegger commented on JCR-169:
--
Ian, thanks a lot for your comments. Here are my current thoughts on clustering the search index in jackrabbit:

I think the preferred approach is to put the index into the repository itself. See: http://article.gmane.org/gmane.comp.apache.jackrabbit.devel/8530 and following messages

This would also allow us to distribute index updates to cluster nodes using the repository internal observation mechanism. e.g. the update of a deleted documents file or new index segments.

> I found the best indexing strategy was to have local copies of segments, stored centrally as masters.

I agree. Specifically the design of lucene where index files are only created but never modified supports this approach very nicely.

> In the search application, speed of update of segments is not that critical, you probably have a different requirement in JCR.

JCR is more restrictive in that respect, at least if we want to be compliant with the specification. As soon as a node is created in the workspace it must be searchable using a query. For most real life systems this is not a hard requirement though. E.g. when a document is added to a repository, it usually doesn't matter if it is retrievable by query only after a couple of seconds and not right away.

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (JCR-169) Make Jackrabbit clusterable
[ http://issues.apache.org/jira/browse/JCR-169?page=comments#action_12431999 ]

Ian Boston commented on JCR-169:

Search - I assume this is the lucene indexes ? If you haven't got to it already, I'm interested in this Jira because I also want to run in a DB cluster.

I've just finished implementing a search engine based on Lucene in such a cluster, where the only thing shared is the DB. It's in production in one or two places with ~10G of index segments on 3+ cluster nodes. The impl is not that great (compared to nutch), but here is what I found on the way.

Lucene segments in the DB only work in Oracle (and perhaps other DBs), where there is reasonable seek performance on blobs. MySQL (for instance) is hopeless at BLOB seeks.

Indexes on a shared filesystem generate lots of network traffic.

NDFS (the MapReduce file system) is great but a complete pain to set up, as is an rsync based strategy for segment distribution.

I found the best indexing strategy was to have local copies of segments, stored centrally as masters. When a node in the cluster performs an index operation, a new master segment is created and the other nodes sync the master segments. In the search application, speed of update of segments is not that critical, you probably have a different requirement in JCR.

The only point in this strategy that requires a distributed lock is when segments are merged (which has to be done to reduce the number of open files) or when documents are deleted from the lucene index.

As I said, the strategy works in production for 50x200MB segments on 3+ cluster nodes, without excessive network traffic.

If there was an easy NDFS setup that could be coded in Java, that would probably be a better solution.

The project is www.sakaiproject.org where I would also like to use Jackrabbit :)
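To make the "local copies, central masters" strategy a bit more concrete, here is a rough sketch of the sync step a cluster node might run. It is not the code used in the search engine described above; SegmentSync and its methods are hypothetical names, and the master store is modelled as a plain directory (in the setup described it would be the shared DB). It relies on the property, also noted elsewhere in this thread, that Lucene index files are created but never modified, so a segment that already exists locally never needs to be copied again. Merges and deletes, which need the distributed lock, are outside the sketch.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; the master store is assumed to be a directory here.
final class SegmentSync {

    // Returns the names of all regular files in the given directory.
    static Set<String> segmentNames(Path dir) throws IOException {
        Set<String> names = new HashSet<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    names.add(p.getFileName().toString());
                }
            }
        }
        return names;
    }

    // Makes `local` mirror `master`: copies segments missing locally and
    // removes local files the master no longer lists. Files present on both
    // sides are left alone, assuming segments are immutable once created.
    // Returns the number of segments copied.
    static int sync(Path master, Path local) throws IOException {
        Set<String> masterNames = segmentNames(master);
        Set<String> localNames = segmentNames(local);
        int copied = 0;
        for (String name : masterNames) {
            if (!localNames.contains(name)) {
                // Copy under a temporary name, then rename, so that a reader
                // of the local directory never sees a half-written segment.
                Path tmp = local.resolve(name + ".part");
                Files.copy(master.resolve(name), tmp, StandardCopyOption.REPLACE_EXISTING);
                Files.move(tmp, local.resolve(name), StandardCopyOption.ATOMIC_MOVE);
                copied++;
            }
        }
        for (String name : localNames) {
            if (!masterNames.contains(name)) {
                // Also cleans up stale ".part" files left by an interrupted sync.
                Files.deleteIfExists(local.resolve(name));
            }
        }
        return copied;
    }
}
```

Because only missing files are transferred, repeated syncs of an unchanged master copy nothing, which is consistent with the low network traffic reported above.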