Re: [jira] Commented: (JCR-169) Make Jackrabbit clusterable

2006-09-01 Thread Ian Boston

Marcel Reutegger wrote:

Ian Boston wrote:
So, if you have 50x200MB of Lucene index... for example and wanted 
that to be accessible in a cluster environment, would Jackrabbit be a 
good place to put those segments ?


just to clarify, would this lucene index be 'application data', which is 
stored like regular content through the JCR api? Or do you mean the 
jackrabbit internal lucene segments?


This is application data from JCR's point of view.



The big killer for Lucene is the ability to seek efficiently on the 
central blob (I think), but presumably by choosing the right Binary 
storage strategy that comes partially for free ?


Jackrabbit always copies a binary to a temp file or into memory when the 
property value is accessed. That is, the seek would always be local. But 
as I already mentioned in another thread, JCR does not support random 
access on binary properties. A binary property returns a plain InputStream.




understood.


If this is the case, I could replace my, slightly odd, segment 
distribution mechanism with Jackrabbit.


yes, you certainly get a couple of goodies you otherwise don't have. 
e.g. observation on the index files ;)



Last question,
Is JCR-169 being actively worked on ?


It doesn't have a high priority, but we are working on it on a 
conceptual level. discussions during coffee breaks, etc. Basically how 
the problems stated in JCR-169 can be solved and what needs to be 
changed in the core to implement the feature blocks in a clustered 
environment.


Is there an area where another pair of hands would help... I would 
like to be able to deploy Jackrabbit in a cluster.


One major area is how changes from one cluster node are distributed to 
other cluster nodes. Giota implemented something like a prototype, but 
I'm not sure what the current state is. See also this discussion: 
http://thread.gmane.org/gmane.comp.apache.jackrabbit.devel/6935




Thank you for the pointer. I'll read it. There has been some use of 
JGroups for cluster wide distribution of events but it might not 
make sense here.



Or any other area mentioned in JCR-169, you can simply pick one ;)



Ok, when I get pressure to make it work in a cluster, I'll jump in.
Ian



regards
 marcel




Re: [jira] Commented: (JCR-169) Make Jackrabbit clusterable

2006-09-01 Thread Marcel Reutegger

Ian Boston wrote:
So, if you have 50x200MB of Lucene index... for example and wanted 
that to be accessible in a cluster environment, would Jackrabbit be a 
good place to put those segments ?


just to clarify, would this lucene index be 'application data', which 
is stored like regular content through the JCR api? Or do you mean the 
jackrabbit internal lucene segments?


The big killer for Lucene is the ability to seek efficiently on the 
central blob (I think), but presumably by choosing the right Binary 
storage strategy that comes partially for free ?


Jackrabbit always copies a binary to a temp file or into memory when 
the property value is accessed. That is, the seek would always be 
local. But as I already mentioned in another thread, JCR does not 
support random access on binary properties. A binary property returns 
a plain InputStream.
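
A minimal sketch of what this means for a client that needs seeks: since only an InputStream is available, the application could spool the stream to a local file and seek there. This is not Jackrabbit code. BinarySpool, spoolToTempFile and readAt are hypothetical names, and the InputStream argument stands in for whatever stream the binary property hands back.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration; not part of JCR or Jackrabbit.
class BinarySpool {

    // Copy the whole stream to a temp file so that later reads can seek locally.
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("segment-", ".bin");
        try (InputStream src = in) {
            // createTempFile already created the file, so overwrite it.
            Files.copy(src, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        return tmp;
    }

    // Read `length` bytes starting at `offset` from the spooled file.
    static byte[] readAt(Path file, long offset, int length) throws IOException {
        byte[] buf = new byte[length];
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset);
            raf.readFully(buf); // throws EOFException if the file is too short
        }
        return buf;
    }
}
```

The cost is one full copy per segment before the first seek, which is why this only suits files that are written once and then read many times.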


If this is the case, I could replace my, slightly odd, segment 
distribution mechanism with Jackrabbit.


yes, you certainly get a couple of goodies you otherwise don't have. 
e.g. observation on the index files ;)



Last question,
Is JCR-169 being actively worked on ?


It doesn't have a high priority, but we are working on it on a 
conceptual level. discussions during coffee breaks, etc. Basically how 
the problems stated in JCR-169 can be solved and what needs to be 
changed in the core to implement the feature blocks in a clustered 
environment.


Is there an area where another pair of hands would help... I would like 
to be able to deploy Jackrabbit in a cluster.


One major area is how changes from one cluster node are distributed to 
other cluster nodes. Giota implemented something like a prototype, but 
I'm not sure what the current state is. See also this discussion: 
http://thread.gmane.org/gmane.comp.apache.jackrabbit.devel/6935


Or any other area mentioned in JCR-169, you can simply pick one ;)

regards
 marcel


Re: [jira] Commented: (JCR-169) Make Jackrabbit clusterable

2006-09-01 Thread Ian Boston

Marcel,
I'm replying to the list rather than Jira, since this is OT wrt JCR-169.

So, if you have 50x200MB of Lucene index... for example and wanted 
that to be accessible in a cluster environment, would Jackrabbit be a 
good place to put those segments ?


The big killer for Lucene is the ability to seek efficiently on the 
central blob (I think), but presumably by choosing the right Binary 
storage strategy that comes partially for free ?


If this is the case, I could replace my, slightly odd, segment 
distribution mechanism with Jackrabbit.



Last question,
Is JCR-169 being actively worked on ?
Is there an area where another pair of hands would help... I would like 
to be able to deploy Jackrabbit in a cluster.


Ian


Marcel Reutegger (JIRA) wrote:
[ http://issues.apache.org/jira/browse/JCR-169?page=comments#action_12432083 ] 

Marcel Reutegger commented on JCR-169:

--

Ian, thanks a lot for your comments.

Here are my current thoughts on clustering the search index in jackrabbit:

I think the preferred approach is to put the index into the repository itself. 
See: http://article.gmane.org/gmane.comp.apache.jackrabbit.devel/8530 and 
following messages
This would also allow us to distribute index updates to cluster nodes using the 
repository internal observation mechanism. e.g. the update of a deleted 
documents file or new index segments.


I found the best indexing strategy was to have local copies of segments, stored 
centrally as masters.


I agree. Specifically the design of lucene where index files are only created 
but never modified supports this approach very nicely.


In the search application, speed of update of segments is not that critical,
you probably have a different requirement in JCR. 


JCR is more restrictive in that respect, at least if we want to be compliant 
with the specification. As soon as a node is created in the workspace it must 
be searchable using a query. For most real life systems this is not a hard 
requirement though. E.g. when a document is added to a repository, it usually 
doesn't matter if it is retrievable by query only after a couple of seconds and 
not right away.



Make Jackrabbit clusterable
---

Key: JCR-169
URL: http://issues.apache.org/jira/browse/JCR-169
Project: Jackrabbit
 Issue Type: New Feature
 Components: core
   Reporter: Marcel Reutegger
   Priority: Minor

This jira issue discusses the technical implications on the current design of 
Jackrabbit to introduce clustering.
Particularly the following areas require thorough investigation:
- SharedItemStateManager and its cache
- cache integrity
- cache design: look aside, write through?
- hook for distributed cache, interface?
- isolation level
- transaction integrity within Jackrabbit, interaction with transient layer
- VirtualItemStateProvider
- same strategy as SharedItemStateManager?
- Search index
- single or per cluster node index?
- Observation
Please state more areas if needed.






[jira] Commented: (JCR-169) Make Jackrabbit clusterable

2006-09-01 Thread Marcel Reutegger (JIRA)
[ http://issues.apache.org/jira/browse/JCR-169?page=comments#action_12432083 ] 

Marcel Reutegger commented on JCR-169:
--

Ian, thanks a lot for your comments.

Here are my current thoughts on clustering the search index in jackrabbit:

I think the preferred approach is to put the index into the repository itself. 
See: http://article.gmane.org/gmane.comp.apache.jackrabbit.devel/8530 and 
following messages
This would also allow us to distribute index updates to cluster nodes using the 
repository internal observation mechanism. e.g. the update of a deleted 
documents file or new index segments.

> I found the best indexing strategy was to have local copies of segments, 
> stored centrally as masters.

I agree. Specifically the design of lucene where index files are only created 
but never modified supports this approach very nicely.

> In the search application, speed of update of segments is not that critical,
> you probably have a different requirement in JCR. 

JCR is more restrictive in that respect, at least if we want to be compliant 
with the specification. As soon as a node is created in the workspace it must 
be searchable using a query. For most real life systems this is not a hard 
requirement though. E.g. when a document is added to a repository, it usually 
doesn't matter if it is retrievable by query only after a couple of seconds and 
not right away.



-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (JCR-169) Make Jackrabbit clusterable

2006-08-31 Thread Ian Boston (JIRA)
[ http://issues.apache.org/jira/browse/JCR-169?page=comments#action_12431999 ] 

Ian Boston commented on JCR-169:



Search -
I assume this is the lucene indexes ?
If you haven't got to it already

I'm interested in this Jira because I also want to run in a DB cluster. 
I've just finished implementing a search engine based on Lucene in such a 
cluster, where the only thing shared is the DB. It's in production in one or two 
places with ~10G of index segments on 3+ cluster nodes. The impl is not that 
great (compared to nutch) but here is what I found on the way.

Lucene segments in the DB only work in Oracle (and perhaps other DBs), where 
there is reasonable seek performance on blobs. MySQL (for instance) is hopeless 
at BLOB seeks. Indexes on a shared filesystem generate lots of network traffic. 
NDFS (the MapReduce file system) is great but a complete pain to set up, as is an 
rsync based strategy for segment distribution. I found the best indexing 
strategy was to have local copies of segments, stored centrally as masters. 
When a node in the cluster performs an index operation, a new master segment is 
created and the other nodes sync the master segments. 
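
A minimal sketch of the sync step, assuming segments are identified by name and never change once written, so comparing names is enough to decide what to copy. SegmentSyncPlan and its fields are hypothetical names for illustration and are not taken from the Sakai search code.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical planner; assumes a segment with a given name is immutable.
class SegmentSyncPlan {
    final Set<String> toFetch;   // present in the central master store, missing locally
    final Set<String> toDelete;  // present locally, no longer in the master store

    SegmentSyncPlan(Set<String> masterSegments, Set<String> localSegments) {
        toFetch = new TreeSet<>(masterSegments);
        toFetch.removeAll(localSegments);
        toDelete = new TreeSet<>(localSegments);
        toDelete.removeAll(masterSegments);
    }
}
```

A node would build the plan from the two name listings, copy everything in toFetch to local disk, and drop everything in toDelete once no open reader uses it.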

In the search application, speed of update of segments is not that critical, 
you probably have a different requirement in JCR.

The only point in this strategy that requires a distributed lock is when 
segments are merged (which has to be done to reduce the number of open files) 
or when documents are deleted from the lucene index.
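
A minimal sketch of guarding those two operations, assuming some cluster-wide lock is available behind the standard Lock interface. ExclusiveIndexOps and withLock are hypothetical names. A plain ReentrantLock only protects a single JVM, so it is used here purely as a stand-in for a real distributed lock (for example one backed by the shared DB).

```java
import java.util.concurrent.locks.Lock;
import java.util.function.Supplier;

// Hypothetical wrapper: run merge or delete operations only while holding the lock.
class ExclusiveIndexOps {
    // Assumption: in a real cluster this would be a distributed lock, not an in-JVM one.
    private final Lock lock;

    ExclusiveIndexOps(Lock lock) {
        this.lock = lock;
    }

    // Acquire the lock, run the operation, and always release the lock afterwards.
    <T> T withLock(Supplier<T> op) {
        lock.lock();
        try {
            return op.get();
        } finally {
            lock.unlock();
        }
    }
}
```

Adding new segments stays outside withLock, which matches the observation that only merges and deletes need cluster-wide coordination.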

As I said, the strategy works in production for 50x200MB segments on 3+ cluster 
nodes, without excessive network traffic. If there was an easy NDFS setup that 
could be coded in Java, that would probably be a better solution.

The project is www.sakaiproject.org where I would also like to use 
Jackrabbit :)  

