Hi Ian,

No worries - good discussion.

I should point out though that my reply to Davide was based on a comparison of 
the current design vs the Jackrabbit 2 design (in which indexes were stored 
locally). Maybe I misunderstood Davide’s comment.

I will split my answer to your mail into two parts:


>
>Full text extraction should be separated from indexing, as the DS blobs are
>immutable, so is the full text. There is code to do this in the Oak
>indexer, but it's not used to write to the DS at present. It should be done
>in a Job, distributed to all nodes, run only once per item. Full text
>extraction is hugely expensive.

My understanding is that Oak currently:
A) runs full text extraction in a separate thread (separate from the “other” 
indexer)
B) runs it only once per cluster
If that is correct, then the difference to what you mention above would be that 
you would like the FT indexing not to be pinned to one instance but rather to be 
distributed, say round-robin.
Right?
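
To make sure we mean the same thing, here is a minimal sketch of the two 
options as I understand them. It is purely illustrative Python, not Oak code: 
the function names, blob ids and node names are all made up, and it only shows 
how jobs would be assigned, not how extraction itself runs.

```python
from itertools import cycle


def assign_pinned(blob_ids, nodes):
    """Pinned: every extraction job goes to one designated node.

    Assumption for illustration: the first node in the list plays that role.
    """
    leader = nodes[0]
    return {blob_id: leader for blob_id in blob_ids}


def assign_round_robin(blob_ids, nodes):
    """Distributed: jobs are handed to the nodes in turn.

    Each blob is still assigned exactly once, so extraction would run
    only once per blob across the cluster.
    """
    node_cycle = cycle(nodes)
    return {blob_id: next(node_cycle) for blob_id in blob_ids}


# Hypothetical blob ids and cluster members.
blobs = ["blob-1", "blob-2", "blob-3", "blob-4"]
cluster = ["node-a", "node-b"]

pinned = assign_pinned(blobs, cluster)
distributed = assign_round_robin(blobs, cluster)
```

In both cases each blob is extracted once; the only difference is whether a 
single instance carries the whole load.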


>Building the same index on every node doesn't scale for the reasons you
>point out, and eventually hits a brick wall.
>http://lucene.apache.org/core/6_1_0/core/org/apache/lucene/codecs/lucene60/package-summary.html#Limitations.
>(Int32 on Document ID per index). One of the reasons for the Hybrid
>approach was the number of Oak documents in some repositories will exceed
>that limit.

I am not sure what you are arguing for with this comment…
It sounds like an argument in favour of the current design - which is probably 
not what you mean… Could you explain, please?


Thanks!
Michael
