A couple of points around the motivation and target use case for Hybrid Indexing, and around Oak indexing in general.
Based on my understanding of various deployments, any application based on Oak has 2 types of query requirements:

QR1. Application Query - These mostly involve property restrictions and are invoked by application code itself to perform some operation. The properties involved here are in most cases sparse, i.e. present in a small subset of the whole repository content. Such queries need to be very fast and may be invoked very frequently. They should also be accurate, and results should not lag the repository state by much.

QR2. User provided query - These queries consist of property restrictions and/or fulltext constraints. The target nodes may form the majority of overall repository content. Such queries need to be fast, but being user driven they need not be very fast. Note that the speed criteria is very subjective and relative here.

Further, Oak needs to support these deployments:

1. Single setup - for dev, or prod on SegmentNodeStore
2. Cluster setup on premise
3. Deployment in some data center

So Oak should enable deployments where smaller setups do not require any third-party system, while still allowing a dedicated system like ES/Solr to be plugged in if the need arises. Both use cases need to be supported. Further, even with access to such a third-party server, it might be fine to rely on embedded Lucene for #QR1 and just delegate queries under #QR2 to the remote system. This would ensure that query results are still fast for usage falling under #QR1.

Hybrid Index Usecase
--------------------

So far for #QR1 we only had property indexes and, to an extent, the Lucene-based property index, where results lag the repository state and the lag might be significant depending on load. Hybrid indexes aim to support queries under #QR1 and can be seen as a replacement for existing non-unique property indexes. Such indexes would have lower storage requirements and would not put much load on remote storage during query execution.
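To make the QR1/QR2 distinction concrete, here is a minimal sketch of what the two query types might look like as JCR-SQL2 statements. The property names (`jobStatus`, the fulltext term) are invented for illustration and are not from the original discussion:

```java
public class QueryTypeExamples {
    public static void main(String[] args) {
        // QR1: application query with a restriction on a sparse property
        // (property name is hypothetical). Issued by code, expected to be
        // fast and to closely reflect current repository state.
        String qr1 = "SELECT * FROM [nt:base] WHERE [jobStatus] = 'PENDING'";

        // QR2: user-provided query combining a fulltext constraint with a
        // property restriction; may match a large part of the repository,
        // so delegating it to a remote Solr/ES index can be acceptable.
        String qr2 = "SELECT * FROM [nt:base] WHERE CONTAINS(*, 'invoice') "
                   + "AND [jcr:language] = 'en'";

        System.out.println("QR1: " + qr1);
        System.out.println("QR2: " + qr2);
    }
}
```

Under the hybrid proposal, a query shaped like `qr1` would be served by the local (embedded Lucene) index, while `qr2` could be delegated to a remote index.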
It's not meant as a replacement for ES/Solr, but rather intends to address a different type of usage.

Very large Indexes
------------------

For deployments having a very large repository, Solr- or ES-based indexes would be preferable, and there oak-solr can be used (some day oak-es!).

So in brief, Oak should be self sufficient for smaller deployments while still allowing Solr/ES to be plugged in for large deployments, and there also give the admin a choice to configure a subset of indexes for such usage depending on the size.

Chetan Mehrotra

On Thu, Aug 11, 2016 at 1:59 PM, Ian Boston <i...@tfd.co.uk> wrote:
> Hi,
>
> On 11 August 2016 at 09:14, Michael Marth <mma...@adobe.com> wrote:
>
>> Hi Ian,
>>
>> No worries - good discussion.
>>
>> I should point out though that my reply to Davide was based on a
>> comparison of the current design vs the Jackrabbit 2 design (in which
>> indexes were stored locally). Maybe I misunderstood Davide’s comment.
>>
>> I will split my answer to your mail in 2 parts:
>>
>> > Full text extraction should be separated from indexing, as the DS blobs
>> > are immutable, so is the full text. There is code to do this in the Oak
>> > indexer, but it's not used to write to the DS at present. It should be
>> > done in a Job, distributed to all nodes, run only once per item. Full
>> > text extraction is hugely expensive.
>>
>> My understanding is that Oak currently:
>> A) runs full text extraction in a separate thread (separate from the
>> “other” indexer)
>> B) runs it only once per cluster
>> If that is correct then the difference to what you mention above would be
>> that you would like the FT indexing not be pinned to one instance but
>> rather be distributed, say round-robin.
>> Right?
>
> Yes.
>
>> > Building the same index on every node doesn't scale for the reasons you
>> > point out, and eventually hits a brick wall.
>> > http://lucene.apache.org/core/6_1_0/core/org/apache/lucene/codecs/lucene60/package-summary.html#Limitations
>> > (Int32 on Document ID per index). One of the reasons for the Hybrid
>> > approach was that the number of Oak documents in some repositories
>> > will exceed that limit.
>>
>> I am not sure what you are arguing for with this comment…
>> It sounds like an argument in favour of the current design - which is
>> probably not what you mean… Could you explain, please?
>
> I didn't communicate that very well.
>
> Currently Lucene (6.1) has a limit of Int32 on the number of documents it
> can store in an index. IIUC there is a long term desire to increase that
> to Int64, but no long term commitment, as it's probably significant work
> given that arrays in Java are indexed with Int32.
>
> The Hybrid approach doesn't help with the potential Lucene brick wall,
> but one motivation for looking at it was the number of Oak Documents,
> including those under /oak:index, which is, in some cases, approaching
> that limit.
>
>> Thanks!
>> Michael
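As an aside on the Int32 limit discussed in the quoted thread: a rough sketch of the numbers involved, in plain Java. The exact Lucene cap used below (Integer.MAX_VALUE - 128, i.e. IndexWriter.MAX_DOCS in Lucene 6.x) is stated from memory and should be treated as an assumption; the broader point, that Java's int-indexed arrays bound the per-index docID space, follows directly from the language:

```java
public class DocIdLimit {
    public static void main(String[] args) {
        // Java arrays are indexed with int, so any array-backed structure
        // (such as a per-index docID space) cannot address more than
        // Integer.MAX_VALUE entries without a significant redesign.
        int maxIntIndex = Integer.MAX_VALUE; // 2147483647, roughly 2.1 billion

        // Lucene 6.x additionally reserves a few docIDs; its per-index cap
        // is IndexWriter.MAX_DOCS = Integer.MAX_VALUE - 128 (assumption).
        long lucenePerIndexCap = (long) Integer.MAX_VALUE - 128;

        System.out.println("Max int-addressable entries:  " + maxIntIndex);
        System.out.println("Lucene per-index doc cap:     " + lucenePerIndexCap);
    }
}
```

A repository whose document count (including nodes under /oak:index) approaches this order of magnitude is the case the thread is concerned about; sharding across multiple indexes, as Solr/ES do, is the usual way past a per-index cap.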