Hey,

I've caught up with all mails in this thread and would like to make
some general remarks. Admittedly, I do not yet work with Oak and do
not yet know much about its indexing strategy/implementation, but I
do know quite a few details about the old JR2 index implementation,
about ES, about Lucene and about JCR in general.

I agree with Ian that in the past every attempt to store the Lucene
index away from the code has failed. I think he forgot to mention
Lucandra :-). About 8 years ago Simon Willnauer was pretty explicit
about it in a conversation with me: bring the computation to the data
with Lucene; every other attempt will fail. I also talked with Jukka
(5 years ago?) when he explained the Oak indexing setup to me. I asked
him how this would work, because bringing the data to the code (during
query execution) doesn't perform. Obviously Jukka was aware of this.

AFAIU, Oak keeps a local copy of the Lucene segments it stores in
MongoDB. So it doesn't bring the data to the computation during query
execution: the Lucene data is local. In this sense, I think Ian's fear
about the Lucene index not being local does not apply. It is
confusing: the index is stored externally, but when used, it is copied
locally. That is at least what I understand.

With respect to using ES (and sharding), embedded in Oak or not, I
consider the crux of the requirement to be well explained by Chetan:

QR1. Application Query - These mostly involve some property
restrictions and are invoked by code itself to perform some operation.
<snip/>

QR2. User provided query - These queries would consist of both or
either of property restriction and fulltext constraints. <snip/>

With ES (with sharding), the QR1-type queries will never be fast
enough. We (Hippo) have code that can result in hundreds of queries
for a single request (for example, for every document in a folder,
show the translations of the document). In JR2, simple queries return
within 1 ms (and faster). You'll never be able to deliver this with ES
(clustered with sharding); the network latency alone is orders of
magnitude higher. Obviously I do *not* claim that ES has a worse
Lucene implementation than JR2 has. Quite surely the opposite, but the
implementation serves a very different purpose. It is like comparing a
ConcurrentHashMap used as a cache with a Terracotta cluster-wide
cache: some use cases require the one, some the other.

Also, something I did not see mentioned in this thread is
authorization (aka fine-grained ACLs). If you include the ACL
requirements, using an ES index (with sharding) becomes even more
problematic: how many query results do you fetch if you don't know how
many the user is allowed to read? What if you want to return 1,000
hits, but the JCR user has read access to only about 1% of them? Fetch
100,000 hits from ES? And then 100,000 more if you did not find 1,000
authorized ones?
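To make the over-fetch problem concrete, here is a rough plain-Java
sketch (names like 'searchBatch' and 'canRead' are made up for
illustration; they stand in for a remote ES query and the JCR
permission check):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;
import java.util.function.Predicate;

// Sketch: ACL filtering on top of a remote index forces over-fetching.
// 'searchBatch' stands in for a remote (ES-style) query returning hit
// ids for an offset; 'canRead' stands in for the JCR permission check.
public class AclOverFetch {

    /**
     * Fetch batches until 'wanted' authorized hits are collected,
     * counting the remote round trips needed.
     * Returns { authorizedHits, roundTrips }.
     */
    public static int[] fetchAuthorized(IntFunction<List<String>> searchBatch,
                                        Predicate<String> canRead,
                                        int wanted, int batchSize) {
        List<String> authorized = new ArrayList<>();
        int offset = 0, roundTrips = 0;
        while (authorized.size() < wanted) {
            List<String> batch = searchBatch.apply(offset);
            roundTrips++;
            if (batch.isEmpty()) break;  // index exhausted
            for (String id : batch) {
                if (canRead.test(id) && authorized.size() < wanted) {
                    authorized.add(id);
                }
            }
            offset += batchSize;
        }
        return new int[] { authorized.size(), roundTrips };
    }
}
```

With 1% readability, collecting even a small page of authorized hits
already costs many remote round trips; with JR2's in-process index the
same filtering happens without any network hop at all.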

In JR2, at Hippo we combine every query with a 'Lucene authorization
query'. This authorization query easily becomes a nested boolean query
with hundreds of boolean queries inside. The only way this performs is
by using a caching Lucene filter [1]. I doubt whether this is possible
with ES (perhaps with a custom endpoint and some token that maps to an
authorization query). Either way, long story short: I think ES serves
different use cases much, much better than JR2 or Oak will ever be
able to. At Hippo, for example, we store every visitor's page request,
including metadata, in ES to support trend analysis. ES is perfect for
this. I'd never want to store this in a hierarchical content structure
with versioning, eventual consistency, ACL support, support for moving
subtrees, etc. But it is exactly these features that imho make ES in
turn unsuited for supporting the QR1-type queries for JCR.
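The idea behind the caching filter can be sketched in plain Java (this
is only a model of the concept, not the actual Lucene filter from [1];
'readerKey' stands in for an IndexReader's cache key and 'aclCheck'
for evaluating the big nested authorization query):

```java
import java.util.BitSet;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.IntPredicate;

// Sketch: evaluate the (large, nested) authorization query once per
// index reader, keep the resulting doc-id bit set, and intersect
// every user query with it instead of re-running the ACL logic.
public class CachingAuthFilter {
    private final Map<Object, BitSet> cache = new ConcurrentHashMap<>();

    /** Doc ids the session may read, computed once per reader generation. */
    public BitSet authorizedDocs(Object readerKey, int maxDoc, IntPredicate aclCheck) {
        return cache.computeIfAbsent(readerKey, k -> {
            BitSet bits = new BitSet(maxDoc);
            for (int doc = 0; doc < maxDoc; doc++) {
                if (aclCheck.test(doc)) bits.set(doc);  // expensive, done once
            }
            return bits;
        });
    }

    /** Intersect raw query hits with the cached authorization bits. */
    public BitSet filter(BitSet queryHits, Object readerKey, int maxDoc,
                         IntPredicate aclCheck) {
        BitSet result = (BitSet) queryHits.clone();
        result.and(authorizedDocs(readerKey, maxDoc, aclCheck));
        return result;
    }
}
```

The point is that the expensive part runs once per index generation,
after which every query pays only a cheap bit-set intersection.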

As far as I can judge, the hybrid approach suggested by Chetan makes
sense to me. Next to that, ES support for QR2-type queries makes sense
(possibly with a delay, because these are less application-driven
queries). However, I consider ES support more an integration feature
than a core Oak requirement.

Some general other remarks:

Some mails argued that text extraction is expensive and that this
justifies storing the index in the database. I don't fully agree. Text
extraction is only expensive for (some) binaries, most notably PDFs.
At Hippo we therefore store a sibling of jcr:data, namely the binary
'hippo:text'. If hippo:text is present, we do not extract the jcr:data
but use the hippo:text binary, which is the extracted text. With this
approach, text extraction happens only once, and it does not require
an index to be stored in the repository.
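In code the idea is trivial; here is a minimal sketch (a Map stands in
for the node's properties, 'extract' for a Tika-style extractor; the
property names follow this mail, everything else is illustrative):

```java
import java.util.Map;
import java.util.function.Function;

// Sketch of the hippo:text idea: persist the extracted text as a
// sibling of jcr:data, so expensive extraction (e.g. for PDFs) runs
// at most once per binary, regardless of where the index lives.
public class ExtractOnce {

    public static String textForIndexing(Map<String, byte[]> properties,
                                         Function<byte[], String> extract) {
        byte[] cached = properties.get("hippo:text");
        if (cached != null) {
            return new String(cached);  // reuse previously extracted text
        }
        String text = extract.apply(properties.get("jcr:data"));
        properties.put("hippo:text", text.getBytes());  // persist for next time
        return text;
    }
}
```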

It was also mentioned that in JR2, when a cluster node crashes, its
index might be corrupt. Perhaps when the node crashes because the disk
is full, but otherwise the index is in general not corrupt: there is a
redo.log file on the file system which contains the JCR nodes that are
indexed in the in-memory index but not yet flushed to disk.

A remark was made that bringing up a new cluster node requires
(re)building the entire index. This is only partially true. From a
shut-down cluster node you can copy the index and make sure that when
the new cluster node starts up, its revision number is set equal to
the revision number of the cluster node at the time the index was
copied. For a local POC (because we want to scale out more easily in
the cloud), I already have it working to create Lucene snapshots of a
running JR2 repository. It is not that hard if you use an existing
multi-reader in JR (which also contains the in-memory index) and flush
it to the file system. Along with the index we also flush the revision
id of the cluster node at that time. A new node can then start up with
the exported index and set its revision accordingly.
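Conceptually the snapshot mechanism looks like this (a toy plain-Java
model, not the JR2 on-disk format: revision -> indexed path stands in
for the flushed index, and the journal for the cluster change log):

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch: export the merged index view together with the cluster
// node's revision id, so a new node can start from the snapshot and
// replay only the changes made after that revision.
public class IndexSnapshot {
    final Map<Long, String> indexedEntries;  // flushed index content
    final long revision;                     // cluster revision at export time

    IndexSnapshot(Map<Long, String> entries, long revision) {
        this.indexedEntries = new TreeMap<>(entries);
        this.revision = revision;
    }

    /** A fresh cluster node starts from the snapshot and replays only newer revisions. */
    static Map<Long, String> catchUp(IndexSnapshot snapshot, Map<Long, String> journal) {
        Map<Long, String> index = new TreeMap<>(snapshot.indexedEntries);
        journal.forEach((rev, path) -> {
            if (rev > snapshot.revision) index.put(rev, path);
        });
        return index;
    }
}
```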

Regards,
Ard

[1] 
https://code.onehippo.org/cms-community/hippo-repository/blob/master/engine/src/main/java/org/hippoecm/repository/query/lucene/util/CachingMultiReaderQueryFilter.java

On Fri, Aug 12, 2016 at 8:23 AM, Chetan Mehrotra
<chetan.mehro...@gmail.com> wrote:
> On Thu, Aug 11, 2016 at 7:33 PM, Ian Boston <i...@tfd.co.uk> wrote:
>> That probably means the queue should only
>> contain pointers to Documents and only index the Document as retrieved. I
>> don't know if that can ever work.
>
> That would not work, as what a document looks like would vary across
> cluster nodes, and what is to be considered a valid entry is also not
> defined at that level
>
>> Run a single thread on the master, that indexes into a co-located ES
> cluster.
>
> While keeping things simple that looks like the safe way
>
>> BTW, how does Hybrid manage to parallelise the indexing and maintain
> consistency ?
>
> Hybrid indexing does not affect async indexing. Under this, each
> cluster node maintains its local indexes, which only contain local
> changes [1]. These indexes are not aware of similar indexes on other
> cluster nodes. Further, the indexes are supposed to only contain
> entries from the last async indexing cycle. Older entries are purged
> [2]. The query would then consult both indexes (an IndexSearcher
> backed by a MultiReader: 1 reader from the async index and 1 (or 2)
> from the local index).
>
> Also note that the QueryEngine would enforce and reevaluate the
> property restrictions. So even if the index has an entry based on old
> state, the QE would filter it out if it does not match the criteria
> per the current repository state. The aim here is to have the index
> provide a superset of the result set.
>
> In all this, the async index logic remains the same (single threaded)
> and based on diffs. So it remains consistent with the repository state
>
> Chetan Mehrotra
> [1] They might also contain entries which are determined based on
> external diff. Read [3] for details
> [2] Purging here is done by maintaining a different local index copy
> for each async indexing cycle. At most 2 indexes are retained and
> older indexes are removed. This keeps the indexes small
> [3] 
> https://issues.apache.org/jira/browse/OAK-4412?focusedCommentId=15405340&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15405340



-- 
Hippo Netherlands, Oosteinde 11, 1017 WT Amsterdam, Netherlands
Hippo USA, Inc. 71 Summer Street, 2nd Floor, Boston, MA 02110, United
States of America.

US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com
