Oak Indexing. Was Re: Property index replacement / evolution

Ian Boston Thu, 11 Aug 2016 00:30:08 -0700

Hi Michael,
It's probably rude of me to reply to this, as you addressed it to Davide,
not me.
I have changed the subject line because you said:
"This discussion is only about balancing property indexes vs Lucene indexes"

although the content remains relevant to the original thread.

On 10 August 2016 at 20:44, Michael Marth <mma...@adobe.com> wrote:

> Hi Davide,
>
> My POV:
> Storing the indexes within the repo itself allows for operational
> simplicity. In particular: it allows to create a backup of the persistence
> (including the indexes) in a consistent form - without having to stop
> writes to the repo. In JR2 it is not possible to create a consistent backup
> of nodes and indexes without stopping writes to the repo (to my knowledge
> at least).
> You also extend your question to “what would happen if separate cluster
> nodes would maintain their own indexes (on local/private disc)?”. Two
> things:
> 1. Each cluster node would have to process full text extraction - i.e.
> Computationally expensive
>

Full text extraction should be separated from indexing: the DS blobs are
immutable, and therefore so is the full text extracted from them. There is
code to do this in the Oak indexer, but it's not used to write to the DS at
present. It should be done in a Job, distributed to all nodes, and run only
once per item. Full text extraction is hugely expensive.
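
To illustrate the point about immutable blobs, here is a minimal Python
sketch (not Oak code) of extract-once caching. The class name
ExtractedTextStore, the SHA-256 content key, and the extractor callback are
all hypothetical, chosen only for illustration; Oak's real DS identifiers and
extraction code are not shown here.

```python
import hashlib
from typing import Callable, Dict


class ExtractedTextStore:
    """Hypothetical store: caches extracted text keyed by blob content hash."""

    def __init__(self, extractor: Callable[[bytes], str]) -> None:
        self._extractor = extractor          # the expensive extraction step
        self._texts: Dict[str, str] = {}     # blob id -> extracted text
        self.extraction_count = 0            # how often the extractor really ran

    def text_for(self, blob: bytes) -> str:
        # An immutable blob always hashes to the same id, so the text
        # extracted for that id never needs to be recomputed.
        blob_id = hashlib.sha256(blob).hexdigest()
        if blob_id not in self._texts:
            self._texts[blob_id] = self._extractor(blob)
            self.extraction_count += 1
        return self._texts[blob_id]


# Stand-in extractor; a real one would parse PDFs, Office files, etc.
store = ExtractedTextStore(lambda data: data.decode("utf-8").upper())
first = store.text_for(b"hello oak")
second = store.text_for(b"hello oak")   # served from the cache
```

If the store were shared (for example, written alongside the blobs), every
cluster node could read the text instead of extracting it again.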

> 2. Really bad: if a new node joins the cluster then that node would have
> to re-index the full repo.
>

Building the same index on every node doesn't scale for the reasons you
point out, and eventually hits a brick wall: Lucene uses an Int32 Document
ID per index. See
http://lucene.apache.org/core/6_1_0/core/org/apache/lucene/codecs/lucene60/package-summary.html#Limitations
One of the reasons for the Hybrid approach was that the number of Oak
documents in some repositories will exceed that limit.
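
As rough arithmetic for the brick wall, the sketch below assumes the ceiling
is the largest signed 32-bit value (Lucene's actual per-index limit is
slightly lower) and computes the minimum number of separate indexes needed
for a given document count. The function name and the example count are
hypothetical.

```python
INT32_MAX = 2**31 - 1  # upper bound for an Int32 document ID


def min_indexes_needed(total_docs: int, per_index_limit: int = INT32_MAX) -> int:
    """Smallest number of indexes that can hold total_docs documents."""
    if total_docs < 0 or per_index_limit <= 0:
        raise ValueError("counts must be non-negative and the limit positive")
    # Ceiling division without floating point.
    return -(-total_docs // per_index_limit)


# Hypothetical repository size, for illustration only.
shards = min_indexes_needed(5_000_000_000)
```

Once a repository passes the limit, a single index built identically on
every node is no longer possible; the documents have to be split across
indexes.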

>
> IMHO the current design (to store indexes in the repo itself) is totally
> the right approach.
>

I am reluctant to disagree with you, but I feel I have no option, based on
research, history and first-hand experience over the past 10 years.

Storing indexes in a repo is what Compass did from 2004 onwards. After the
third version they gave up trying to build a scalable, near real time
search engine that way. Version 4 was a rewrite that became ElasticSearch
0.4.0. The history is documented here
https://en.wikipedia.org/wiki/Elasticsearch and was presented at Berlin
Buzzwords in 2010 with a detailed description of why each approach fails. I
have shared this information before. I am not sharing it to confront. I am
sharing it because it pains me to see Oak repeating history. I don't feel I
can stand by and watch in silence.

If Oak does not want to use ES as a library, then learn from the history, as
it addresses your concerns (1, 2, + brick wall) and those of Davide, and
addresses many of the other issues, potentially eliminating property
indexes completely. It will, however, only ever be as NRT as the root
document commit period (1s), well above the 100ms data latency that a model
like the one used by ES delivers under production load.

IMHO, the Hybrid approach being proposed is a step along the same path
that Compass started treading in 2004. It is an innovative solution to a
constrained problem space.

Sorry if I sound like a broken record. From 2006 onwards I did exactly what
Oak has done and is doing, but without a vast user base I was able to be
more agile.

Apache is about doing, not standing by; about fact, not fiction; about
evidence and reasoned argument. If there is any interest, I have an Oak PoC
somewhere that ports the Lucene index plugin to use embedded ES instances,
one per VM, forming an embedded ES cluster. It's not complete, as I gave up
on it when I realised data latency would be fixed by the Oak root document.
My interest was proper real time indexing over the cluster.

Best Regards
Ian

> This discussion is only about balancing property indexes vs Lucene indexes
>
> Michael
>
>
>
>
> On 10/08/16 15:11, "Davide Giannella" <dav...@apache.org> wrote:
>
> >On 09/08/2016 13:18, Ian Boston wrote:
> >> Alternatively, move the indexes so that a sync property index update
> >> doesn't perform a conditional change to the global root document? (A new
> >> thread would be required to discuss this if worth talking about.)
> >
> >I'm stubborn and maybe even slow in learning, but again I ask myself:
> >why are we storing the indexes in the repository itself?
> >
> >I was not part of the original discussion around this; but frankly I
> >would have expected to have the indexes stored separately from the
> >repository. Let's say on the file system. Something like JR2 where it
> >was even possible to delete a directory and all the indexes were
> >re-generated from scratch.
> >
> >What do we lose if we would be moving the indexes outside of the
> >repository? Which means each AEM node will have its own index(es).
> >
> >Cheers
> >Davide
> >
> >
>
