Hi Ian

Thanks for the informative response. I can see how mapping Lucene implementation details and assumptions to a clustered storage can be challenging. So on TarMK having synchronous Lucene indexes should be fine, while on DocumentMK it could lead to a degradation of I/O and potentially a lot of commit conflicts/retries.
Separating text-extraction from indexing sounds interesting!

Regards
Julian

On Wed, Nov 4, 2015 at 12:07 PM, Ian Boston <[email protected]> wrote:
> Hi,
> Slightly off topic response:
>
> With the current indexing scheme: (IIUC).
> One factor is that with shared index files, indexing can only be performed
> on a cluster leader, and for updates the lucene segments must be written to
> the repository to be read by other instances in the cluster. That means a
> hard lucene commit. If the indexing is sync, then that will mean a large
> number of hard lucene commits, which generally leads to either latency or
> lots of IO or lots of segments. Hence Async is more efficient.
>
> If all lucene indexing is performed locally and the segments are not
> shared, sync indexing works without issue as updates can be written to a
> write ahead log, then added to the index with a soft commit, and the wal
> adjusted on periodic hard commits. local indexing is viable using the
> current scheme in a standalone environment.
>
> text extraction should ideally happen as a 1 time operation on immutable
> content bodies, the result being stored as metadata of the content body.
> imho it should be a separate operation from index update which should only
> deal with indexing properties, including a already tokenized stream.
> Tokenizing can be extremely resource expensive, especially with bad
> content, like vector remastered pdfs, hence why it should not block index
> updates.
>
> Best Regards
> Ian
>
> On 4 November 2015 at 10:37, Julian Sedding <[email protected]> wrote:
>
>> Slightly off topic: why is/should Lucene Indexes always be async? I
>> understand that requirement for a full-text index, which may need to
>> do (slow) text-extraction. However, updates on a Lucene-based property
>> index are usually very fast. So it is not obvious to me why they
>> should not be synchronous.
>>
>> Thanks for any enlightening replies!
>>
>> Regards
>> Julian
>>
>> On Wed, Nov 4, 2015 at 9:49 AM, Ian Boston <[email protected]> wrote:
>> > On 4 November 2015 at 00:45, Davide Giannella <[email protected]> wrote:
>> >
>> >> Hello Team,
>> >>
>> >> Lucene index is always asynchronous and the async index could lag behind
>> >> by definition.
>> >>
>> >> Sometimes we could have the same query better served by a property
>> >> index, or traversing for example. In case the async index is lagging
>> >> behind it could be that the traversing index is better suited to return
>> >> the information as it will be more updated.
>> >>
>> >> As we know we run an async update every 5 seconds, we could come up with
>> >> some algorithm to be used on the cost computing, that auto correct with
>> >> some math the cost, increasing it the more the time passed since the
>> >> last full execution of async index.
>> >>
>> >> WDYT?
>> >>
>> >
>> > Going down the property index route, for a DocumentMK instance will bloat
>> > the DocumentStore further. That already consumes 60% of a production
>> > repository and like many in DB inverted indexes is not an efficient storage
>> > structure. It's probably ok for TarMK.
>> >
>> > Traversals are a problem for production. They will create random outages
>> > under any sort of concurrent load.
>> >
>> > ---
>> > If the way the indexing was performed is changed, it could make the index
>> > NRT or real time depending on your point of view. eg. Local indexes, each
>> > Oak index in the cluster becoming a shard with replication to cover
>> > instance unavailability. No more indexing cycles, soft commits with each
>> > instance using a FS Directory and a update queue replacing the async
>> > indexing queue. Query by map reduce. It might have to copy on write to seed
>> > new instances where the number of instances falls below 3.
>> >
>> > Best Regards
>> > Ian
>> >
>> >>
>> >> Davide
>> >>
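
For illustration only, the cost correction Davide proposes at the start of the thread (raise the async index's cost the longer it has been since its last full run) could be sketched roughly as below. This is a minimal sketch under stated assumptions, not Oak's actual cost model: the class `StalenessAdjustedCost`, its method `adjust`, the `penaltyPerMissedCycle` knob and the linear penalty are all hypothetical. Only the 5 second cycle comes from the thread.

```java
import java.time.Duration;

/**
 * Hypothetical helper, not part of Oak: inflates the cost estimate of an
 * async index in proportion to how far it lags behind its expected cycle.
 */
final class StalenessAdjustedCost {
    private final long expectedIntervalMs;      // e.g. the 5 second async cycle mentioned in the thread
    private final double penaltyPerMissedCycle; // assumed tuning knob, not an Oak setting

    StalenessAdjustedCost(Duration expectedInterval, double penaltyPerMissedCycle) {
        if (expectedInterval.toMillis() <= 0) {
            throw new IllegalArgumentException("expectedInterval must be at least 1 ms");
        }
        if (penaltyPerMissedCycle < 0) {
            throw new IllegalArgumentException("penaltyPerMissedCycle must not be negative");
        }
        this.expectedIntervalMs = expectedInterval.toMillis();
        this.penaltyPerMissedCycle = penaltyPerMissedCycle;
    }

    /**
     * Returns baseCost unchanged while the index is within one cycle of its
     * last full run, and grows it linearly for every further cycle missed.
     */
    double adjust(double baseCost, Duration sinceLastFullRun) {
        double overdueMs = sinceLastFullRun.toMillis() - expectedIntervalMs;
        double missedCycles = Math.max(0.0, overdueMs / expectedIntervalMs);
        return baseCost * (1.0 + penaltyPerMissedCycle * missedCycles);
    }
}
```

With a 5 second interval and a penalty of 0.5, an index that last completed 15 seconds ago has missed two cycles, so its cost doubles. A linear penalty is just the simplest choice here; a capped or exponential curve would fit the same interface, and none of this addresses Ian's objections to the property index and traversal alternatives.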
