Chetan Mehrotra created OAK-3092:
------------------------------------

             Summary: Cache recently extracted text to avoid duplicate extraction
                 Key: OAK-3092
                 URL: https://issues.apache.org/jira/browse/OAK-3092
             Project: Jackrabbit Oak
          Issue Type: Improvement
          Components: lucene
            Reporter: Chetan Mehrotra
            Assignee: Chetan Mehrotra
             Fix For: 1.2.4, 1.3.3, 1.0.18


Text can be extracted from the same binary multiple times in a given indexing 
cycle. This can happen for two reasons:

# Multiple Lucene indexes indexing the same node - A system might have multiple 
Lucene indexes, e.g. a global Lucene index and an index for a specific nodeType. 
In a given indexing cycle the same file would be picked up by both index 
definitions, and both would extract the same text.
# Aggregation - With index-time aggregation the same file gets picked up 
multiple times due to aggregation rules.

To avoid the wasted effort of duplicate text extraction from the same file in a 
given indexing cycle, it would be better to have an expiring cache which can 
hold on to extracted text content for some time. The cache should have the 
following features:
# Limit on total size
# Way to expire the content using [Timed Eviction|https://code.google.com/p/guava-libraries/wiki/CachesExplained#Timed_Eviction] 
- As the chances of the same file getting picked up are high only within a 
given indexing cycle, it would be better to expire the cache entries after some 
time to avoid hogging memory unnecessarily.
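
The two cache features listed above could be sketched roughly as follows. This is only an illustration and not the proposed Oak implementation: the class name ExtractedTextCache, the use of a blob identifier string as the key, measuring size in characters, and passing the current time in explicitly are all assumptions made for the example. It uses only the JDK; a Guava cache built with maximumWeight and a timed expiry, as linked above, would likely be the more natural choice in Oak itself.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;

// Hypothetical sketch: a text cache with a total-size limit and timed expiry.
final class ExtractedTextCache {
    private static final class Entry {
        final String text;
        final long expiresAtMillis;

        Entry(String text, long expiresAtMillis) {
            this.text = text;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final long maxChars;   // limit on total cached text, in chars (assumed unit)
    private final long ttlMillis;  // how long an entry stays valid after being stored
    private long currentChars = 0;
    // Insertion-ordered, so iteration visits the oldest entries first.
    private final LinkedHashMap<String, Entry> entries = new LinkedHashMap<>();

    ExtractedTextCache(long maxChars, long ttlMillis) {
        this.maxChars = maxChars;
        this.ttlMillis = ttlMillis;
    }

    // Returns the cached text, or null if absent or expired.
    synchronized String get(String blobId, long nowMillis) {
        Entry e = entries.get(blobId);
        if (e == null) {
            return null;
        }
        if (nowMillis >= e.expiresAtMillis) {
            entries.remove(blobId);
            currentChars -= e.text.length();
            return null;
        }
        return e.text;
    }

    // Stores extracted text, dropping expired entries and then the oldest
    // entries until the new text fits within the size limit.
    synchronized void put(String blobId, String text, long nowMillis) {
        if (text.length() > maxChars) {
            return; // too large to cache at all
        }
        Entry old = entries.remove(blobId);
        if (old != null) {
            currentChars -= old.text.length();
        }
        Iterator<Entry> it = entries.values().iterator();
        while (it.hasNext()) {
            Entry e = it.next();
            boolean expired = nowMillis >= e.expiresAtMillis;
            boolean overBudget = currentChars + text.length() > maxChars;
            if (expired || overBudget) {
                currentChars -= e.text.length();
                it.remove();
            }
        }
        entries.put(blobId, new Entry(text, nowMillis + ttlMillis));
        currentChars += text.length();
    }
}
```

With such a cache in front of the extractor, the second index definition (or aggregation rule) that encounters the same binary within the expiry window would get the text from get() instead of extracting it again. The expiry window would presumably be chosen to cover roughly one indexing cycle.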



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
