Chetan Mehrotra created OAK-3092:
------------------------------------

             Summary: Cache recently extracted text to avoid duplicate extraction
                 Key: OAK-3092
                 URL: https://issues.apache.org/jira/browse/OAK-3092
             Project: Jackrabbit Oak
          Issue Type: Improvement
          Components: lucene
            Reporter: Chetan Mehrotra
            Assignee: Chetan Mehrotra
             Fix For: 1.2.4, 1.3.3, 1.0.18

Text can be extracted from the same binary multiple times within a given indexing cycle. This can happen for two reasons:

# Multiple Lucene indexes indexing the same node - A system might have multiple Lucene indexes, e.g. a global Lucene index and an index for a specific nodeType. In a given indexing cycle the same file would be picked up by both index definitions, and both would extract the same text.
# Aggregation - With index-time aggregation the same file gets picked up multiple times due to the aggregation rules.

To avoid the wasted effort of duplicate text extraction from the same file in a given indexing cycle, it would be better to have an expiring cache that holds on to extracted text content for some time. The cache should have the following features:

# Limit on total size
# A way to expire content using [Timed Eviction|https://code.google.com/p/guava-libraries/wiki/CachesExplained#Timed_Eviction] - As the chances of the same file being picked up again are high only within a given indexing cycle, it would be better to expire cache entries after some time to avoid hogging memory unnecessarily.
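
The two cache features listed above could be sketched roughly as follows. This is a dependency-free illustration only, not Oak's implementation: the class name ExtractedTextCache, keying by a blob identifier, measuring size in characters, and the injectable clock are all assumptions made for the example. It expires entries a fixed time after they were written; Guava's CacheBuilder offers the same building blocks (maximumWeight with a weigher, expireAfterWrite or expireAfterAccess) and would likely be the more natural choice in practice.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical sketch: size-bounded cache of extracted text with timed expiry. */
final class ExtractedTextCache {
    private static final class Entry {
        final String text;
        final long writtenAtMillis;
        Entry(String text, long writtenAtMillis) {
            this.text = text;
            this.writtenAtMillis = writtenAtMillis;
        }
    }

    private final long maxChars;      // assumed size limit: total cached characters
    private final long ttlMillis;     // entries expire this long after being written
    private final LongSupplier clock; // injectable time source (assumption, eases testing)
    // Access-ordered map: iteration starts at the least recently used entry.
    private final LinkedHashMap<String, Entry> entries =
            new LinkedHashMap<>(16, 0.75f, true);
    private long totalChars;

    ExtractedTextCache(long maxChars, long ttlMillis, LongSupplier clock) {
        this.maxChars = maxChars;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns cached text for blobId, or runs the extractor and caches its result. */
    synchronized String get(String blobId, Function<String, String> extractor) {
        long now = clock.getAsLong();
        Entry e = entries.get(blobId);
        if (e != null) {
            if (now - e.writtenAtMillis < ttlMillis) {
                return e.text; // still fresh: skip the duplicate extraction
            }
            entries.remove(blobId); // expired: drop it and extract again
            totalChars -= e.text.length();
        }
        String text = extractor.apply(blobId);
        put(blobId, text, now);
        return text;
    }

    private void put(String blobId, String text, long now) {
        if (text.length() > maxChars) {
            return; // larger than the whole cache: do not cache at all
        }
        entries.put(blobId, new Entry(text, now));
        totalChars += text.length();
        // Evict least recently used entries until the size limit holds again.
        Iterator<Map.Entry<String, Entry>> it = entries.entrySet().iterator();
        while (totalChars > maxChars && it.hasNext()) {
            totalChars -= it.next().getValue().text.length();
            it.remove();
        }
    }
}
```

In this sketch a second index definition, or an aggregation rule that revisits the same file, would call get with the same identifier and receive the cached text instead of triggering another extraction. Expired entries are only dropped lazily when they are looked up again or pushed out by the size limit.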