[ https://issues.apache.org/jira/browse/OAK-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chetan Mehrotra resolved OAK-3092.
----------------------------------
    Resolution: Fixed

> Cache recently extracted text to avoid duplicate extraction
> -----------------------------------------------------------
>
>                 Key: OAK-3092
>                 URL: https://issues.apache.org/jira/browse/OAK-3092
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>              Labels: performance
>             Fix For: 1.3.11, 1.2.8, 1.0.24
>
>         Attachments: OAK-3092-v1.patch, OAK-3092-v2.patch
>
>
> Text can be extracted from the same binary multiple times in a given
> indexing cycle. This can happen for two reasons:
> # Multiple Lucene indexes indexing the same node - A system might have
> multiple Lucene indexes, e.g. a global Lucene index and an index for a
> specific nodeType. In a given indexing cycle the same file would be picked
> up by both index definitions, and both would extract the same text.
> # Aggregation - With index-time aggregation the same file gets picked up
> multiple times due to aggregation rules.
> To avoid the wasted effort of duplicate text extraction from the same file
> in a given indexing cycle, it would be better to have an expiring cache
> which can hold on to extracted text content for some time. The cache should
> have the following features:
> # Limit on total size
> # Way to expire the content using [Timed
> Eviction|https://code.google.com/p/guava-libraries/wiki/CachesExplained#Timed_Eviction]
> - As the chances of the same file getting picked up are high only within a
> given indexing cycle, it would be better to expire the cache entries after
> some time to avoid hogging memory unnecessarily.
> Such a cache would provide the following benefits:
> # Avoid duplicate text extraction - Text extraction is costly and has to be
> minimized on the critical path of {{indexEditor}}
> # Avoid expensive IO, especially if binary content is to be fetched from a
> remote {{BlobStore}}
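
The cache requested in the issue (a limit on total size plus timed expiry) could look roughly like the sketch below. This is not the code from the attached patches: the class name ExtractedTextCache, the String blob id used as the key, the character-count size budget, and the injected clock are all assumptions made for illustration. It uses only java.util so that it runs standalone; Guava's CacheBuilder, which the issue links to, offers comparable size-based and time-based eviction.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical sketch of a size-bounded, expiring cache for extracted text. */
final class ExtractedTextCache {

    /** Cached text together with the time it was written. */
    private static final class CachedText {
        final String text;
        final long writtenAtMillis;

        CachedText(String text, long writtenAtMillis) {
            this.text = text;
            this.writtenAtMillis = writtenAtMillis;
        }
    }

    private final long maxChars;      // assumed size budget, in characters
    private final long ttlMillis;     // entries expire this long after being written
    private final LongSupplier clock; // injected so that expiry can be tested
    private long currentChars;

    // Insertion-ordered map: the oldest write is always iterated first.
    private final LinkedHashMap<String, CachedText> entries = new LinkedHashMap<>();

    ExtractedTextCache(long maxChars, long ttlMillis, LongSupplier clock) {
        this.maxChars = maxChars;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns the cached text for blobId, running the extractor only on a miss. */
    synchronized String get(String blobId, Function<String, String> extractor) {
        long now = clock.getAsLong();
        evictExpired(now);
        CachedText cached = entries.get(blobId);
        if (cached != null) {
            return cached.text;
        }
        String text = extractor.apply(blobId);
        put(blobId, text, now);
        return text;
    }

    synchronized int size() {
        return entries.size();
    }

    private void put(String blobId, String text, long now) {
        if (text.length() > maxChars) {
            return; // larger than the whole budget: do not cache
        }
        entries.put(blobId, new CachedText(text, now));
        currentChars += text.length();
        // Drop the oldest entries until the total is back within budget.
        Iterator<Map.Entry<String, CachedText>> it = entries.entrySet().iterator();
        while (currentChars > maxChars && it.hasNext()) {
            currentChars -= it.next().getValue().text.length();
            it.remove();
        }
    }

    private void evictExpired(long now) {
        Iterator<Map.Entry<String, CachedText>> it = entries.entrySet().iterator();
        while (it.hasNext()) {
            CachedText oldest = it.next().getValue();
            if (now - oldest.writtenAtMillis < ttlMillis) {
                break; // every later entry was written more recently
            }
            currentChars -= oldest.text.length();
            it.remove();
        }
    }
}
```

In this sketch a second index definition, or an aggregation rule, that asks for the same blob id within the expiry window gets the cached text back without invoking the extractor again. The extractor runs while the lock is held, which keeps the sketch short but would serialize extraction; a real implementation would probably want finer-grained locking.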