[ https://issues.apache.org/jira/browse/OAK-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14996247#comment-14996247 ]
Chetan Mehrotra edited comment on OAK-3092 at 11/9/15 9:29 AM:
---------------------------------------------------------------

[patch|^OAK-3092-v1.patch] implementing the approach described above
* Exposes 2 OSGi config options: the size of the cache and the expiry time for cached entries
* Setting the cache size to 0 disables the cache.
* By default the cache is disabled. Once the feature has been validated in actual use, the default will be changed.
* CacheStateMBean is exposed if the cache is enabled

The patch will need minor tweaks once OAK-3598 is resolved.

[~alexparvulescu] [~edivad] Can you review the patch?

> Cache recently extracted text to avoid duplicate extraction
> -----------------------------------------------------------
>
>                 Key: OAK-3092
>                 URL: https://issues.apache.org/jira/browse/OAK-3092
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>              Labels: performance
>             Fix For: 1.3.10
>
>         Attachments: OAK-3092-v1.patch
>
>
> Text can be extracted from the same binary multiple times in a
> given indexing cycle. This can happen for 2 reasons
> # Multiple Lucene indexes indexing the same node - A system might have multiple
> Lucene indexes e.g. a global Lucene index and an index for a specific nodeType.
> In a given indexing cycle the same file would be picked up by both index
> definitions and both would extract the same text
> # Aggregation - With index time aggregation the same file gets picked up multiple
> times due to aggregation rules
> To avoid the wasted effort of duplicate text extraction from the same file in a
> given indexing cycle it would be better to have an expiring cache which can
> hold on to extracted text content for some time. The cache should have the
> following features
> # Limit on total size
> # Way to expire the content using [Timed Eviction|https://code.google.com/p/guava-libraries/wiki/CachesExplained#Timed_Eviction]
> - As the chances of the same file getting picked up are high only within a given
> indexing cycle it would be better to expire the cache entries after some time
> to avoid hogging memory unnecessarily
> Such a cache would provide the following benefits
> # Avoid duplicate text extraction - Text extraction is costly and has to be
> minimized on the critical path of {{indexEditor}}
> # Avoid expensive IO, especially if binary content is to be fetched from a
> remote {{BlobStore}}
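
The cache described in the issue (bounded total size, entries that expire after a fixed time, and a size of 0 meaning "disabled") could be sketched roughly as follows. This is only an illustration and not the code in OAK-3092-v1.patch: the issue points at Guava's timed eviction, whereas this sketch uses only the JDK so that it is self-contained. The class name {{ExtractedTextCacheSketch}}, keying by a blob id string, and measuring size in characters are all assumptions made for the example.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical sketch of a size-bounded, time-expiring cache for extracted text. */
final class ExtractedTextCacheSketch {

    /** Cached text plus the time it was written, used for expiry checks. */
    private static final class Entry {
        final String text;
        final long writtenAt;

        Entry(String text, long writtenAt) {
            this.text = text;
            this.writtenAt = writtenAt;
        }
    }

    private final long maxChars;      // assumed size unit: characters; 0 disables the cache
    private final long expiryMillis;  // entries older than this are treated as absent
    private final LongSupplier clock; // injectable time source, in milliseconds

    // Access-ordered map: iteration starts at the least recently used entry.
    private final LinkedHashMap<String, Entry> entries =
            new LinkedHashMap<>(16, 0.75f, true);
    private long currentChars;

    ExtractedTextCacheSketch(long maxChars, long expiryMillis, LongSupplier clock) {
        this.maxChars = maxChars;
        this.expiryMillis = expiryMillis;
        this.clock = clock;
    }

    /** Returns the cached text, or null if it is absent, expired, or the cache is disabled. */
    synchronized String get(String blobId) {
        if (maxChars == 0) {
            return null;
        }
        Entry e = entries.get(blobId);
        if (e == null) {
            return null;
        }
        if (clock.getAsLong() - e.writtenAt >= expiryMillis) {
            // Expired: drop the entry and give back its share of the size budget.
            entries.remove(blobId);
            currentChars -= e.text.length();
            return null;
        }
        return e.text;
    }

    /** Stores extracted text, evicting least recently used entries to stay within the budget. */
    synchronized void put(String blobId, String text) {
        if (maxChars == 0 || text.length() > maxChars) {
            return; // cache disabled, or the text alone would exceed the whole budget
        }
        Entry old = entries.put(blobId, new Entry(text, clock.getAsLong()));
        if (old != null) {
            currentChars -= old.text.length();
        }
        currentChars += text.length();

        Iterator<Map.Entry<String, Entry>> it = entries.entrySet().iterator();
        while (currentChars > maxChars && it.hasNext()) {
            currentChars -= it.next().getValue().text.length();
            it.remove();
        }
    }
}
```

An index editor would call {{get}} before running text extraction and {{put}} afterwards, so that a second index definition or an aggregation rule hitting the same binary in the same cycle reuses the first result. Expiry is checked lazily on read here; a Guava-based cache would handle eviction and statistics itself.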