[ https://issues.apache.org/jira/browse/OAK-7947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16714647#comment-16714647 ]
Vikas Saurabh commented on OAK-7947:
------------------------------------

[~tmueller], the patch you had attached seems quite risky to me (as it touches quite a lot of places) and would only solve "avoid opening index as long as possible wrt index definitions". If an index definition can potentially answer the query and we want to open the index to, say, get num docs or num docs per field, then we would still copy in all index files. I have at least one comment on the patch, which I'll note at the end.

Maybe we could try a different approach - let index open happen as it happens today, but copy the required files right away (synchronously) and schedule the rest of the files for later.

Here's a snippet of the size-sorted list of files from an 11G {{damAssetLucene}} index that [~chibulcu] had provided me from an AEM instance:
{noformat}
$ ls -lsSh
total 11G
4.5G -rw-r--r-- 1 vsaurabh vsaurabh 4.5G Nov 23 12:20 _101z.fdt
4.5G -rw-r--r-- 1 vsaurabh vsaurabh 4.5G Nov 23 13:43 _1zt4.fdt
580M -rw-r--r-- 1 vsaurabh vsaurabh 580M Nov 23 12:20 _101z.pos
579M -rw-r--r-- 1 vsaurabh vsaurabh 579M Nov 23 13:43 _1zt4.pos
177M -rw-r--r-- 1 vsaurabh vsaurabh 177M Nov 23 13:44 _20z0.cfs
106M -rw-r--r-- 1 vsaurabh vsaurabh 106M Nov 23 12:20 _1x4o.cfs
 65M -rw-r--r-- 1 vsaurabh vsaurabh  65M Nov 23 13:44 _20bb.cfs
 29M -rw-r--r-- 1 vsaurabh vsaurabh  29M Nov 23 13:44 _217z.cfs
 16M -rw-r--r-- 1 vsaurabh vsaurabh  16M Nov 23 12:10 _101z.doc
 16M -rw-r--r-- 1 vsaurabh vsaurabh  16M Nov 23 12:20 _1zt4.doc
6.7M -rw-r--r-- 1 vsaurabh vsaurabh 6.7M Nov 23 13:44 _21ef.cfs
6.5M -rw-r--r-- 1 vsaurabh vsaurabh 6.5M Nov 23 13:44 _216f.cfs
6.3M -rw-r--r-- 1 vsaurabh vsaurabh 6.3M Nov 23 12:20 _101z.tim
5.9M -rw-r--r-- 1 vsaurabh vsaurabh 5.9M Nov 23 13:43 _1zt4.tim
5.9M -rw-r--r-- 1 vsaurabh vsaurabh 5.9M Nov 23 13:44 _21cy.cfs
4.4M -rw-r--r-- 1 vsaurabh vsaurabh 4.4M Nov 23 13:44 _21ab.cfs
3.8M -rw-r--r-- 1 vsaurabh vsaurabh 3.8M Nov 23 13:44 _21e4.cfs
3.7M -rw-r--r-- 1 vsaurabh vsaurabh 3.7M Nov 23 13:44 _21du.cfs
3.0M -rw-r--r-- 1 vsaurabh vsaurabh 3.0M Nov 23 13:44 _21dk.cfs
2.6M -rw-r--r-- 1 vsaurabh vsaurabh 2.6M Nov 23 13:44 _21f1.cfs
648K -rw-r--r-- 1 vsaurabh vsaurabh 647K Nov 23 12:10 _101z.dvd
424K -rw-r--r-- 1 vsaurabh vsaurabh 421K Nov 23 12:20 _1zt4.dvd
380K -rw-r--r-- 1 vsaurabh vsaurabh 378K Nov 23 12:20 _101z.fdx
372K -rw-r--r-- 1 vsaurabh vsaurabh 369K Nov 23 13:43 _1zt4.fdx
120K -rw-r--r-- 1 vsaurabh vsaurabh 120K Nov 23 13:44 _21f7.cfs
120K -rw-r--r-- 1 vsaurabh vsaurabh 120K Nov 23 13:44 _21f4.cfs
....
....
{noformat}

Looking at https://lucene.apache.org/core/4_7_1/core/org/apache/lucene/codecs/lucene46/package-summary.html, {{fdt}} files hold stored field data and {{pos}} files hold positional data for indexed terms. Neither should need to be loaded just for cost evaluation afaict (we should probably try to confirm this btw). These two make up the biggest chunk of the files - so just avoiding copying them over when opening an index would probably save us a lot of time on first-time index open. Additionally, this approach is much less risky imo.

_patch review_

When the index isn't in the index map, the changes in
{noformat}
public LuceneIndexDefinition getIndexDefinition(String indexPath){
{noformat}
provide the definition that is visible in the tree rather than the stored index definition. This would change the behavior of the planner so that it starts to use index definitions that haven't been indexed yet as well.

Afaics, the other changes are essentially doing lazy init and won't affect behavior - but they do make avoiding index open a little brittle to control (an unrelated part of the code might start calling something that would in turn happily open the index).

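As a rough sketch of the alternative approach suggested above (copy what is needed to open the index right away, schedule {{fdt}} and {{pos}} for later), the file split could look something like the following. The class and method names are hypothetical and not part of Oak or of the attached patch, and treating exactly these two extensions as deferrable is the still-unconfirmed assumption mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not an Oak class): splits index files into "copy now" and "copy later". */
class IndexCopyPlan {
    // Assumption: stored fields (.fdt) and positions (.pos) are not needed to open
    // the index for cost evaluation; this still has to be confirmed.
    private static final Set<String> DEFERRED_EXTENSIONS =
            new HashSet<>(Arrays.asList("fdt", "pos"));

    /** Files to copy synchronously before the index is opened. */
    final List<String> eager = new ArrayList<>();
    /** Files whose copy can be scheduled for later. */
    final List<String> deferred = new ArrayList<>();

    static IndexCopyPlan of(Collection<String> fileNames) {
        IndexCopyPlan plan = new IndexCopyPlan();
        for (String name : fileNames) {
            int dot = name.lastIndexOf('.');
            String ext = dot < 0 ? "" : name.substring(dot + 1);
            // Anything that is not explicitly deferrable is copied eagerly.
            (DEFERRED_EXTENSIONS.contains(ext) ? plan.deferred : plan.eager).add(name);
        }
        return plan;
    }

    public static void main(String[] args) {
        // File names taken from the listing above.
        IndexCopyPlan plan = IndexCopyPlan.of(Arrays.asList(
                "_101z.fdt", "_101z.pos", "_101z.doc", "_101z.tim", "_20z0.cfs"));
        System.out.println("copy now:   " + plan.eager);
        System.out.println("copy later: " + plan.deferred);
    }
}
```

Note that in this sketch {{cfs}} (compound) files are always copied eagerly, since they bundle all of a segment's files together; whether the actual copying would be wired into the existing copy-on-read code this way is left open here.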
> Lazy loading of Lucene index files startup
> ------------------------------------------
>
>                 Key: OAK-7947
>                 URL: https://issues.apache.org/jira/browse/OAK-7947
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene, query
>            Reporter: Thomas Mueller
>            Assignee: Thomas Mueller
>            Priority: Major
>         Attachments: OAK-7947.patch
>
>
> Right now, all Lucene index binaries are loaded on startup (I think when the
> first query is run, to do cost calculation). This is a performance problem if
> the index files are large, and need to be downloaded from the data store.