[jira] [Commented] (OAK-7947) Lazy loading of Lucene index files startup

Vikas Saurabh (JIRA) Mon, 10 Dec 2018 04:20:36 -0800


    [ 
https://issues.apache.org/jira/browse/OAK-7947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16714647#comment-16714647
 ]


Vikas Saurabh commented on OAK-7947:
------------------------------------

[~tmueller], the patch you attached seems quite risky to me (it touches 
quite a lot of places), and it would only solve "avoid opening the index for as 
long as possible wrt index definitions". If an index definition can potentially 
answer the query and we want to open the index to get, say, num docs or num 
docs per field, then we would still copy in all index files. I have at least 
one comment on the patch, which I've noted at the end.
 
Maybe we could try a different approach: let the index open happen as it does 
today, but copy the required files right away (synchronously) and schedule the 
rest of the files for later. Here's a snippet of a size-sorted list of files 
from an 11G {{damAssetLucene}} index that [~chibulcu] had provided me from an 
AEM instance:
{noformat}
$ ls -lsSh
total 11G
4.5G -rw-r--r-- 1 vsaurabh vsaurabh 4.5G Nov 23 12:20 _101z.fdt
4.5G -rw-r--r-- 1 vsaurabh vsaurabh 4.5G Nov 23 13:43 _1zt4.fdt
580M -rw-r--r-- 1 vsaurabh vsaurabh 580M Nov 23 12:20 _101z.pos
579M -rw-r--r-- 1 vsaurabh vsaurabh 579M Nov 23 13:43 _1zt4.pos
177M -rw-r--r-- 1 vsaurabh vsaurabh 177M Nov 23 13:44 _20z0.cfs
106M -rw-r--r-- 1 vsaurabh vsaurabh 106M Nov 23 12:20 _1x4o.cfs
 65M -rw-r--r-- 1 vsaurabh vsaurabh  65M Nov 23 13:44 _20bb.cfs
 29M -rw-r--r-- 1 vsaurabh vsaurabh  29M Nov 23 13:44 _217z.cfs
 16M -rw-r--r-- 1 vsaurabh vsaurabh  16M Nov 23 12:10 _101z.doc
 16M -rw-r--r-- 1 vsaurabh vsaurabh  16M Nov 23 12:20 _1zt4.doc
6.7M -rw-r--r-- 1 vsaurabh vsaurabh 6.7M Nov 23 13:44 _21ef.cfs
6.5M -rw-r--r-- 1 vsaurabh vsaurabh 6.5M Nov 23 13:44 _216f.cfs
6.3M -rw-r--r-- 1 vsaurabh vsaurabh 6.3M Nov 23 12:20 _101z.tim
5.9M -rw-r--r-- 1 vsaurabh vsaurabh 5.9M Nov 23 13:43 _1zt4.tim
5.9M -rw-r--r-- 1 vsaurabh vsaurabh 5.9M Nov 23 13:44 _21cy.cfs
4.4M -rw-r--r-- 1 vsaurabh vsaurabh 4.4M Nov 23 13:44 _21ab.cfs
3.8M -rw-r--r-- 1 vsaurabh vsaurabh 3.8M Nov 23 13:44 _21e4.cfs
3.7M -rw-r--r-- 1 vsaurabh vsaurabh 3.7M Nov 23 13:44 _21du.cfs
3.0M -rw-r--r-- 1 vsaurabh vsaurabh 3.0M Nov 23 13:44 _21dk.cfs
2.6M -rw-r--r-- 1 vsaurabh vsaurabh 2.6M Nov 23 13:44 _21f1.cfs
648K -rw-r--r-- 1 vsaurabh vsaurabh 647K Nov 23 12:10 _101z.dvd
424K -rw-r--r-- 1 vsaurabh vsaurabh 421K Nov 23 12:20 _1zt4.dvd
380K -rw-r--r-- 1 vsaurabh vsaurabh 378K Nov 23 12:20 _101z.fdx
372K -rw-r--r-- 1 vsaurabh vsaurabh 369K Nov 23 13:43 _1zt4.fdx
120K -rw-r--r-- 1 vsaurabh vsaurabh 120K Nov 23 13:44 _21f7.cfs
120K -rw-r--r-- 1 vsaurabh vsaurabh 120K Nov 23 13:44 _21f4.cfs
....
....
{noformat}

Looking at 
https://lucene.apache.org/core/4_7_1/core/org/apache/lucene/codecs/lucene46/package-summary.html,
 {{fdt}} files hold stored field data and {{pos}} files hold positional data 
for indexed terms. Afaict, neither should need to be loaded just for cost 
evaluation (we should probably try to confirm this btw). These 2 form the 
biggest chunk of the files (roughly 10G of the 11G above), so maybe just 
avoiding copying them over merely to open an index would save us a lot of time 
on first-time index open. Additionally, I think this approach is much less 
risky.
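
A rough sketch of the file split this approach implies, only to make the idea 
concrete. The class and method names ({{StagedIndexCopy}}, {{isDeferred}}, 
{{deferredSize}}) are hypothetical and not part of Oak, the sizes are rounded 
from the listing above, and the choice of {{fdt}} and {{pos}} as deferrable is 
the still-unconfirmed assumption discussed above:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class StagedIndexCopy {
    // Assumption (to be confirmed): stored fields (fdt) and positions (pos)
    // are not needed just to open the index for cost evaluation.
    static final Set<String> DEFERRED_EXTENSIONS =
            new HashSet<>(Arrays.asList("fdt", "pos"));

    /** True if the file could be scheduled for a later (asynchronous) copy. */
    static boolean isDeferred(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot >= 0 && DEFERRED_EXTENSIONS.contains(fileName.substring(dot + 1));
    }

    /** Sums the sizes of all files that would not be copied synchronously. */
    static long deferredSize(Map<String, Long> sizesByName) {
        long total = 0;
        for (Map.Entry<String, Long> e : sizesByName.entrySet()) {
            if (isDeferred(e.getKey())) {
                total += e.getValue();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Approximate sizes in MB for a few of the files in the listing.
        Map<String, Long> sizes = new LinkedHashMap<>();
        sizes.put("_101z.fdt", 4608L);
        sizes.put("_1zt4.fdt", 4608L);
        sizes.put("_101z.pos", 580L);
        sizes.put("_1zt4.pos", 579L);
        sizes.put("_20z0.cfs", 177L);
        sizes.put("_1x4o.cfs", 106L);

        for (String name : sizes.keySet()) {
            System.out.println(name + " -> " + (isDeferred(name) ? "copy later" : "copy now"));
        }
        System.out.println("MB deferred: " + deferredSize(sizes));
    }
}
```

A real implementation would hook into wherever index files are copied locally 
today and would need a fallback for a query touching a file that hasn't 
arrived yet; the sketch only shows the partitioning.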

_patch review_
The changes in
{noformat}
public LuceneIndexDefinition getIndexDefinition(String indexPath){
{noformat}
when the index isn't in the index map, provide a definition that is visible in 
the tree rather than the stored index definition. This would change the 
behavior of the planner, which would start to use un-indexed index definitions 
as well.

Afaics, the other changes are essentially doing lazy init and won't affect 
behavior, but they do make it a little brittle to keep avoiding index open (an 
unrelated part of the code might start calling something that would in turn 
happily open the index).
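
To illustrate the brittleness concern in general terms, here is a minimal lazy 
init holder. It is a generic sketch and not code from the patch; 
{{LazyOpenSketch}} and {{Lazy}} are hypothetical names. Any caller that reaches 
{{get()}}, directly or through some helper, triggers the expensive open:

```java
import java.util.function.Supplier;

public class LazyOpenSketch {
    /** Runs the opener on the first get() and caches the result. */
    static final class Lazy<T> {
        private final Supplier<T> opener;
        private T value;
        private boolean opened;

        Lazy(Supplier<T> opener) {
            this.opener = opener;
        }

        synchronized T get() {
            if (!opened) {
                value = opener.get(); // the expensive step, e.g. copying index files
                opened = true;
            }
            return value;
        }

        synchronized boolean isOpened() {
            return opened;
        }
    }

    public static void main(String[] args) {
        // The string stands in for an opened index; nothing real is opened here.
        Lazy<String> index = new Lazy<>(() -> "opened index");

        System.out.println("opened before use: " + index.isOpened());
        // Any unrelated code path that needs something from the index ends up here.
        int length = index.get().length();
        System.out.println("opened after use: " + index.isOpened() + " (length " + length + ")");
    }
}
```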

> Lazy loading of Lucene index files startup
> ------------------------------------------
>
>                 Key: OAK-7947
>                 URL: https://issues.apache.org/jira/browse/OAK-7947
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene, query
>            Reporter: Thomas Mueller
>            Assignee: Thomas Mueller
>            Priority: Major
>         Attachments: OAK-7947.patch
>
>
> Right now, all Lucene index binaries are loaded on startup (I think when the 
> first query is run, to do cost calculation). This is a performance problem if 
> the index files are large, and need to be downloaded from the data store.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
