[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

Han Jiang (JIRA) Sat, 13 Jul 2013 03:06:40 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707709#comment-13707709
 ]


Han Jiang edited comment on LUCENE-3069 at 7/13/13 10:05 AM:
-------------------------------------------------------------

I ran CheckIndex on wikimediumall.trunk.Lucene41.nd33.3326M.

Here is the bit-width summary for the "body" field (each row counts the terms whose value needs that many bits):


||bit||#(df==ttf)||#df||#ttf||
| 1| 43532656 | 48860170| 43532656|
| 2| 10328824 | 13979539| 16200377|
| 3| 2682453 | 5032450| 6532755|
| 4| 836109 | 2471794| 3134437|
| 5| 262696 | 1324704| 1718862|
| 6| 86487 | 755797| 990563|
| 7| 29276 | 442974| 571996|
| 8| 11257 | 263874| 339382|
| 9| 4627 | 161402| 205662|
|10| 2060 | 102198| 128034|
|11| 979 | 63955| 79531|
|12| 386 | 39377| 48805|
|13| 170 | 24321| 30113|
|14| 65 | 14686| 18437|
|15| 10 | 9055| 10918|
|16| 2 | 5229| 6821|
|17| 0 | 2669| 3595|
|18| 0 | 1312| 1897|
|19| 0 | 696| 914|
|20| 0 | 209| 509|
|21| 0 | 44| 148|
|22| 0 | 4| 38|
|23| 0 | 0| 8|
|24| 0 | 0| 1|
|...|0|0|0|
|tot|57778057|73556459|73556459|

So 66.4% of terms have df==1, and 78.5% have df==ttf.
Taking the different bit widths into account, the df+ttf encoding saves
a total of 57.3MB out of 148.7MB, using the following estimate:


{noformat}
old_size = col[2] * vIntByteSize(rownumber)   + col[3] * vIntByteSize(rownumber)
new_size = col[2] * vIntByteSize(rownumber+1) + (col[3] - col[1]) * vIntByteSize(rownumber)
{noformat}
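As a cross-check, the estimate above can be recomputed directly from the table. This is only a sketch under two assumptions that the comment does not state: that vIntByteSize(n) means ceil(n / 7) bytes for an n-bit value (7 payload bits per VInt byte), and that "MB" means 10^6 bytes. The names ROWS, vint_byte_size and estimate are made up for this illustration. Under those assumptions the script arrives at the same 148.7MB and 57.3MB figures.

```python
# Rows of the "body" table: (bit width, #(df==ttf), #df, #ttf).
ROWS = [
    (1, 43532656, 48860170, 43532656), (2, 10328824, 13979539, 16200377),
    (3, 2682453, 5032450, 6532755), (4, 836109, 2471794, 3134437),
    (5, 262696, 1324704, 1718862), (6, 86487, 755797, 990563),
    (7, 29276, 442974, 571996), (8, 11257, 263874, 339382),
    (9, 4627, 161402, 205662), (10, 2060, 102198, 128034),
    (11, 979, 63955, 79531), (12, 386, 39377, 48805),
    (13, 170, 24321, 30113), (14, 65, 14686, 18437),
    (15, 10, 9055, 10918), (16, 2, 5229, 6821),
    (17, 0, 2669, 3595), (18, 0, 1312, 1897),
    (19, 0, 696, 914), (20, 0, 209, 509),
    (21, 0, 44, 148), (22, 0, 4, 38),
    (23, 0, 0, 8), (24, 0, 0, 1),
]


def vint_byte_size(bits):
    """Assumed meaning of vIntByteSize: 7 payload bits per byte, rounded up."""
    return (bits + 6) // 7


def estimate(rows):
    """Apply the old_size / new_size formulas row by row and sum the bytes."""
    old_size = new_size = 0
    for bits, eq, df, ttf in rows:
        old_size += df * vint_byte_size(bits) + ttf * vint_byte_size(bits)
        # df takes one extra bit; ttf is skipped for the terms where df == ttf.
        new_size += df * vint_byte_size(bits + 1) + (ttf - eq) * vint_byte_size(bits)
    return old_size, new_size


old_size, new_size = estimate(ROWS)
saved = old_size - new_size
total_terms = sum(row[2] for row in ROWS)
pct_df_is_one = 100.0 * ROWS[0][2] / total_terms
pct_df_eq_ttf = 100.0 * sum(row[1] for row in ROWS) / total_terms

print("old size: %.1f MB" % (old_size / 1e6))
print("saved:    %.1f MB" % (saved / 1e6))
print("df==1: %.1f%%, df==ttf: %.1f%%" % (pct_df_is_one, pct_df_eq_ttf))
```

The rownumber+1 term reads as df spending one extra bit, presumably a flag that says whether ttf follows; that interpretation is a guess from the formula, not something the comment spells out.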


By the way, I am quite tempted to omit the frq blocks in Lucene41PostingsReader.
When we know that df==ttf, every in-doc frq must be 1. For example, when the
bit width ranges from 2 to 8 (inclusive), df is not large enough to create
ForBlocks, so we have to VInt-encode each in-doc freq. For this 'body' field,
I think we can reduce the index size by about 67.5MB (here I only consider
the vInt blocks, since a 1-bit ForBlock is usually small).
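The reasoning behind df==ttf implying frq==1 is a counting argument: ttf is the sum of the per-document freqs, each of the df documents contributes at least 1, so the sum can equal df only if every freq is exactly 1. A minimal sketch of how a reader could exploit this follows; freqs_are_implied and decode_freqs are hypothetical helpers for illustration, not part of Lucene41PostingsReader.

```python
def freqs_are_implied(df, ttf):
    """True when every in-doc freq must be 1, so no freq data needs storing."""
    return df == ttf


def decode_freqs(df, ttf, stored_freqs=None):
    """Return the per-document freqs, synthesizing them when df == ttf.

    stored_freqs stands in for whatever the postings file would hold;
    it is only consulted when the freqs cannot be inferred.
    """
    if freqs_are_implied(df, ttf):
        return [1] * df
    return list(stored_freqs)


# Sanity check of the counting argument on a few small postings lists.
for freqs in ([1, 1, 1], [1, 2, 1], [3], [1]):
    df, ttf = len(freqs), sum(freqs)
    assert freqs_are_implied(df, ttf) == all(f == 1 for f in freqs)
    assert decode_freqs(df, ttf, freqs) == freqs
```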

Across all the fields in wikimediumall, we can save 60.8MB out of 245.2MB (for
df+ttf only), while the vInt frq blocks we could omit from PBF amount to about
95.8MB, I suppose.

I'll test this later.
                
> Lucene should have an entirely memory resident term dictionary
> --------------------------------------------------------------
>
>                 Key: LUCENE-3069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3069
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Simon Willnauer
>            Assignee: Han Jiang
>              Labels: gsoc2013
>             Fix For: 4.4
>
>         Attachments: example.png, LUCENE-3069.patch
>
>
> FST based TermDictionary has been a great improvement yet it still uses a 
> delta codec file for scanning to terms. Some environments have enough memory 
> available to keep the entire FST based term dict in memory. We should add a 
> TermDictionary implementation that encodes all needed information for each 
> term into the FST (custom fst.Output) and builds a FST from the entire term 
> not just the delta.
