[jira] [Comment Edited] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

Han Jiang (JIRA) Sat, 13 Jul 2013 04:04:03 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707709#comment-13707709
 ]


Han Jiang edited comment on LUCENE-3069 at 7/13/13 11:02 AM:
-------------------------------------------------------------

I ran CheckIndex on wikimediumall.trunk.Lucene41.nd33.3326M.

Here is the bit width summary for the "body" field:


||bit||#(df==ttf)||#df||#ttf||
| 1| 43532656 | 48860170| 43532656|
| 2| 10328824 | 13979539| 16200377|
| 3| 2682453 | 5032450| 6532755|
| 4| 836109 | 2471794| 3134437|
| 5| 262696 | 1324704| 1718862|
| 6| 86487 | 755797| 990563|
| 7| 29276 | 442974| 571996|
| 8| 11257 | 263874| 339382|
| 9| 4627 | 161402| 205662|
|10| 2060 | 102198| 128034|
|11| 979 | 63955| 79531|
|12| 386 | 39377| 48805|
|13| 170 | 24321| 30113|
|14| 65 | 14686| 18437|
|15| 10 | 9055| 10918|
|16| 2 | 5229| 6821|
|17| 0 | 2669| 3595|
|18| 0 | 1312| 1897|
|19| 0 | 696| 914|
|20| 0 | 209| 509|
|21| 0 | 44| 148|
|22| 0 | 4| 38|
|23| 0 | 0| 8|
|24| 0 | 0| 1|
|...|0|0|0|
|tot|57778057|73556459|73556459|

So 66.4% of the terms have df==1, and 78.5% have df==ttf.
Using the estimate below, the old size for (df+ttf) here is 148.7MB.

When we steal one bit to mark whether df==ttf, it is reduced to 91.38MB.
When we use df==0 to mark df==ttf==1, wow, it is reduced to 70.31MB, thanks Robert!
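To make the two encodings concrete, here is a minimal Python sketch of how a (df, ttf) pair could be written under each scheme. It is only an illustration: the function names are hypothetical, the VInt helper is a generic 7-bits-per-byte encoder rather than Lucene's own code, and the df==0 variant reflects my reading of the suggestion (df is never 0 for a real term, so 0 is free to act as a marker).

```python
def write_vint(value, out):
    """Append value to out, 7 data bits per byte, high bit = 'more bytes follow'."""
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)


def read_vint(buf, pos):
    """Read one VInt from buf starting at pos; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result, pos
        shift += 7


def encode_flag_bit(df, ttf, out):
    """Scheme 1: steal the low bit of df to say whether df == ttf."""
    if df == ttf:
        write_vint((df << 1) | 1, out)   # ttf is implied, nothing else written
    else:
        write_vint(df << 1, out)
        write_vint(ttf, out)


def decode_flag_bit(buf, pos):
    code, pos = read_vint(buf, pos)
    df = code >> 1
    if code & 1:
        return df, df, pos
    ttf, pos = read_vint(buf, pos)
    return df, ttf, pos


def encode_zero_marker(df, ttf, out):
    """Scheme 2 (assumed reading): write df == 0 to mean df == ttf == 1."""
    if df == 1 and ttf == 1:
        write_vint(0, out)               # single marker byte, ttf omitted
    else:
        write_vint(df, out)
        write_vint(ttf, out)


def decode_zero_marker(buf, pos):
    df, pos = read_vint(buf, pos)
    if df == 0:
        return 1, 1, pos
    ttf, pos = read_vint(buf, pos)
    return df, ttf, pos
```

The second scheme keeps df at its natural width for every term, which is why it only pays off for the very common df==ttf==1 case, while the first scheme covers every df==ttf term at the cost of one extra bit on df.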

{noformat}
old_size = col[2] * vIntByteSize(rownumber)   + col[3] * vIntByteSize(rownumber)
new_size = col[2] * vIntByteSize(rownumber+1) + (col[3] - col[1]) * vIntByteSize(rownumber)
opt_size = col[2] * vIntByteSize(rownumber)   + ((rownumber == 1) ? 0 : col[3] * vIntByteSize(rownumber))
{noformat}
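For anyone who wants to re-run the arithmetic, here is a small Python sketch that applies the old_size and new_size formulas to the table above, summing over the rows. The assumptions are mine: vIntByteSize(bits) is taken to be ceil(bits / 7), "MB" is taken to be 10^6 bytes, and the names are made up for the example. It only covers old_size and new_size.

```python
# Rows copied from the table: (bit width, #(df==ttf), #df, #ttf)
ROWS = [
    (1, 43532656, 48860170, 43532656),
    (2, 10328824, 13979539, 16200377),
    (3, 2682453, 5032450, 6532755),
    (4, 836109, 2471794, 3134437),
    (5, 262696, 1324704, 1718862),
    (6, 86487, 755797, 990563),
    (7, 29276, 442974, 571996),
    (8, 11257, 263874, 339382),
    (9, 4627, 161402, 205662),
    (10, 2060, 102198, 128034),
    (11, 979, 63955, 79531),
    (12, 386, 39377, 48805),
    (13, 170, 24321, 30113),
    (14, 65, 14686, 18437),
    (15, 10, 9055, 10918),
    (16, 2, 5229, 6821),
    (17, 0, 2669, 3595),
    (18, 0, 1312, 1897),
    (19, 0, 696, 914),
    (20, 0, 209, 509),
    (21, 0, 44, 148),
    (22, 0, 4, 38),
    (23, 0, 0, 8),
    (24, 0, 0, 1),
]


def vint_byte_size(bits):
    """Bytes needed by a VInt holding a value of the given bit width (7 bits per byte)."""
    return max(1, (bits + 6) // 7)


def old_size(rows):
    """df and ttf both written as plain VInts."""
    return sum(df * vint_byte_size(b) + ttf * vint_byte_size(b)
               for b, _eq, df, ttf in rows)


def new_size(rows):
    """df carries one stolen bit; ttf is written only when df != ttf."""
    return sum(df * vint_byte_size(b + 1) + (ttf - eq) * vint_byte_size(b)
               for b, eq, df, ttf in rows)


if __name__ == "__main__":
    print("old_size: %.2f MB" % (old_size(ROWS) / 1e6))
    print("new_size: %.2f MB" % (new_size(ROWS) / 1e6))
```

Under these assumptions the sums come out to 148,720,880 and 91,380,959 bytes, which agrees with the 148.7MB and 91.38MB figures above.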


By the way, I am quite tempted to omit frq blocks in Lucene41PostingsReader.
When we know that df==ttf, we can be sure that every in-doc frq==1. So, for example,
when the bit width ranges from 2 to 8 (inclusive), df is not large enough to create ForBlocks,
so we have to VInt encode each in-doc freq. For this 'body' field, -I think the index size we can reduce is about 67.5MB-
-(here I only consider vInt block, since 1-bit ForBlock is usually small)- (ah, I forgot that we already steal a bit for this case in Lucene41PBF).

I'll test this later.
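
For reference, the bit stealing mentioned above looks roughly like the following for VInt-encoded postings: the doc delta is shifted left by one and the low bit says "freq is 1", so a separate freq VInt is written only when freq > 1. This is a Python sketch of the idea as I understand it, not the actual Lucene41 code; the function names are hypothetical.

```python
def write_vint(value, out):
    """Append value to out, 7 data bits per byte, high bit = 'more bytes follow'."""
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)


def read_vint(buf, pos):
    """Read one VInt from buf starting at pos; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result, pos
        shift += 7


def encode_postings(doc_ids, freqs):
    """Encode sorted doc ids with freqs; low bit of each doc delta flags freq == 1."""
    out = bytearray()
    last_doc = 0
    for doc, freq in zip(doc_ids, freqs):
        delta = doc - last_doc
        last_doc = doc
        if freq == 1:
            write_vint((delta << 1) | 1, out)   # freq is implied by the flag bit
        else:
            write_vint(delta << 1, out)
            write_vint(freq, out)
    return bytes(out)


def decode_postings(buf, count):
    """Inverse of encode_postings for a known number of postings."""
    doc_ids, freqs = [], []
    pos, doc = 0, 0
    for _ in range(count):
        code, pos = read_vint(buf, pos)
        doc += code >> 1
        if code & 1:
            freq = 1
        else:
            freq, pos = read_vint(buf, pos)
        doc_ids.append(doc)
        freqs.append(freq)
    return doc_ids, freqs
```

With this layout a term with df==ttf already pays nothing extra per document for its freqs in the VInt-encoded part, which is why the saving estimated above does not apply.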
                
> Lucene should have an entirely memory resident term dictionary
> --------------------------------------------------------------
>
>                 Key: LUCENE-3069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3069
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Simon Willnauer
>            Assignee: Han Jiang
>              Labels: gsoc2013
>             Fix For: 4.4
>
>         Attachments: example.png, LUCENE-3069.patch
>
>
> FST based TermDictionary has been a great improvement yet it still uses a 
> delta codec file for scanning to terms. Some environments have enough memory 
> available to keep the entire FST based term dict in memory. We should add a 
> TermDictionary implementation that encodes all needed information for each 
> term into the FST (custom fst.Output) and builds a FST from the entire term 
> not just the delta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
