[
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707709#comment-13707709
]
Han Jiang edited comment on LUCENE-3069 at 7/13/13 11:02 AM:
-------------------------------------------------------------
I ran CheckIndex on wikimediumall.trunk.Lucene41.nd33.3326M.
Here is the bit width summary for the "body" field (each row counts the terms whose value needs that many bits):
||bit||#(df==ttf)||#df||#ttf||
| 1| 43532656 | 48860170| 43532656|
| 2| 10328824 | 13979539| 16200377|
| 3| 2682453 | 5032450| 6532755|
| 4| 836109 | 2471794| 3134437|
| 5| 262696 | 1324704| 1718862|
| 6| 86487 | 755797| 990563|
| 7| 29276 | 442974| 571996|
| 8| 11257 | 263874| 339382|
| 9| 4627 | 161402| 205662|
|10| 2060 | 102198| 128034|
|11| 979 | 63955| 79531|
|12| 386 | 39377| 48805|
|13| 170 | 24321| 30113|
|14| 65 | 14686| 18437|
|15| 10 | 9055| 10918|
|16| 2 | 5229| 6821|
|17| 0 | 2669| 3595|
|18| 0 | 1312| 1897|
|19| 0 | 696| 914|
|20| 0 | 209| 509|
|21| 0 | 44| 148|
|22| 0 | 4| 38|
|23| 0 | 0| 8|
|24| 0 | 0| 1|
|...|0|0|0|
|tot|57778057|73556459|73556459|
So 66.4% of the terms have df==1, and 78.5% have df==ttf.
Using the estimation below, the old size for (df+ttf) here is 148.7MB.
When we steal one bit to mark whether df==ttf, it is reduced to 91.38MB.
When we use df==0 to mark df==ttf==1, wow, it is reduced to 70.31MB, thanks
Robert!
{noformat}
old_size = col[2] * vIntByteSize(rownumber) + col[3] * vIntByteSize(rownumber)
new_size = col[2] * vIntByteSize(rownumber+1) + (col[3] - col[1]) * vIntByteSize(rownumber)
opt_size = col[2] * vIntByteSize(rownumber) + ((rownumber == 1) ? 0 : col[3] * vIntByteSize(rownumber))
{noformat}
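For anyone who wants to replay the first two estimates, here is a rough Python sketch over the table above. It assumes vIntByteSize(n) means the number of bytes a VInt needs for an n-bit value, i.e. ceil(n/7), and that MB means 10^6 bytes; the names ROWS, vint_byte_size and estimate_sizes are made up for this sketch. It only covers old_size and new_size.

```python
# (bit width, #terms with df==ttf, #terms by df width, #terms by ttf width)
ROWS = [
    (1, 43532656, 48860170, 43532656), (2, 10328824, 13979539, 16200377),
    (3, 2682453, 5032450, 6532755), (4, 836109, 2471794, 3134437),
    (5, 262696, 1324704, 1718862), (6, 86487, 755797, 990563),
    (7, 29276, 442974, 571996), (8, 11257, 263874, 339382),
    (9, 4627, 161402, 205662), (10, 2060, 102198, 128034),
    (11, 979, 63955, 79531), (12, 386, 39377, 48805),
    (13, 170, 24321, 30113), (14, 65, 14686, 18437),
    (15, 10, 9055, 10918), (16, 2, 5229, 6821),
    (17, 0, 2669, 3595), (18, 0, 1312, 1897),
    (19, 0, 696, 914), (20, 0, 209, 509),
    (21, 0, 44, 148), (22, 0, 4, 38),
    (23, 0, 0, 8), (24, 0, 0, 1),
]


def vint_byte_size(bits: int) -> int:
    """Bytes needed to VInt-encode a value of the given bit width (assumed: 7 payload bits per byte)."""
    return max(1, (bits + 6) // 7)


def estimate_sizes(rows):
    """Return (old_size, new_size) in bytes, following the formulas in the comment."""
    old_size = 0
    new_size = 0
    for bits, eq, df, ttf in rows:
        # old: df and ttf are each written as a plain VInt
        old_size += df * vint_byte_size(bits) + ttf * vint_byte_size(bits)
        # new: df carries one extra flag bit; ttf is skipped when df==ttf
        new_size += df * vint_byte_size(bits + 1) + (ttf - eq) * vint_byte_size(bits)
    return old_size, new_size


total_terms = sum(df for _, _, df, _ in ROWS)
old_size, new_size = estimate_sizes(ROWS)
print(f"terms: {total_terms}")
print(f"old_size: {old_size / 1e6:.1f}MB")   # 148.7MB
print(f"new_size: {new_size / 1e6:.2f}MB")   # 91.38MB
```

The same loop also gives the two percentages: ROWS[0][2] / total_terms for df==1, and the sum of the second column over total_terms for df==ttf.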
By the way, I am quite tempted to omit frq blocks in Lucene41PostingsReader.
When we know that df==ttf, we can be sure that every in-doc freq==1. So for
example, when the bit width ranges from 2 to 8 (inclusive), since df is not
large enough to create ForBlocks, we have to VInt encode each in-doc freq.
For this 'body' field, -I think the index size we can reduce is about 67.5MB-
-(here I only consider the vInt block, since a 1-bit ForBlock is usually small)-
(ah, I forgot we already steal a bit for this case in Lucene41PBF).
I'll test this later.
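To make the bit stealing for the freq==1 case concrete, here is a small Python sketch of the idea: the low bit of the VInt-encoded doc delta flags freq==1, so the freq is only written when it is larger than 1. This is an illustration under my own assumptions, not the actual Lucene41 postings format code, and all function names are hypothetical.

```python
def write_vint(out: bytearray, value: int) -> None:
    """Append a non-negative int as a VInt: 7 payload bits per byte, high bit = continuation."""
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)


def read_vint(buf: bytes, pos: int):
    """Read one VInt starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result, pos
        shift += 7


def encode_postings(postings) -> bytes:
    """postings: list of (doc_delta, freq). The low bit of the first VInt marks freq==1."""
    out = bytearray()
    for doc_delta, freq in postings:
        if freq == 1:
            write_vint(out, (doc_delta << 1) | 1)  # flag set, freq omitted
        else:
            write_vint(out, doc_delta << 1)        # flag clear, freq follows
            write_vint(out, freq)
    return bytes(out)


def decode_postings(buf: bytes, count: int):
    """Inverse of encode_postings for `count` entries."""
    postings = []
    pos = 0
    for _ in range(count):
        code, pos = read_vint(buf, pos)
        if code & 1:
            freq = 1
        else:
            freq, pos = read_vint(buf, pos)
        postings.append((code >> 1, freq))
    return postings


example = [(3, 1), (1, 5), (200, 1)]
encoded = encode_postings(example)
print(len(encoded), decode_postings(encoded, len(example)))
```

With a flag like this, a term with df==ttf pays no separate freq bytes in the vInt part, which is why the extra saving from omitting frq blocks there should be small.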
> Lucene should have an entirely memory resident term dictionary
> --------------------------------------------------------------
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index, core/search
> Affects Versions: 4.0-ALPHA
> Reporter: Simon Willnauer
> Assignee: Han Jiang
> Labels: gsoc2013
> Fix For: 4.4
>
> Attachments: example.png, LUCENE-3069.patch
>
>
> FST based TermDictionary has been a great improvement yet it still uses a
> delta codec file for scanning to terms. Some environments have enough memory
> available to keep the entire FST based term dict in memory. We should add a
> TermDictionary implementation that encodes all needed information for each
> term into the FST (custom fst.Output) and builds a FST from the entire term
> not just the delta.