[jira] [Commented] (LUCENE-4764) Faster but more RAM/Disk consuming DocValuesFormat for facets

Shai Erera (JIRA) Sat, 09 Feb 2013 07:59:15 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-4764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575189#comment-13575189 ]


Shai Erera commented on LUCENE-4764:
------------------------------------

bq. i wonder how it would perform if it wrote and kept in ram packed ints, 
since it knows whats in the byte[]

We've tried that in the past. I don't remember on which issue we posted the 
results, but they were not compelling. We compared keeping the ordinals as a 
plain int[] against keeping them as packed ints: int[] was (IIRC) 50% faster, 
while packed ints were only ~6-10% faster. Their RAM footprints were also very 
close. The problem is that packed ints only pay off if you know something 
about the numbers, e.g. their size or distribution. The category ordinals in 
this Wikipedia index have nothing "special" about them: every document holds 
close-to-arbitrary integers between 1 and 2.2M ...

If the following math holds: 25 ords per document at 4 bytes each (that's 
100 bytes/doc) x 6.6M documents comes to ~660MB (offsets not included). I 
suspect that packed ints will consume approximately the same amount (at least, 
per past results) but won't yield significantly better performance. Therefore, 
if we want to cache anything at the int level, we should do an int[] caching 
aggregator.
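
For anyone who wants to re-check the arithmetic above, here is a minimal 
sketch. The class and method names (OrdinalFootprint, intArrayBytes, 
bitsRequired) are made up for illustration and are not Lucene API. The inputs 
are the figures quoted in this comment (25 ords/doc, 6.6M docs, ordinals up to 
~2.2M). It only reproduces the back-of-envelope estimate and measures nothing.

```java
// Back-of-envelope sizing for category ordinals cached as a plain int[].
// All names here are hypothetical; none of this is Lucene API.
final class OrdinalFootprint {

    /** Bytes needed to hold every ordinal as a 4-byte int (offsets not included). */
    static long intArrayBytes(int ordsPerDoc, long numDocs) {
        return (long) ordsPerDoc * Integer.BYTES * numDocs;
    }

    /** Minimum number of bits that can represent any value in [0, maxValue]. */
    static int bitsRequired(long maxValue) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
    }

    public static void main(String[] args) {
        // Figures quoted in the comment: 25 ords/doc, 6.6M docs, ords up to ~2.2M.
        long bytes = intArrayBytes(25, 6_600_000L);
        System.out.println(bytes / 1_000_000 + " MB as int[]");
        System.out.println(bitsRequired(2_200_000L) + " bits for the largest ordinal");
    }
}
```

The bit width is shown only because it is the kind of "size" knowledge a packed 
encoding relies on. It is a raw width, not a footprint or speed measurement, 
so the past benchmark results remain the thing to go by.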

Mike, correct me if I'm wrong.
                
> Faster but more RAM/Disk consuming DocValuesFormat for facets
> -------------------------------------------------------------
>
>                 Key: LUCENE-4764
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4764
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.2, 5.0
>
>         Attachments: LUCENE-4764.patch
>
>
> The new default DV format for binary fields has much more
> RAM-efficient encoding of the address for each document ... but it's
> also a bit slower at decode time, which affects facets because we
> decode for every collected docID.

