[ https://issues.apache.org/jira/browse/LUCENE-6840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960514#comment-14960514 ]
Adrien Grand commented on LUCENE-6840:
--------------------------------------
I did a quick benchmark with the geonames dataset (10740477 documents), with
only two doc values fields:
- name as a binary dv field
- alternatenames as a sorted set dv field
Then I measured disk/memory usage and tried to sort on both fields (using
SortedSetSelector.Type.MIN for the multi-valued field) with a MatchAllDocsQuery:
||branch||index size (MB)||memory usage (MB)||time to sort on "name" (ms)||time to sort on "alternatenames" (ms)||
|trunk|311|33.1|4750|3570|
|patch|319 (+2%)|1.4 (-96%)|5390 (+13%)|4000 (+12%)|
Maybe there are things we can optimize with the patch, but even with these
numbers I think this patch has a better trade-off: I am not very happy that the
current format takes more than 3 bytes of memory per document for only two doc
values fields.
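
For reference, the per-document figure can be checked from the numbers in the
table. The sketch below is only a back-of-the-envelope calculation; the class
and method names are made up for illustration, and it assumes "MB" means 2^20
bytes, which the benchmark does not state. Under that assumption trunk comes
out at roughly 3.2 bytes per document and the patch at roughly 0.14 (with
10^6-byte megabytes trunk is still slightly above 3).

```java
// Back-of-the-envelope check of the benchmark figures.
// Assumption: "MB" means 2^20 bytes; the benchmark does not say which unit was used.
final class DocValuesMemoryEstimate {
    // Number of documents in the geonames dataset, as reported in the benchmark.
    static final long DOC_COUNT = 10_740_477L;

    /** Heap bytes per document for a memory footprint given in MB. */
    static double bytesPerDoc(double megabytes) {
        return megabytes * 1024 * 1024 / DOC_COUNT;
    }

    /** Relative change from baseline to candidate, in percent. */
    static double percentChange(double baseline, double candidate) {
        return (candidate - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        System.out.printf("trunk: %.2f bytes/doc%n", bytesPerDoc(33.1));
        System.out.printf("patch: %.2f bytes/doc%n", bytesPerDoc(1.4));
        System.out.printf("memory change: %.0f%%%n", percentChange(33.1, 1.4));
    }
}
```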
> Put ord indexes of doc values on disk
> -------------------------------------
>
> Key: LUCENE-6840
> URL: https://issues.apache.org/jira/browse/LUCENE-6840
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Attachments: LUCENE-6840.patch
>
>
> Currently we still load monotonic blocks into memory to map doc ids to an
> offset on disk. Since these data structures are usually consumed sequentially
> I would like to investigate putting them to disk.
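
To illustrate what such a monotonic block holds, here is a simplified sketch.
It is not Lucene's implementation: the class name, fields and methods are
hypothetical, and the deviations are kept in a plain long[] where a real
format would bit-pack them. The idea it shows is that a monotonically
increasing sequence of offsets can be stored as a base value, an average
slope, and a small per-entry deviation from the line those two define.

```java
// Simplified illustration of a monotonic block; NOT Lucene's actual code.
final class MonotonicBlockSketch {
    private final long base;          // first value of the block
    private final double avgDelta;    // average increase per entry
    private final long minDeviation;  // smallest (value - expected) in the block
    private final long[] deviations;  // non-negative deviations; a real format would bit-pack these

    MonotonicBlockSketch(long[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("empty block");
        }
        base = values[0];
        avgDelta = values.length == 1
            ? 0.0
            : (double) (values[values.length - 1] - base) / (values.length - 1);

        // Deviation of each value from the straight line base + avgDelta * index.
        long[] raw = new long[values.length];
        long min = Long.MAX_VALUE;
        for (int i = 0; i < values.length; i++) {
            raw[i] = values[i] - expected(i);
            min = Math.min(min, raw[i]);
        }
        // Shift so that every stored deviation is >= 0.
        for (int i = 0; i < raw.length; i++) {
            raw[i] -= min;
        }
        minDeviation = min;
        deviations = raw;
    }

    // Value predicted by the linear model alone.
    private long expected(int index) {
        return base + (long) (avgDelta * index);
    }

    /** Reconstructs the original value at the given index. */
    long get(int index) {
        return expected(index) + minDeviation + deviations[index];
    }

    /** Bits needed per stored deviation; 0 if all values lie exactly on the line. */
    int bitsPerValue() {
        long max = 0;
        for (long d : deviations) {
            max = Math.max(max, d);
        }
        return 64 - Long.numberOfLeadingZeros(max);
    }
}
```

Because each lookup only needs the block header and one small deviation,
reading entries in increasing index order touches the data sequentially,
which is the access pattern the description relies on.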