[
https://issues.apache.org/jira/browse/LUCENE-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904642#action_12904642
]
Toke Eskildsen commented on LUCENE-2369:
----------------------------------------
A status update might be in order. Switching to the current Lucene trunk with
flex did require a lot of changes. Luckily it seems that they are all external
so this could be a contrib.
The current implementation (no patch yet) seems to scale fairly well. Quick
tests were made with a test index with 5 fields, one of which contained random
Strings with an average length of 10 characters. The index was not optimized.
The goal was to perform a Collator-based sorted search with fillFields=true (the
terms used for sorting are returned along with the result) to get the top 20 out
of a lot of hits. Search time was kept low by querying a field that holds the
same term in every other document, so each query matches half of the index. The
hardware was a Dell M6500 with [email protected], PC1333 RAM, Intel X-25G2 SSD.
The tests were performed in the background while coding and ZIPping 6M files.
2M document index, search hits 1M documents:
* Initial exposed search: 1:16 minutes
* Subsequent exposed searches: 45 ms
* Total heap usage for Lucene + exposed structure: 21 MB
* Initial default Lucene search: 3.3 s
* Subsequent default Lucene searches: 2.1 s
* Total heap usage for Lucene + field cache: 54 MB
20M document index, search hits 10M documents:
* Initial exposed search: 15:27 minutes
* Subsequent exposed searches: 370 ms
* Total heap usage for Lucene + exposed structure: 209 MB
* Initial default Lucene search: 28 s
* Subsequent default Lucene searches: 20 s
* Total heap usage for Lucene + field cache: 530 MB
200M document index, search hits 100M documents:
* Initial exposed search: 186:31 minutes
* Subsequent exposed searches: 3.5 s
* Total heap usage for Lucene + exposed structure: 2300 MB
* No data for default Lucene search as there was OOM with 6 GB of heap.
Observations:
* The memory requirement for the exposed structures is larger than the strict
minimum. The extra memory holds a cache that supports fast re-opening of
indexes (the order of the terms in unchanged segments is reused). An option to
disable this cache seems like an obvious addition.
* The startup time scales roughly as n * log n with the number of terms.
Extrapolating from 200M down to 2M: 186 minutes / (200M * log(200M)) * (2M *
log(2M)) ~= 1:25 minutes (observed: 1:16 minutes).
* No tests with a 100M document index yet, but about 1½ hours for the build and
1.5GB of RAM would be the expected requirement. A startup time of more than an
hour is of course excessive, but if the calculated structures were made
persistent (thereby reducing restart time to near zero), this would work well
with a classic "update once every night" scenario. This would provide
Collator-sorted search for a 100M document index on a machine with 2GB of RAM.
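
The n * log n extrapolation in the observations above can be reproduced with a few lines of Java. This is only a sketch of the arithmetic, not part of the implementation: the class and method names are made up for illustration, and it assumes that the number of terms equals the number of documents, which is only roughly true for the random-String test field.

```java
class StartupEstimate {
    /**
     * Scales a measured build time to another term count,
     * assuming the cost grows as n * log(n).
     */
    public static double extrapolateMinutes(
            double measuredMinutes, double measuredTerms, double targetTerms) {
        double measuredCost = measuredTerms * Math.log(measuredTerms);
        double targetCost = targetTerms * Math.log(targetTerms);
        return measuredMinutes * targetCost / measuredCost;
    }

    public static void main(String[] args) {
        // 186:31 minutes was the observed initial search time for the 200M index.
        double measured = 186.0 + 31.0 / 60.0;
        for (double target : new double[] {2e6, 20e6, 100e6}) {
            System.out.printf("%.0fM documents: about %.1f minutes%n",
                    target / 1e6, extrapolateMinutes(measured, 200e6, target));
        }
    }
}
```

The base of the logarithm does not matter here, as it cancels out in the ratio.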
> Locale-based sort by field with low memory overhead
> ---------------------------------------------------
>
> Key: LUCENE-2369
> URL: https://issues.apache.org/jira/browse/LUCENE-2369
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Toke Eskildsen
> Priority: Minor
>
> The current implementation of locale-based sort in Lucene uses the FieldCache
> which keeps all sort terms in memory. Besides the huge memory overhead,
> searching requires comparison of terms with collator.compare every time,
> making searches with millions of hits fairly expensive.
> This proposed alternative implementation is to create a packed list of
> pre-sorted ordinals for the sort terms and a map from document-IDs to entries
> in the sorted ordinals list. This results in very low memory overhead and
> faster sorted searches, at the cost of increased startup-time. As the
> ordinals can be resolved to terms after the sorting has been performed, this
> approach supports fillFields=true.
> This issue is related to https://issues.apache.org/jira/browse/LUCENE-2335
> which contains previous discussions on the subject.
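
The idea from the issue description can be illustrated with a small, self-contained Java sketch. It is not the proposed implementation: it uses plain arrays and a HashMap instead of packed structures, it does not touch any Lucene API, and all class and method names are hypothetical. It only shows the principle that collator.compare is called at build time, while searches compare integers and resolve ordinals to terms afterwards (the fillFields=true case).

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class OrdinalSortSketch {
    private final String[] sortedTerms; // unique terms in collator order
    private final int[] docToOrd;       // docToOrd[docId] = index into sortedTerms

    /** termByDoc[docId] is the sort term of that document (one term per document assumed). */
    public OrdinalSortSketch(String[] termByDoc, Collator collator) {
        // The expensive collator comparisons happen only here, at startup.
        sortedTerms = Arrays.stream(termByDoc).distinct().toArray(String[]::new);
        Arrays.sort(sortedTerms, collator);

        Map<String, Integer> ordOf = new HashMap<>();
        for (int ord = 0; ord < sortedTerms.length; ord++) {
            ordOf.put(sortedTerms[ord], ord);
        }
        docToOrd = new int[termByDoc.length];
        for (int doc = 0; doc < termByDoc.length; doc++) {
            docToOrd[doc] = ordOf.get(termByDoc[doc]);
        }
    }

    /** Returns the sort terms of the first n hits in collator order. */
    public String[] topTerms(int[] hits, int n) {
        return Arrays.stream(hits).boxed()
                // Search-time ordering compares ints, never Strings.
                .sorted((a, b) -> Integer.compare(docToOrd[a], docToOrd[b]))
                .limit(n)
                // Ordinals are resolved to terms only for the returned hits.
                .map(doc -> sortedTerms[docToOrd[doc]])
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String[] termByDoc = {"banana", "Zebra", "apple"};
        OrdinalSortSketch sketch =
                new OrdinalSortSketch(termByDoc, Collator.getInstance(Locale.ENGLISH));
        System.out.println(Arrays.toString(sketch.topTerms(new int[] {0, 1, 2}, 2)));
    }
}
```

A real implementation would keep only the top n hits in a priority queue rather than sorting all hits, and would store the ordinals in packed form to get the low memory overhead described above.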