[ 
https://issues.apache.org/jira/browse/LUCENE-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904642#action_12904642
 ] 

Toke Eskildsen commented on LUCENE-2369:
----------------------------------------

A status update might be in order. Switching to the current Lucene trunk with 
flex did require a lot of changes. Luckily it seems that they are all external 
so this could be a contrib.

The current implementation (no patch yet) seems to scale fairly well. Quick 
tests were made with a test index with 5 fields, one of which contained random 
Strings with an average length of 10 characters. The index was not optimized. 
The goal was to perform a Collator-based sorted search with fillFields=true 
(the terms used for sorting are returned along with the result) to get the 
top 20 out of a large number of hits. Search time was kept low by querying a 
field in which every other document had the same term, so that a single-term 
query matched half of the index. The hardware was a Dell M6500 with 
[email protected], PC1333 RAM, Intel X-25G2 SSD. The tests were run in the 
background while coding and ZIPping 6M files.

2M document index, search hits 1M documents:
 * Initial exposed search: 1:16 minutes
 * Subsequent exposed searches: 45 ms
 * Total heap usage for Lucene + exposed structure: 21 MB
 * Initial default Lucene search: 3.3 s
 * Subsequent default Lucene searches: 2.1 s
 * Total heap usage for Lucene + field cache: 54 MB

20M document index, search hits 10M documents:
 * Initial exposed search: 15:27 minutes
 * Subsequent exposed searches: 370 ms
 * Total heap usage for Lucene + exposed structure: 209 MB
 * Initial default Lucene search: 28 s
 * Subsequent default Lucene searches: 20 s
 * Total heap usage for Lucene + field cache: 530 MB

200M document index, search hits 100M documents:
 * Initial exposed search: 186:31 minutes
 * Subsequent exposed searches: 3.5 s
 * Total heap usage for Lucene + exposed structure: 2300 MB
 * No data for default Lucene search as there was OOM with 6 GB of heap.

Observations:
 * The memory requirement for the exposed structures is larger than the strict 
minimum. This is necessary in order to support fast re-opening of indexes (the 
order of the terms in unchanged segments is reused). An option to disable this 
cache seems like an obvious addition.
 * Startup time scales roughly as n * log n with the number of terms. 
Extrapolating from 200M down to 2M: 186 minutes / (200M * log(200M)) * (2M * 
log(2M)) ~= 1:25 min (observed: 1:16 min).
 * No tests with 100M documents yet, but 1½ hours for the build and 1.5 GB of 
RAM would be the expected requirement. A startup time of more than 1 hour is 
of course excessive, but if the calculated structures were made persistent 
(thereby reducing restart time to near zero), this would work well with a 
classic "update once every night" scenario. This would provide Collator-sorted 
search for a 100M document index on a machine with 2 GB of RAM.
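
The n * log n extrapolation in the observations can be reproduced with a few 
lines of Java. This is only a sketch of the arithmetic: the class and method 
names are made up for illustration, the document count is used as a stand-in 
for the term count, and the natural logarithm is assumed (the base cancels out 
in the ratio anyway).

```java
public class StartupEstimate {
    /**
     * Scales a measured build time to another index size, assuming the cost
     * grows as n * log n. Hypothetical helper, not part of any patch.
     */
    static double estimateMinutes(double measuredMinutes,
                                  double measuredTerms,
                                  double targetTerms) {
        double measuredCost = measuredTerms * Math.log(measuredTerms);
        double targetCost = targetTerms * Math.log(targetTerms);
        return measuredMinutes / measuredCost * targetCost;
    }

    public static void main(String[] args) {
        // 186:31 minutes was the observed initial search for the 200M index.
        double measured = 186.0 + 31.0 / 60.0;
        System.out.printf("2M:   %.2f min%n",
                estimateMinutes(measured, 200e6, 2e6));
        System.out.printf("100M: %.2f min%n",
                estimateMinutes(measured, 200e6, 100e6));
    }
}
```

The 2M estimate comes out at about 1.4 minutes (1:25) and the 100M estimate at 
about 90 minutes, in line with the figures quoted in the observations.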

> Locale-based sort by field with low memory overhead
> ---------------------------------------------------
>
>                 Key: LUCENE-2369
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2369
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Toke Eskildsen
>            Priority: Minor
>
> The current implementation of locale-based sort in Lucene uses the FieldCache 
> which keeps all sort terms in memory. Besides the huge memory overhead, 
> searching requires comparison of terms with collator.compare every time, 
> making searches with millions of hits fairly expensive.
> This proposed alternative implementation is to create a packed list of 
> pre-sorted ordinals for the sort terms and a map from document-IDs to entries 
> in the sorted ordinals list. This results in very low memory overhead and 
> faster sorted searches, at the cost of increased startup-time. As the 
> ordinals can be resolved to terms after the sorting has been performed, this 
> approach supports fillFields=true.
> This issue is related to https://issues.apache.org/jira/browse/LUCENE-2335 
> which contains previous discussions on the subject.
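
The idea in the issue description can be sketched in plain Java. This is not 
the actual implementation: the class and all its names are hypothetical, plain 
int[] arrays stand in for the packed structures, every document is assumed to 
have exactly one sort term, and no Lucene classes are used. It only shows that 
collator.compare is needed once, at build time, and that searches afterwards 
compare integers.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Locale;

public class OrdinalSortSketch {
    final String[] terms;       // unique sort terms (stand-in for the term dictionary)
    final int[] sortedOrdinals; // term ordinals arranged in Collator order
    final int[] docToPosition;  // docID -> position in sortedOrdinals

    /** docToTerm[doc] is the ordinal (index into terms) of the doc's sort term. */
    OrdinalSortSketch(String[] terms, int[] docToTerm, Collator collator) {
        this.terms = terms;
        Integer[] order = new Integer[terms.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // The only place where the (slow) Collator is consulted.
        Arrays.sort(order, (a, b) -> collator.compare(terms[a], terms[b]));

        sortedOrdinals = new int[terms.length];
        int[] positionOfTerm = new int[terms.length];
        for (int pos = 0; pos < order.length; pos++) {
            sortedOrdinals[pos] = order[pos];
            positionOfTerm[order[pos]] = pos;
        }
        docToPosition = new int[docToTerm.length];
        for (int doc = 0; doc < docToTerm.length; doc++) {
            docToPosition[doc] = positionOfTerm[docToTerm[doc]];
        }
    }

    /** Returns the first n hits in Collator order, using integer comparisons only. */
    int[] topDocs(int[] hits, int n) {
        return Arrays.stream(hits).boxed()
                .sorted(Comparator.comparingInt((Integer doc) -> docToPosition[doc]))
                .limit(n)
                .mapToInt(Integer::intValue)
                .toArray();
    }

    /** Resolves the sort term after sorting, as needed for fillFields=true. */
    String termFor(int doc) {
        return terms[sortedOrdinals[docToPosition[doc]]];
    }

    public static void main(String[] args) {
        String[] terms = {"cherry", "Banana", "apple"};
        int[] docToTerm = {0, 1, 2, 1}; // doc 0 -> cherry, doc 1 -> Banana, ...
        OrdinalSortSketch sketch = new OrdinalSortSketch(
                terms, docToTerm, Collator.getInstance(Locale.ENGLISH));
        for (int doc : sketch.topDocs(new int[] {0, 1, 2, 3}, 2)) {
            System.out.println(doc + " " + sketch.termFor(doc));
        }
    }
}
```

A real implementation would select the top entries with a priority queue 
rather than sorting all hits; the full sort here just keeps the sketch short.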

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

